1. 18 Aug, 2010 4 commits
    • fs: scale files_lock · 6416ccb7
      Nick Piggin committed
      Improve scalability of files_lock by adding per-cpu, per-sb files lists,
      protected with an lglock. The lglock provides fast access to the per-cpu lists
      to add and remove files. It also provides a snapshot of all the per-cpu lists
      (although this is very slow).
      
      One difficulty with this approach is that a file can be removed from the list
      by another CPU. We must track which per-cpu list the file is on with a new
      variable in the file struct (packed into a hole on 64-bit archs). Scalability
      could suffer if files are frequently removed from different cpu's list.
      
      However loads with frequent removal of files imply short interval between
      adding and removing the files, and the scheduler attempts to avoid moving
      processes too far away. Also, even in the case of cross-CPU removal, the
      hardware has much more opportunity to parallelise cacheline transfers with N
      cachelines than with 1.
      
      A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
      degenerates to contending on a single lock, which is no worse than before. When
      more than one CPU are allocating files, even if they are always freed by
      different CPUs, there will be more parallelism than the single-lock case.
      
      Testing results:
      
      On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
      to remove the file, the number of times it is removed by the same CPU that
      added it, and the number of times it is removed by the same node that added it.
      
      Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
      kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
      dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)
      
      So a file is removed from the same CPU it was added by over 90% of the time.
      It remains within the same node 95% of the time.
      
      Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
      
                      throughput
      2.6.34-rc2      24.5
      +patch          24.9
      
                      us      sys     idle    IO wait (in %)
      2.6.34-rc2      51.25   28.25   17.25   3.25
      +patch          53.75   18.5    19      8.75
      
      So significantly less CPU time spent in kernel code, higher idle time and
      slightly higher throughput.
      
      Single-threaded performance difference was within the noise of microbenchmarks.
      That is not to say a penalty does not exist: the code is larger and requires
      more memory accesses, so it will be slightly slower.
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
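The per-CPU list idea in the commit above can be sketched in userspace C. This is a minimal model, not the kernel code: `file_model`, `percpu_list`, `NR_CPUS` and the hit counters are hypothetical names, a C11 `atomic_flag` spinlock stands in for each CPU's share of the lglock, and the counters are only safe in a single-threaded demonstration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 4 /* assumption: small fixed CPU count for the sketch */

struct file_model {
    struct file_model *prev, *next; /* links within one per-CPU list */
    int list_cpu;                   /* which per-CPU list holds us, -1 if none */
};

struct percpu_list {
    atomic_flag lock;               /* stand-in for one CPU's part of the lglock */
    struct file_model head;         /* circular list sentinel */
};

static struct percpu_list lists[NR_CPUS];
static unsigned long cpu_hits, cpu_misses; /* demo statistics only */

static void lists_init(void)
{
    for (int i = 0; i < NR_CPUS; i++) {
        atomic_flag_clear(&lists[i].lock);
        lists[i].head.prev = lists[i].head.next = &lists[i].head;
        lists[i].head.list_cpu = i;
    }
    cpu_hits = cpu_misses = 0;
}

static void list_lock(int cpu)
{
    while (atomic_flag_test_and_set_explicit(&lists[cpu].lock,
                                             memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void list_unlock(int cpu)
{
    atomic_flag_clear_explicit(&lists[cpu].lock, memory_order_release);
}

/* Add the file to the list of the CPU we are running on. */
static void file_list_add(struct file_model *f, int this_cpu)
{
    list_lock(this_cpu);
    f->list_cpu = this_cpu;         /* remember where we were added */
    f->next = lists[this_cpu].head.next;
    f->prev = &lists[this_cpu].head;
    f->next->prev = f;
    lists[this_cpu].head.next = f;
    list_unlock(this_cpu);
}

/* Remove the file from whichever list holds it, possibly another CPU's. */
static void file_list_del(struct file_model *f, int this_cpu)
{
    int cpu = f->list_cpu;

    if (cpu < 0)
        return;                     /* not on any list */
    list_lock(cpu);                 /* lock the owning list, not our own */
    f->prev->next = f->next;
    f->next->prev = f->prev;
    f->prev = f->next = NULL;
    f->list_cpu = -1;
    list_unlock(cpu);
    if (cpu == this_cpu)
        cpu_hits++;
    else
        cpu_misses++;
}
```

Calling `file_list_add(&f, 0)` followed by `file_list_del(&f, 3)` takes only list 0's lock, which is the cross-CPU removal case the commit message measures as a "miss".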
    • tty: fix fu_list abuse · d996b62a
      Nick Piggin committed
      tty code abuses fu_list, which causes a bug in remount,ro handling.
      
      If a tty device node is opened on a filesystem, then the last link to the inode
      removed, the filesystem will be allowed to be remounted readonly. This is
      because fs_may_remount_ro does not find the 0 link tty inode on the file sb
      list (because the tty code incorrectly removed it to use for its own purpose).
      This can result in a filesystem with errors after it is marked "clean".
      
      Taking the idea from Christoph's initial patch, allocate a tty private struct
      at file->private_data and put our required list fields in there, linking
      file and tty. This makes tty nodes behave the same way as other device nodes,
      avoids meddling with the vfs, and avoids this bug.
      
      The error handling is not trivial in the tty code, so for this bugfix, I take
      the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
      This is not a problem because our allocator doesn't fail small allocs as a rule
      anyway. So proper error handling is left as an exercise for tty hackers.
      
      [ Arguably filesystem's device inode would ideally be divorced from the
      driver's pseudo inode when it is opened, but in practice it's not clear whether
      that will ever be worth implementing. ]
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: cleanup files_lock locking · ee2ffa0d
      Nick Piggin committed
      Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
      manipulate the per-sb files list; unexport the files_lock spinlock.
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • remove SWRITE* I/O types · 9cb569d6
      Christoph Hellwig committed
      These flags aren't real I/O types, but tell ll_rw_block to always
      lock the buffer instead of giving up on a failed trylock.
      
      Instead add a new write_dirty_buffer helper that implements this semantic
      and use it from the existing SWRITE* callers.  Note that the ll_rw_block
      code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
      this patch fixes.
      
      In the ufs code clean up the helper that used to call ll_rw_block
      to mirror sync_dirty_buffer, which is the function it implements for
      compound buffers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 14 Aug, 2010 2 commits
  3. 12 Aug, 2010 1 commit
  4. 11 Aug, 2010 1 commit
  5. 10 Aug, 2010 15 commits
    • mm: implement writeback livelock avoidance using page tagging · f446daae
      Jan Kara committed
      We try to avoid livelocks of writeback when someone steadily creates dirty
      pages in a mapping we are writing out.  For memory-cleaning writeback,
      using nr_to_write works reasonably well but we cannot really use it for
      data integrity writeback.  This patch tries to solve the problem.
      
      The idea is simple: Tag all pages that should be written back with a
      special tag (TOWRITE) in the radix tree.  This can be done rather quickly
      and thus livelocks should not happen in practice.  Then we start doing the
      hard work of locking pages and sending them to disk only for those pages
      that have TOWRITE tag set.
      
      Note: Adding new radix tree tag grows radix tree node from 288 to 296
      bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
      However, the number of slab/slub items per page remains the same (13 and 7
      respectively).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
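The two-pass tagging scheme described in the commit above can be illustrated with a small userspace model. This is only a sketch under simplifying assumptions: a flat array stands in for the radix tree, and `page_model`, `tag_pages_for_writeback_model` and `write_tagged_pages` are hypothetical names rather than the kernel's functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model {
    bool dirty;   /* page has data that is not yet on disk */
    bool towrite; /* page was dirty when this writeback pass started */
};

/* Pass 1: cheaply tag every page that is dirty right now. */
static size_t tag_pages_for_writeback_model(struct page_model *pages, size_t n)
{
    size_t tagged = 0;

    for (size_t i = 0; i < n; i++) {
        if (pages[i].dirty) {
            pages[i].towrite = true;
            tagged++;
        }
    }
    return tagged;
}

/*
 * Pass 2: do the expensive work only for tagged pages. Pages dirtied after
 * pass 1 carry no tag, so a writer that keeps dirtying pages cannot extend
 * this pass indefinitely.
 */
static size_t write_tagged_pages(struct page_model *pages, size_t n)
{
    size_t written = 0;

    for (size_t i = 0; i < n; i++) {
        if (!pages[i].towrite)
            continue;
        pages[i].towrite = false;
        if (pages[i].dirty) {
            pages[i].dirty = false; /* stands in for locking and submitting I/O */
            written++;
        }
    }
    return written;
}
```

A page dirtied between the two passes stays dirty and is left for the next writeback call, which is what bounds the work of a data-integrity sync.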
    • Fix sget() race with failing mount · 7a4dec53
      Al Viro committed
      If sget() finds a matching superblock being set up, it'll
      grab an active reference to it and grab s_umount.  That's
      fine - we'll wait for completion of foofs_get_sb() that way.
      However, if said foofs_get_sb() fails we'll end up holding
      the halfway-created superblock.  deactivate_locked_super()
      called by foofs_get_sb() will just unlock the sucker since
      we are holding another active reference to it.
      
      What we need is a way to tell if superblock has been successfully
      set up.  Unfortunately, neither ->s_root nor the check for
      MS_ACTIVE quite fit.  Cheap and easy way, suitable for backport:
      new flag set by the (only) caller of ->get_sb().  If that flag
      isn't present by the time sget() grabbed s_umount on preexisting
      superblock it has found, it's seeing a stillborn and should
      just bury it with deactivate_locked_super() (and repeat the search).
      
      Longer term we want to set that flag in ->get_sb() instances (and
      check for it to distinguish between "sget() found us a live sb"
      and "sget() has allocated an sb, we need to set it up" in there,
      instead of checking ->s_root as we do now).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org
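The "stillborn superblock" check from the commit above can be modelled roughly as follows. This is a single-threaded sketch with hypothetical names (`sb_model`, `born`, `sget_model`); it assumes that any matching superblock found without the flag is one whose setup has already failed, whereas the kernel only knows that after it has waited on s_umount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SB 8 /* assumption: tiny fixed table instead of a list */

struct sb_model {
    bool in_use;
    bool born;   /* set by the caller once setup succeeded */
    int key;     /* stands in for whatever the test callback compares */
    int active;  /* active reference count */
};

static struct sb_model sb_table[MAX_SB];

static struct sb_model *sget_model(int key)
{
retry:
    for (size_t i = 0; i < MAX_SB; i++) {
        struct sb_model *s = &sb_table[i];

        if (!s->in_use || s->key != key)
            continue;
        s->active++; /* grab an active reference */
        if (!s->born) {
            /* Stillborn: drop our reference, bury it and search again. */
            s->active--;
            s->in_use = false;
            goto retry;
        }
        return s; /* live superblock, shared with the earlier mount */
    }
    /* Nothing usable found: allocate a fresh, not yet born, superblock. */
    for (size_t i = 0; i < MAX_SB; i++) {
        if (!sb_table[i].in_use) {
            sb_table[i] = (struct sb_model){
                .in_use = true, .key = key, .active = 1,
            };
            return &sb_table[i];
        }
    }
    return NULL;
}

/* Called by the (single) setup path after the filesystem was set up. */
static void sb_mark_born(struct sb_model *s)
{
    s->born = true;
}
```

A second `sget_model()` call for a key whose first setup never reached `sb_mark_born()` gets a fresh entry instead of the halfway-created one.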
    • pass a struct path to vfs_statfs · ebabe9a9
      Christoph Hellwig committed
      We'll need the path to implement the flags field for statvfs support.
      We do have it available in all callers except:
      
       - ecryptfs_statfs.  This one doesn't actually need vfs_statfs but just
         needs to call the lower filesystem's statfs method.
       - sys_ustat.  Add a non-exported statfs_by_dentry helper for it which
         won't be able to fill out the flags field later on.
      
      In addition rename the helpers for statfs vs fstatfs to do_*statfs instead
      of the misleading vfs prefix.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • convert remaining ->clear_inode() to ->evict_inode() · b57922d9
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Make ->drop_inode() just return whether inode needs to be dropped · 45321ac5
      Al Viro committed
      ... and let iput_final() do the actual eviction or retention
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
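The split of responsibilities in the commit above, where the method only answers "drop or keep" and the caller acts on the answer, might look like this in a userspace model. All names ending in `_model` are hypothetical, and the default policy here looks only at the link count, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct inode_model;

struct super_ops_model {
    /* Returns nonzero if the inode should be dropped, zero to keep it. */
    int (*drop_inode)(struct inode_model *inode);
};

struct inode_model {
    unsigned int nlink;
    bool evicted;  /* set when the final put evicted the inode */
    bool retained; /* set when the final put kept it cached */
    const struct super_ops_model *ops;
};

/* Simplified default policy: drop once no links remain. */
static int default_drop_model(struct inode_model *inode)
{
    return inode->nlink == 0;
}

/* Policy for a filesystem that never caches unused inodes. */
static int always_drop_model(struct inode_model *inode)
{
    (void)inode;
    return 1;
}

/* The caller owns the actual eviction or retention. */
static void iput_final_model(struct inode_model *inode)
{
    int drop;

    if (inode->ops && inode->ops->drop_inode)
        drop = inode->ops->drop_inode(inode);
    else
        drop = default_drop_model(inode);

    if (drop)
        inode->evicted = true;
    else
        inode->retained = true;
}
```

Keeping the decision separate from the action means a filesystem can change its caching policy without reimplementing the eviction path.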
    • fs/inode.c:clear_inode() is gone · 30140837
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • ->delete_inode() is gone · 07958f9f
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • new helper: end_writeback() · b0683aa6
      Al Viro committed
      Essentially, the minimal variant of ->evict_inode().  It's
      a trimmed-down clear_inode(), sans any fs callbacks.  Once
      it returns we know that no async writeback will be happening;
      every ->evict_inode() instance should do that once and do that
      before doing anything ->write_inode() could interfere with
      (e.g. freeing the on-disk inode).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • generic_detach_inode() can be static now · c6287315
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • New method - evict_inode() · be7ce416
      Al Viro committed
      Hybrid of ->clear_inode() and ->delete_inode(); if present, does
      all fs work to be done when in-core inode is about to be gone,
      for whatever reason.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • simplify checks for I_CLEAR/I_FREEING · a4ffdde6
      Al Viro committed
      add I_CLEAR instead of replacing I_FREEING with it.  I_CLEAR is
      equivalent to I_FREEING for almost all code looking at either;
      it's there to keep track of having called clear_inode() exactly
      once per inode lifetime, at some point after having set I_FREEING.
      I_CLEAR and I_FREEING never get set at the same time with the
      current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
      instead of I_CLEAR without loss of information.  As the result of
      such change, checks become simpler and the amount of code that needs
      to know about I_CLEAR shrinks a lot.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
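Why the checks get simpler can be seen in a few lines of C. The flag values and helper names below are hypothetical and chosen only for illustration; the point is that once the cleared state keeps the freeing bit set, callers need to test one bit instead of two.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the kernel's actual values are not shown here. */
#define I_FREEING_MODEL (1u << 0)
#define I_CLEAR_MODEL   (1u << 1)

/* Old scheme: clearing replaces I_FREEING with I_CLEAR. */
static unsigned int mark_cleared_old(unsigned int state)
{
    (void)state;
    return I_CLEAR_MODEL;
}

/* New scheme: I_CLEAR is added on top of I_FREEING. */
static unsigned int mark_cleared_new(unsigned int state)
{
    (void)state;
    return I_FREEING_MODEL | I_CLEAR_MODEL;
}

/* Old check: callers have to know about both flags. */
static bool going_away_old(unsigned int state)
{
    return (state & (I_FREEING_MODEL | I_CLEAR_MODEL)) != 0;
}

/* New check: I_FREEING alone covers both phases. */
static bool going_away_new(unsigned int state)
{
    return (state & I_FREEING_MODEL) != 0;
}
```

Code that still needs to distinguish "cleared" from "merely freeing" can test the clear bit, which is the only remaining place the second flag matters.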
    • check ATTR_SIZE constraints in inode_change_ok · 2c27c65e
      Christoph Hellwig committed
      Make sure we check the truncate constraints early on in ->setattr by adding
      those checks to inode_change_ok.  Also clean up and document inode_change_ok
      to make this obvious.
      
      As a fallout we don't have to call inode_newsize_ok from simple_setsize and
      simplify it down to a truncate_setsize which doesn't return an error.  This
      simplifies a lot of setattr implementations and means we use truncate_setsize
      almost everywhere.  Get rid of fat_setsize now that it's trivial and mark
      ext2_setsize static to make the calling convention obvious.
      
      Keep the inode_newsize_ok in vmtruncate for now as all callers need an
      audit for its removal anyway.
      
      Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
      needs a deeper audit, but that is left for later.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • remove inode_setattr · 1025774c
      Christoph Hellwig committed
      Replace inode_setattr with opencoded variants of it in all callers.  This
      moves the remaining call to vmtruncate into the filesystem methods where it
      can be replaced with the proper truncate sequence.
      
      In a few cases it was obvious that we would never end up calling vmtruncate
      so it was left out in the opencoded variant:
      
       spufs: explicitly checks for ATTR_SIZE earlier
       btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
       ufs: contains an opencoded simple_setattr + truncate that sets the filesize just above
      
      In addition to that ncpfs called inode_setattr with handcrafted iattrs,
      which allowed the opencoded variant to be trimmed down.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • rename generic_setattr · 6a1a90ad
      Christoph Hellwig committed
      Despite its name it's not a generic implementation of ->setattr, but
      rather a helper to copy attributes from a struct iattr to the inode.
      Rename it to setattr_copy to reflect this fact.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
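What "a helper to copy attributes from a struct iattr to the inode" means can be shown with a reduced model. The structures, mask bits and the `setattr_copy_model` name below are hypothetical simplifications; the real helper handles more fields and permission details.

```c
#include <assert.h>

/* Hypothetical validity bits saying which fields of the request are set. */
#define ATTR_MODE_MODEL  (1u << 0)
#define ATTR_UID_MODEL   (1u << 1)
#define ATTR_GID_MODEL   (1u << 2)
#define ATTR_MTIME_MODEL (1u << 3)

struct iattr_model {
    unsigned int ia_valid; /* mask of the fields below that are meaningful */
    unsigned int ia_mode;
    unsigned int ia_uid;
    unsigned int ia_gid;
    long ia_mtime;
};

struct inode_attr_model {
    unsigned int mode;
    unsigned int uid;
    unsigned int gid;
    long mtime;
};

/*
 * Pure copy helper: no permission checks and no size handling, so it is
 * not a complete ->setattr implementation on its own.
 */
static void setattr_copy_model(struct inode_attr_model *inode,
                               const struct iattr_model *attr)
{
    if (attr->ia_valid & ATTR_UID_MODEL)
        inode->uid = attr->ia_uid;
    if (attr->ia_valid & ATTR_GID_MODEL)
        inode->gid = attr->ia_gid;
    if (attr->ia_valid & ATTR_MTIME_MODEL)
        inode->mtime = attr->ia_mtime;
    if (attr->ia_valid & ATTR_MODE_MODEL)
        inode->mode = attr->ia_mode;
}
```

A filesystem's own setattr method would validate the request first and call such a helper as one of its last steps.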
    • sort out blockdev_direct_IO variants · eafdc7d1
      Christoph Hellwig committed
      Move the call to vmtruncate to get rid of excessive blocks to the callers
      in preparation of the new truncate calling sequence.  This was only done
      for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
      was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
      its _newtrunc variant while at it as just opencoding the two additional
      parameters is shorter than the name suffix.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  6. 08 Aug, 2010 4 commits
    • bio, fs: separate out bio_types.h and define READ/WRITE constants in terms of BIO_RW_* flags · 7cc01581
      Tejun Heo committed
      linux/fs.h hard coded READ/WRITE constants which should match BIO_RW_*
      flags.  This is fragile and caused breakage during BIO_RW_* flag
      rearrangement.  The hardcoding is to avoid include dependency hell.
      
      Create linux/bio_types.h which contains definitions for bio data
      structures and flags and include it from bio.h and fs.h, and make fs.h
      define all READ/WRITE related constants in terms of BIO_RW_* flags.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • bio, fs: update RWA_MASK, READA and SWRITE to match the corresponding BIO_RW_* bits · aca27ba9
      Tejun Heo committed
      Commit a82afdfc (block: use the same failfast bits for bio and request)
      moved BIO_RW_* bits around such that they match up with REQ_* bits.
      Unfortunately, fs.h hard coded RW_MASK, RWA_MASK, READ, WRITE, READA
      and SWRITE as 0, 1, 2 and 3, and expected them to match with BIO_RW_*
      bits.  READ/WRITE didn't change but BIO_RW_AHEAD was moved to bit 4
      instead of bit 1, breaking RWA_MASK, READA and SWRITE.
      
      This patch updates RWA_MASK, READA and SWRITE such that they match the
      BIO_RW_* bits again.  A follow up patch will update the definitions to
      directly use BIO_RW_* bits so that this kind of breakage won't happen
      again.
      
      Neil also spotted missing RWA_MASK conversion.
      
      Stable: The offending commit a82afdfc was released with v2.6.32, so
      this patch should be applied to all kernels since then but it must
      _NOT_ be applied to kernels earlier than that.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-bisected-by: Vladislav Bolkhovitin <vst@vlnb.net>
      Root-caused-by: Neil Brown <neilb@suse.de>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
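The breakage described in the commit above comes down to simple bit arithmetic. The sketch below uses only the positions stated in the message (read-ahead moved from bit 1 to bit 4); the macro and function names are hypothetical, not the kernel's definitions.

```c
#include <assert.h>

/* Bit positions as stated in the commit message; names are illustrative. */
enum {
    RW_AHEAD_BIT_OLD = 1, /* where the hard-coded constants assumed it was */
    RW_AHEAD_BIT_NEW = 4, /* where the flag lives after the rearrangement */
};

/* Hard-coded mask that silently went stale. */
#define RWA_MASK_HARDCODED 2u

/* Mask derived from the bit position, which follows any later move. */
#define RWA_MASK_DERIVED (1u << RW_AHEAD_BIT_NEW)

/* Flags of a read-ahead request as the block layer would now set them. */
static unsigned int make_readahead_flags(void)
{
    return 1u << RW_AHEAD_BIT_NEW;
}

static int is_readahead(unsigned int rw, unsigned int rwa_mask)
{
    return (rw & rwa_mask) != 0;
}
```

With the stale mask, `is_readahead(make_readahead_flags(), RWA_MASK_HARDCODED)` is false even though the request is a read-ahead, which is the kind of mismatch the patch repairs and the follow-up patch prevents by deriving the constants.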
    • block: unify flags for struct bio and struct request · 7b6d91da
      Christoph Hellwig committed
      Remove the current bio flags and reuse the request flags for the bio, too.
      This makes it easier to trace the type of I/O from the filesystem
      down to the block driver.  There were two flags in the bio that were
      missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
      renamed two request flags that had a superfluous RW in them.
      
      Note that the flags are in bio.h despite having the REQ_ name - as
      blkdev.h includes bio.h that is the only way to go for now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: BARRIER request should imply SYNC · 41f2df62
      Christoph Hellwig committed
      A barrier request should by definition have priority in get_request
      and let the queue be unplugged immediately as it's blocking all forward
      progress due to the queue draining.
      
      Most filesystems already get this implicitly by the way submit_bh
      treats the buffer_ordered flag, and gfs2 sets it explicitly.  But btrfs
      and XFS are still forgetting to set the flag, as is blkdev_issue_flush
      and some places in DM/MD.
      
      For XFS on metadata heavy workloads this gives a consistent speedup
      in the 2-3% range.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  7. 02 Aug, 2010 1 commit
    • vfs: re-introduce MAY_CHDIR · 9cfcac81
      Eric Paris committed
      Currently MAY_ACCESS means that filesystems must check the permissions
      right then and not rely on cached results or the results of future
      operations on the object.  This can be because of a call to sys_access() or
      because of a call to chdir() which needs to check search without relying on
      any future operations inside that dir.  I plan to use MAY_ACCESS for other
      purposes in the security system, so I split the MAY_ACCESS and the
      MAY_CHDIR cases.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Stephen D. Smalley <sds@tycho.nsa.gov>
      Signed-off-by: James Morris <jmorris@namei.org>
  8. 28 Jul, 2010 5 commits
  9. 27 Jul, 2010 2 commits
  10. 07 Jul, 2010 1 commit
    • VFS: introduce s_dirty accessors · 140236b4
      Artem Bityutskiy committed
      This patch introduces 3 VFS accessors: 'sb_mark_dirty()',
      'sb_mark_clean()', and 'sb_is_dirty()'. They simply
      set 'sb->s_dirt' or test 'sb->s_dirt'. The plan is to make
      every FS use these accessors later instead of manipulating
      the 'sb->s_dirt' flag directly.
      
      Ultimately, this change is a preparation for the periodic
      superblock synchronization optimization which is about
      preventing the "sync_supers" kernel thread from waking up
      even if there is nothing to synchronize.
      
      This patch does not do any functional change, just adds
      accessor functions.
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
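The three accessors named in the commit above are thin wrappers around one field. A minimal userspace sketch, assuming a reduced `super_block_model` structure that carries only the `s_dirt` flag (the structure name is hypothetical, the accessor names are the ones the message lists):

```c
#include <assert.h>

/* Reduced stand-in for the superblock; only the dirty flag is modelled. */
struct super_block_model {
    int s_dirt;
};

static inline void sb_mark_dirty(struct super_block_model *sb)
{
    sb->s_dirt = 1;
}

static inline void sb_mark_clean(struct super_block_model *sb)
{
    sb->s_dirt = 0;
}

static inline int sb_is_dirty(const struct super_block_model *sb)
{
    return sb->s_dirt;
}
```

Routing every access through such helpers gives one place to hook in later, for example to decide whether a periodic sync thread needs to wake up at all.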
  11. 05 Jun, 2010 1 commit
  12. 28 May, 2010 3 commits
    • fs: introduce new truncate sequence · 7bb46a67
      npiggin@suse.de committed
      Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
      setattr > vmtruncate > truncate, have filesystems call their truncate sequence
      from ->setattr if filesystem specific operations are required. vmtruncate is
      deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
      previously should be used.
      
      simple_setattr is introduced for simple in-ram filesystems to implement
      the new truncate sequence. Eventually all filesystems should be converted
      to implement a setattr, and the default code in notify_change should go
      away.
      
      simple_setsize is also introduced to perform just the ATTR_SIZE portion
      of simple_setattr (ie. changing i_size and trimming pagecache).
      
      To implement the new truncate sequence:
      - filesystem specific manipulations (eg freeing blocks) must be done in
        the setattr method rather than ->truncate.
      - vmtruncate can not be used by core code to trim blocks past i_size in
        the event of write failure after allocation, so this must be performed
        in the fs code.
      - convert usage of helpers block_write_begin, nobh_write_begin,
        cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
        variants. These avoid calling vmtruncate to trim blocks (see previous).
      - inode_setattr should not be used. generic_setattr is a new function
        to be used to copy simple attributes into the generic inode.
      - make use of the better opportunity to handle errors with the new sequence.
      
      Big problem with the previous calling sequence: the filesystem is not called
      until i_size has already changed.  This means it is not allowed to fail the
      call, and also it does not know what the previous i_size was. Also, generic
      code calling vmtruncate to truncate allocated blocks in case of error had
      no good way to return a meaningful error (or, for example, atomically handle
      block deallocation).
      
      Cc: Christoph Hellwig <hch@lst.de>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
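The ordering that the new truncate sequence makes possible can be sketched as follows: validate first, remember the old size, change the size, then do the filesystem-specific work. This is a userspace model with hypothetical names (`trunc_inode_model`, `newsize_ok_model`, `fs_setattr_size_model`) and an assumed 4096-byte block size; it is not the code of any particular filesystem.

```c
#include <assert.h>
#include <errno.h>

#define BLOCK_SIZE_MODEL 4096LL /* assumption for the sketch */

struct trunc_inode_model {
    long long i_size;   /* current file size in bytes */
    long long max_size; /* largest size the filesystem accepts */
    long long blocks;   /* allocated blocks */
};

static long long blocks_for(long long size)
{
    return (size + BLOCK_SIZE_MODEL - 1) / BLOCK_SIZE_MODEL;
}

/* Size check that runs before anything is modified. */
static int newsize_ok_model(const struct trunc_inode_model *inode,
                            long long newsize)
{
    if (newsize < 0)
        return -EINVAL;
    if (newsize > inode->max_size)
        return -EFBIG;
    return 0;
}

/* ATTR_SIZE part of a filesystem's own setattr method. */
static int fs_setattr_size_model(struct trunc_inode_model *inode,
                                 long long newsize)
{
    long long oldsize;
    int err = newsize_ok_model(inode, newsize);

    if (err)
        return err;             /* can still fail: i_size is untouched */

    oldsize = inode->i_size;    /* the filesystem knows the previous size */
    inode->i_size = newsize;    /* stands in for trimming the page cache */
    if (newsize < oldsize)
        inode->blocks = blocks_for(newsize); /* fs-specific block freeing */
    return 0;
}
```

Because the filesystem is entered before the size changes, it can reject the request with a meaningful error, which the old setattr > vmtruncate > truncate ordering did not allow.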
    • rename the generic fsync implementations · 1b061d92
      Christoph Hellwig committed
      We don't name our generic fsync implementations very well currently.
      The no-op implementation for in-memory filesystems currently is called
      simple_sync_file, which doesn't make too much sense to start with, and
      the generic one for simple filesystems is called simple_fsync,
      which can lead to some confusion.
      
      This patch renames the generic file fsync method to generic_file_fsync
      to match the other generic_file_* routines it is supposed to be used
      with, and the no-op implementation to noop_fsync to make it obvious
      what to expect.  In addition add some documentation for both methods.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • drop unused dentry argument to ->fsync · 7ea80859
      Christoph Hellwig committed
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>