1. 29 Oct, 2010 5 commits
  2. 26 Oct, 2010 1 commit
    • A
      split invalidate_inodes() · 63997e98
      Al Viro authored
      Pull removal of fsnotify marks into generic_shutdown_super().
      Split umount-time work into a new function - evict_inodes().
      Make sure that invalidate_inodes() will be able to cope with
      I_FREEING once we change locking in iput().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      63997e98
  3. 18 Aug, 2010 1 commit
    • N
      fs: scale files_lock · 6416ccb7
      Nick Piggin authored
      fs: scale files_lock
      
      Improve scalability of files_lock by adding per-cpu, per-sb files lists,
      protected with an lglock. The lglock provides fast access to the per-cpu lists
      to add and remove files. It also provides a snapshot of all the per-cpu lists
      (although this is very slow).
      
      One difficulty with this approach is that a file can be removed from the list
      by another CPU. We must track which per-cpu list the file is on with a new
      variable in the file struct (packed into a hole on 64-bit archs). Scalability
      could suffer if files are frequently removed from different cpu's list.
      
      However loads with frequent removal of files imply short interval between
      adding and removing the files, and the scheduler attempts to avoid moving
      processes too far away. Also, even in the case of cross-CPU removal, the
      hardware has much more opportunity to parallelise cacheline transfers with N
      cachelines than with 1.
      
      A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
      degenerates to contending on a single lock, which is no worse than before. When
      more than one CPU is allocating files, even if they are always freed by
      different CPUs, there will be more parallelism than the single-lock case.
      
      Testing results:
      
      On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
      to remove the file, the number of times it is removed by the same CPU that
      added it, and the number of times it is removed by the same node that added it.
      
      Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
      kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
      dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)
      
      So a file is removed from the same CPU it was added by over 90% of the time.
      It remains within the same node 95% of the time.
      
      Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
      
                      throughput
      2.6.34-rc2      24.5
      +patch          24.9
      
                      us      sys     idle    IO wait (in %)
      2.6.34-rc2      51.25   28.25   17.25   3.25
      +patch          53.75   18.5    19      8.75
      
      So significantly less CPU time spent in kernel code, higher idle time and
      slightly higher throughput.
      
      Single threaded performance difference was within the noise of microbenchmarks.
      That is not to say a penalty does not exist: the code is larger and more memory
      accesses are required, so it will be slightly slower.
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6416ccb7
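The per-CPU list scheme described in the commit message above can be modelled in user space. What follows is a minimal Python sketch under stated assumptions, not the kernel implementation: `File`, `PerCpuFileLists`, and `list_cpu` are hypothetical names, and ordinary `threading.Lock` objects stand in for the lglock.

```python
import threading


class File:
    """Hypothetical stand-in for a file object tracked on a per-CPU list."""

    def __init__(self, name):
        self.name = name
        self.list_cpu = None  # which per-CPU list currently holds this file


class PerCpuFileLists:
    """One list and one lock per CPU instead of a single global lock."""

    def __init__(self, ncpus):
        self.locks = [threading.Lock() for _ in range(ncpus)]
        self.lists = [set() for _ in range(ncpus)]

    def add(self, f, cpu):
        # Fast path: only the adding CPU's lock is taken.
        with self.locks[cpu]:
            self.lists[cpu].add(f)
            f.list_cpu = cpu

    def remove(self, f):
        # The remover may be running on a different CPU, so it must look up
        # which list the file was added to and take that list's lock.
        cpu = f.list_cpu
        with self.locks[cpu]:
            self.lists[cpu].discard(f)
            f.list_cpu = None

    def snapshot(self):
        # Slow path: take every per-CPU lock to get a consistent view.
        for lock in self.locks:
            lock.acquire()
        try:
            return [f for per_cpu in self.lists for f in per_cpu]
        finally:
            for lock in reversed(self.locks):
                lock.release()
```

The field `list_cpu` plays the role of the new variable in the file struct mentioned in the message: without it, a cross-CPU removal would not know which lock to take.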
  4. 10 Aug, 2010 3 commits
    • A
      no need for list_for_each_entry_safe()/resetting with superblock list · dca33252
      Al Viro authored
      just delay __put_super() a bit
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      dca33252
    • A
      Fix sget() race with failing mount · 7a4dec53
      Al Viro authored
      If sget() finds a matching superblock being set up, it'll
      grab an active reference to it and grab s_umount.  That's
      fine - we'll wait for completion of foofs_get_sb() that way.
      However, if said foofs_get_sb() fails we'll end up holding
      the halfway-created superblock.  deactivate_locked_super()
      called by foofs_get_sb() will just unlock the sucker since
      we are holding another active reference to it.
      
      What we need is a way to tell if superblock has been successfully
      set up.  Unfortunately, neither ->s_root nor the check for
      MS_ACTIVE quite fit.  Cheap and easy way, suitable for backport:
      new flag set by the (only) caller of ->get_sb().  If that flag
      isn't present by the time sget() grabbed s_umount on preexisting
      superblock it has found, it's seeing a stillborn and should
      just bury it with deactivate_locked_super() (and repeat the search).
      
      Longer term we want to set that flag in ->get_sb() instances (and
      check for it to distinguish between "sget() found us a live sb"
      and "sget() has allocated an sb, we need to set it up" in there,
      instead of checking ->s_root as we do now).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org
      7a4dec53
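The stillborn-superblock check described in the commit message above can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not the kernel code: `SuperBlockModel`, `sget_model`, `deactivate`, and the `while_waiting` hook (which stands in for blocking on s_umount) are hypothetical names, and `born` represents the new flag set by the caller of ->get_sb().

```python
class SuperBlockModel:
    """Hypothetical stand-in for a superblock."""

    def __init__(self, key):
        self.key = key
        self.born = False  # set only after setup has completed successfully
        self.active = 1    # active reference count held by the creator


def deactivate(supers, sb):
    """Drop one active reference; destroy the superblock at zero."""
    sb.active -= 1
    if sb.active == 0:
        del supers[sb.key]


def sget_model(supers, key, while_waiting=lambda: None):
    """Return (superblock, created). Stillborn superblocks are buried."""
    while True:
        sb = supers.get(key)
        if sb is None:
            sb = SuperBlockModel(key)
            supers[key] = sb
            return sb, True
        sb.active += 1       # grab an active reference to the match
        while_waiting()      # setup by the other mounter finishes or fails here
        if sb.born:
            return sb, False
        deactivate(supers, sb)  # setup failed: bury it and repeat the search
```

In the failing case the first mounter's reference drop leaves the half-built superblock alive because the searcher still holds a reference; the `born` check is what lets the searcher notice and retry instead of returning it.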
    • T
      vfs: don't hold s_umount over close_bdev_exclusive() call · 4f331f01
      Tejun Heo authored
      Fix an obscure AB-BA deadlock in get_sb_bdev().
      
      When a superblock is mounted more than once get_sb_bdev() calls
      close_bdev_exclusive() to drop the extra bdev reference while holding
      s_umount.  However, sb->s_umount nests inside bd_mutex during
      __invalidate_device() and close_bdev_exclusive() acquires bd_mutex during
      blkdev_put(); thus creating an AB-BA deadlock.
      
      This condition doesn't trigger frequently.  For this condition to be
      visible to lockdep, the filesystem must occupy the whole device (as
      __invalidate_device() only grabs bd_mutex for the whole device), the FS
      must be mounted more than once and partition rescan should be issued while
      the FS is still mounted.
      
      Fix it by dropping s_umount over close_bdev_exclusive().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Ciprian Docan <docan@eden.rutgers.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4f331f01
  5. 30 Jun, 2010 1 commit
    • N
      fs: fix superblock iteration race · 57439f87
      npiggin@suse.de authored
      list_for_each_entry_safe is not suitable to protect against concurrent
      modification of the list. 6754af64 introduced a race in sb walking.
      
      list_for_each_entry can use the trick of pinning the current entry in
      the list before we drop and retake the lock because it subsequently
      follows cur->next. However list_for_each_entry_safe saves n=cur->next
      for following before entering the loop body, so when the lock is
      dropped, n may be deleted.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: John Stultz <johnstul@us.ibm.com>
      Cc: Frank Mayhar <fmayhar@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57439f87
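The difference between the two iteration styles described in the commit message above can be shown with a toy singly linked list. This is a Python model under stated assumptions, not the kernel macros: `Node`, `build`, `unlink_after`, `walk_plain`, and `walk_safe` are hypothetical names, and the `body` callback represents the window in which the lock is dropped and another task may modify the list.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.next = None
        self.deleted = False


def build(names):
    """Build head -> names[0] -> names[1] -> ... and return the head."""
    head = Node("head")
    cur = head
    for name in names:
        cur.next = Node(name)
        cur = cur.next
    return head


def unlink_after(node):
    """Remove the node following `node`, as a concurrent task might."""
    victim = node.next
    node.next = victim.next
    victim.deleted = True
    return victim


def walk_plain(head, body):
    """Like list_for_each_entry: follow cur.next only after the body ran."""
    visited = []
    cur = head.next
    while cur is not None:
        visited.append(cur)
        body(cur)        # lock dropped and retaken here; cur itself is pinned
        cur = cur.next   # read after the lock is held again
    return visited


def walk_safe(head, body):
    """Like list_for_each_entry_safe: the next pointer is saved up front."""
    visited = []
    cur = head.next
    while cur is not None:
        nxt = cur.next   # saved before the body runs
        visited.append(cur)
        body(cur)        # nxt may be unlinked while the lock is dropped
        cur = nxt        # possibly a node that is no longer on the list
    return visited
```

If the body unlinks the successor of the current node, `walk_plain` simply skips it, while `walk_safe` goes on to visit the removed node. The "safe" variant only protects against the loop body deleting the current entry, not against concurrent deletion of the saved successor.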
  6. 28 May, 2010 1 commit
  7. 24 May, 2010 3 commits
    • C
      quota: explicitly set ->dq_op and ->s_qcop · 123e9caf
      Christoph Hellwig authored
      Only set the quota operation vectors if the filesystem actually supports
      quota instead of doing it for all filesystems in alloc_super().
      
      [Jan Kara: Export dquot_operations and vfs_quotactl_ops]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      123e9caf
    • C
      quota: move unmount handling into the filesystem · e0ccfd95
      Christoph Hellwig authored
      Currently the VFS calls into the quotactl interface for unmounting
      filesystems.  This means filesystems with their own quota handling
      can't easily distinguish between user-space originating quotaoff
      and an unmount.  Instead move the responsibility of the unmount handling
      into the filesystem to be consistent with all other dquot handling.
      
      Note that we do call dquot_disable a lot later now, e.g. after
      a sync_filesystem.  But this is fine as the quota code does all its
      writes via blockdev's mapping and that is synced even later.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      e0ccfd95
    • C
      quota: move remount handling into the filesystem · c79d967d
      Christoph Hellwig authored
      Currently do_remount_sb calls into the dquot code to tell it about going
      from rw to ro and ro to rw.  Move this code into the filesystem to
      not depend on the dquot code in the VFS - note ocfs2 already ignores
      these calls and handles remount by itself.  This gets rid of overloading
      the quotactl calls and allows to unify the VFS and XFS codepaths in
      that area later.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      c79d967d
  8. 22 May, 2010 18 commits
    • R
      vfs: add lockdep annotation to s_vfs_rename_key for ecryptfs · 51ee049e
      Roland Dreier authored
       >  =============================================
       >  [ INFO: possible recursive locking detected ]
       >  2.6.31-2-generic #14~rbd3
       >  ---------------------------------------------
       >  firefox-3.5/4162 is trying to acquire lock:
       >   (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >
       >  but task is already holding lock:
       >   (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >
       >  other info that might help us debug this:
       >  3 locks held by firefox-3.5/4162:
       >   #0:  (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >   #1:  (&sb->s_type->i_mutex_key#11/1){+.+.+.}, at: [<ffffffff81139d5a>] lock_rename+0x6a/0xf0
       >   #2:  (&sb->s_type->i_mutex_key#11/2){+.+.+.}, at: [<ffffffff81139d6f>] lock_rename+0x7f/0xf0
       >
       >  stack backtrace:
       >  Pid: 4162, comm: firefox-3.5 Tainted: G         C 2.6.31-2-generic #14~rbd3
       >  Call Trace:
       >   [<ffffffff8108ae74>] print_deadlock_bug+0xf4/0x100
       >   [<ffffffff8108ce26>] validate_chain+0x4c6/0x750
       >   [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
       >   [<ffffffff8108d585>] lock_acquire+0xa5/0x150
       >   [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
       >   [<ffffffff815526ad>] __mutex_lock_common+0x4d/0x3d0
       >   [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
       >   [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
       >   [<ffffffff8120eaf9>] ? ecryptfs_rename+0x99/0x170
       >   [<ffffffff81552b36>] mutex_lock_nested+0x46/0x60
       >   [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >   [<ffffffff8120eb2a>] ecryptfs_rename+0xca/0x170
       >   [<ffffffff81139a9e>] vfs_rename_dir+0x13e/0x160
       >   [<ffffffff8113ac7e>] vfs_rename+0xee/0x290
       >   [<ffffffff8113c212>] ? __lookup_hash+0x102/0x160
       >   [<ffffffff8113d512>] sys_renameat+0x252/0x280
       >   [<ffffffff81133eb4>] ? cp_new_stat+0xe4/0x100
       >   [<ffffffff8101316a>] ? sysret_check+0x2e/0x69
       >   [<ffffffff8108c34d>] ? trace_hardirqs_on_caller+0x14d/0x190
       >   [<ffffffff8113d55b>] sys_rename+0x1b/0x20
       >   [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
      
      The trace above is totally reproducible by doing a cross-directory
      rename on an ecryptfs directory.
      
      The issue seems to be that sys_renameat() does lock_rename() then calls
      into the filesystem; if the filesystem is ecryptfs, then
      ecryptfs_rename() again does lock_rename() on the lower filesystem, and
      lockdep can't tell that the two s_vfs_rename_mutexes are different.  It
      seems an annotation like the following is sufficient to fix this (it
      does get rid of the lockdep trace in my simple tests); however I would
      like to make sure I'm not misunderstanding the locking, hence the CC
      list...
      Signed-off-by: Roland Dreier <rdreier@cisco.com>
      Cc: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      Cc: Dustin Kirkland <kirkland@canonical.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      51ee049e
    • J
      Introduce freeze_super and thaw_super for the fsfreeze ioctl · 18e9e510
      Josef Bacik authored
      Currently the way we do freezing is by passing sb->s_bdev to freeze_bdev and then
      letting it do all the work.  But freezing is more of an fs thing, and doesn't
      really have much to do with the bdev at all, all the work gets done with the
      super.  In btrfs we do not populate s_bdev, since we can have multiple bdev's
      for one fs and setting s_bdev makes removing devices from a pool kind of tricky.
      This means that freezing a btrfs filesystem fails, which causes us to corrupt
      with things like tux-on-ice which use the fsfreeze mechanism.  So instead of
      populating sb->s_bdev with a random bdev in our pool, I've broken the actual fs
      freezing stuff into freeze_super and thaw_super.  These just take the
      super_block that we're freezing and does the appropriate work.  It's basically
      just copy and pasted from freeze_bdev.  I've then converted freeze_bdev over to
      use the new super helpers.  I've tested this with ext4 and btrfs and verified
      everything continues to work the same as before.
      
      The only new gotcha is multiple calls to the fsfreeze ioctl will return EBUSY if
      the fs is already frozen.  I thought this was a better solution than adding a
      freeze counter to the super_block, but if everybody hates this idea I'm open to
      suggestions.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      18e9e510
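The "new gotcha" in the commit message above, that a second freeze request fails with EBUSY instead of being counted, can be sketched as a tiny state model. This is an illustration under stated assumptions, not the kernel functions: `FreezableSuper`, `freeze_super_model`, and `thaw_super_model` are hypothetical names, and the model deliberately does not cover thawing a filesystem that is not frozen.

```python
import errno


class FreezableSuper:
    """Hypothetical superblock that tracks only a frozen/unfrozen state."""

    def __init__(self):
        self.frozen = False


def freeze_super_model(sb):
    """Freeze once; a repeated freeze is rejected rather than counted."""
    if sb.frozen:
        return -errno.EBUSY
    sb.frozen = True
    return 0


def thaw_super_model(sb):
    """Clear the frozen state (assumption: the caller froze it earlier)."""
    sb.frozen = False
    return 0
```

Because there is no freeze counter in this scheme, a single thaw undoes the freeze regardless of how many freeze attempts were made, which is the trade-off the message asks reviewers to weigh.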
    • A
      Trim includes in fs/super.c · e1e46bf1
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e1e46bf1
    • A
      Move grabbing s_umount to callers of grab_super() · d3f21473
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d3f21473
    • A
      Take statfs variants to fs/statfs.c · 7ed1ee61
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7ed1ee61
    • A
      new helper: iterate_supers() · 01a05b33
      Al Viro authored
      ... and switch the simple "loop over superblocks and do something"
      loops to it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      01a05b33
    • A
      Bury __put_super_and_need_restart() · 35cf7ba0
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      35cf7ba0
    • A
      In get_super() and user_get_super() restarts are unconditional · df40c01a
      Al Viro authored
      If superblock had been still alive, we would've returned it...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      df40c01a
    • A
      fix get_active_super()/umount() race · 1494583d
      Al Viro authored
      This one needs restarts...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1494583d
    • A
      fix do_emergency_remount()/umount() races · e7fe0585
      Al Viro authored
      need list_for_each_entry_safe() here.  Original didn't even
      have restart logics, so if you race with umount() it blew up.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e7fe0585
    • A
      6754af64
    • A
      get rid of restarts in sync_filesystems() · 8edd64bd
      Al Viro authored
      At the same time we can kill s_need_restart and local mutex in there.
      __put_super() made public for a while; will be gone later.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8edd64bd
    • A
      Leave superblocks on s_list until the end · 551de6f3
      Al Viro authored
      We used to remove from s_list and s_instances at the same
      time.  So let's *not* do the former and skip superblocks
      that have empty s_instances in the loops over s_list.
      
      The next step, of course, will be to get rid of rescan logics
      in those loops.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      551de6f3
    • A
      Saner locking around deactivate_super() · 1712ac8f
      Al Viro authored
      Make sure that s_umount is acquired *before* we drop the final
      active reference; we still have the fast path (atomic_dec_unless)
      and we have gotten rid of the window between the moment when
      s_active hits zero and s_umount is acquired.  Which simplifies
      the living hell out of grab_super() and inotify pin_to_kill()
      stuff.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1712ac8f
    • A
      get rid of S_BIAS · b20bd1a5
      Al Viro authored
      use atomic_inc_not_zero(&sb->s_active) instead of playing games with
      checking ->s_count > S_BIAS
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b20bd1a5
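The increment-unless-zero idiom named in the commit message above can be modelled as follows. This is a user-space Python sketch, not the kernel primitive: `ActiveCount` and its methods are hypothetical names, and a `threading.Lock` stands in for the atomicity the hardware operation provides.

```python
import threading


class ActiveCount:
    """Reference count that can only be raised while it is still non-zero."""

    def __init__(self, value=1):
        self._value = value
        self._lock = threading.Lock()  # stands in for an atomic instruction

    def inc_not_zero(self):
        """Take a reference unless the count has already dropped to zero."""
        with self._lock:
            if self._value == 0:
                return False  # object is already being torn down
            self._value += 1
            return True

    def dec_and_test(self):
        """Drop a reference; return True if this was the last one."""
        with self._lock:
            self._value -= 1
            return self._value == 0

    def read(self):
        with self._lock:
            return self._value
```

The point of the idiom is that "is it still alive?" and "take a reference" happen as one step, so a caller never resurrects an object whose last reference has already gone.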
    • A
      389b8be6
    • C
      a135aa2c
    • J
      writeback: fix problem with !CONFIG_BLOCK compilation · c2c4986e
      Jens Axboe authored
      When CONFIG_BLOCK isn't enabled:
      
      mm/page-writeback.c: In function 'laptop_mode_timer_fn':
      mm/page-writeback.c:708: error: dereferencing pointer to incomplete type
      mm/page-writeback.c:709: error: dereferencing pointer to incomplete type
      
      Fix this by essentially eliminating the laptop sync handlers when
      CONFIG_BLOCK isn't set, as most are only used from the block layer code.
      The exception is laptop_sync_completion() which is used from sys_sync(),
      make that an empty declaration in that case.
      Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      c2c4986e
  9. 30 Apr, 2010 1 commit
  10. 25 Apr, 2010 1 commit
    • J
      Catch filesystems lacking s_bdi · 5129a469
      Jörn Engel authored
      noop_backing_dev_info is used only as a flag to mark filesystems that
      don't have any backing store, like tmpfs, procfs, spufs, etc.
      Signed-off-by: Joern Engel <joern@logfs.org>
      
      Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
      to the noop_backing_dev_info is not legal and will not result in
      them being flushed, but we already catch this condition in
      __mark_inode_dirty() when checking for a registered bdi.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      5129a469
  11. 04 Mar, 2010 2 commits
    • A
      Mirror MS_KERNMOUNT in ->mnt_flags · 8089352a
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8089352a
    • N
      fs: improve remount,ro vs buffercache coherency · d208bbdd
      Nick Piggin authored
      Invalidate sb->s_bdev on remount,ro.
      
      Fixes a problem reported by Jorge Boncompte who is seeing corruption
      trying to snapshot a minix filesystem image.  Some filesystems modify
      their metadata via a path other than the bdev buffer cache (eg.  they may
      use a private linear mapping for their metadata, or implement directories
      in pagecache, etc).  Also, file data modifications usually go to the bdev
      via their own mappings.
      
      These updates are not coherent with buffercache IO (eg.  via /dev/bdev)
      and never have been.  However there could be a reasonable expectation that
      after a mount -oremount,ro operation then the buffercache should
      subsequently be coherent with previous filesystem modifications.
      
      So invalidate the bdev mappings on a remount,ro operation to provide a
      coherency point.
      
      The problem was exposed when we switched the old rd to brd because old rd
      didn't really function like a normal block device and updates to rd via
      mappings other than the buffercache would still end up going into its
      buffercache.  But the same problem has always affected other "normal"
      block devices, including loop.
      
      [akpm@linux-foundation.org: repair comment layout]
      Reported-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
      Tested-by: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d208bbdd
  12. 24 Dec, 2009 1 commit
  13. 24 Sep, 2009 2 commits
    • C
      freeze_bdev: grab active reference to frozen superblocks · 4504230a
      Christoph Hellwig authored
      Currently we hold s_umount while a filesystem is frozen, despite that we
      might return to userspace and unlock it from a different process.  Instead
      grab an active reference to keep the file system busy and add an explicit
      check for frozen filesystems in remount and reject the remount instead
      of blocking on s_umount.
      
      Add a new get_active_super helper to super.c for use by freeze_bdev that
      grabs an active reference to a superblock from a given block device.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4504230a
    • C
      freeze_bdev: kill bd_mount_sem · 4fadd7bb
      Christoph Hellwig authored
      Now that we have the freeze count there is not much reason for bd_mount_sem
      anymore.  The actual freeze/thaw operations are serialized using the
      bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
      get_sb_bdev which tries to prevent mounting a filesystem while the block
      device is frozen.  Instead add a check for bd_fsfreeze_count and
      return -EBUSY if a filesystem is frozen.  While that is a change in user
      visible behaviour a failing mount is much better for this case rather
      than having the mount process stuck uninterruptible for a long time.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4fadd7bb