1. 20 7月, 2013 1 次提交
    • A
      livelock avoidance in sget() · acfec9a5
      Al Viro 提交于
      Eric Sandeen has found a nasty livelock in sget() - take a mount(2) about
      to fail.  The superblock is on ->fs_supers, ->s_umount is held exclusive,
      ->s_active is 1.  Along comes two more processes, trying to mount the same
      thing; sget() in each is picking that superblock, bumping ->s_count and
      trying to grab ->s_umount.  ->s_active is 3 now.  Original mount(2)
      finally gets to deactivate_locked_super() on failure; ->s_active is 2,
      superblock is still ->fs_supers because shutdown will *not* happen until
      ->s_active hits 0.  ->s_umount is dropped and now we have two processes
      chasing each other:
      s_active = 2, A acquired ->s_umount, B blocked
      A sees that the damn thing is stillborn, does deactivate_locked_super()
      s_active = 1, A drops ->s_umount, B gets it
      A restarts the search and finds the same superblock.  And bumps it ->s_active.
      s_active = 2, B holds ->s_umount, A blocked on trying to get it
      ... and we are in the earlier situation with A and B switched places.
      
      The root cause, of course, is that ->s_active should not grow until we'd
      got MS_BORN.  Then failing ->mount() will have deactivate_locked_super()
      shut the damn thing down.  Fortunately, it's easy to do - the key point
      is that grab_super() is called only for superblocks currently on ->fs_supers,
      so it can bump ->s_count and grab ->s_umount first, then check MS_BORN and
      bump ->s_active; we must never increment ->s_count for superblocks past
      ->kill_sb(), but grab_super() is never called for those.
      
      The bug is pretty old; we would've caught it by now, if not for accidental
      exclusion between sget() for block filesystems; the things like cgroup or
      e.g. mtd-based filesystems don't have anything of that sort, so they get
      bitten.  The right way to deal with that is obviously to fix sget()...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      acfec9a5
  2. 28 2月, 2013 2 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
    • T
      idr: remove MAX_IDR_MASK and move left MAX_IDR_* into idr.c · e8c8d1bc
      Tejun Heo 提交于
      MAX_IDR_MASK is another weirdness in the idr interface.  As idr covers
      whole positive integer range, it's defined as 0x7fffffff or INT_MAX.
      
      Its usage in idr_find(), idr_replace() and idr_remove() is bizarre.
      They basically mask off the sign bit and operate on the rest, so if
      the caller, by accident, passes in a negative number, the sign bit
      will be masked off and the remaining part will be used as if that was
      the input, which is worse than crashing.
      
      The constant is visible in idr.h and there are several users in the
      kernel.
      
      * drivers/i2c/i2c-core.c:i2c_add_numbered_adapter()
      
        Basically used to test if adap->nr is a negative number which isn't
        -1 and returns -EINVAL if so.  idr_alloc() already has negative
        @start checking (w/ WARN_ON_ONCE), so this can go away.
      
      * drivers/infiniband/core/cm.c:cm_alloc_id()
        drivers/infiniband/hw/mlx4/cm.c:id_map_alloc()
      
        Used to wrap cyclic @start.  Can be replaced with max(next, 0).
        Note that this type of cyclic allocation using idr is buggy.  These
        are prone to spurious -ENOSPC failure after the first wraparound.
      
      * fs/super.c:get_anon_bdev()
      
        The ID allocated from ida is masked off before being tested whether
        it's inside valid range.  ida allocated ID can never be a negative
        number and the masking is unnecessary.
      
      Update idr_*() functions to fail with -EINVAL when negative @id is
      specified and update other MAX_IDR_MASK users as described above.
      
      This leaves MAX_IDR_MASK without any user, remove it and relocate
      other MAX_IDR_* constants to lib/idr.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jean Delvare <khali@linux-fr.org>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Cc: "Marciniszyn, Mike" <mike.marciniszyn@intel.com>
      Cc: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: NWolfram Sang <wolfram@the-dreams.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8c8d1bc
  3. 10 10月, 2012 1 次提交
  4. 06 10月, 2012 1 次提交
  5. 03 10月, 2012 1 次提交
  6. 04 8月, 2012 1 次提交
    • A
      vfs: kill write_super and sync_supers · f0cd2dbb
      Artem Bityutskiy 提交于
      Finally we can kill the 'sync_supers' kernel thread along with the
      '->write_super()' superblock operation because all the users are gone.
      Now every file-system is supposed to self-manage own superblock and
      its dirty state.
      
      The nice thing about killing this thread is that it improves power management.
      Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
      every 5 seconds no matter what - even if there were no dirty superblocks and
      even if there were no file-systems using this service (e.g., btrfs and
      journalled ext4 do not need it). So it was wasting power most of the time. And
      because the thread was in the core of the kernel, all systems had to have it.
      So I am quite happy to make it go away.
      
      Interestingly, this thread is a left-over from the pdflush kernel thread which
      was a self-forking kernel thread responsible for all the write-back in old
      Linux kernels. It was turned into per-block device BDI threads, and
      'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.
      Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f0cd2dbb
  7. 01 8月, 2012 1 次提交
  8. 31 7月, 2012 2 次提交
  9. 14 7月, 2012 1 次提交
  10. 09 6月, 2012 1 次提交
  11. 29 2月, 2012 1 次提交
  12. 14 2月, 2012 2 次提交
  13. 24 1月, 2012 1 次提交
    • D
      mm: cleancache: s/flush/invalidate/ · 3167760f
      Dan Magenheimer 提交于
      Per akpm suggestions alter the use of the term flush to be
      invalidate. The next patch will do this across all MM.
      
      This change is completely cosmetic.
      
      [v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 3]
      Signed-off-by: NDan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Reviewed-by: NSeth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Rik Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      [v10: Fixed  fs: move code out of buffer.c conflict change]
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      3167760f
  14. 18 1月, 2012 1 次提交
    • K
      wake up s_wait_unfrozen when ->freeze_fs fails · e1616300
      Kazuya Mio 提交于
      dd slept infinitely when fsfeeze failed because of EIO.
      To fix this problem, if ->freeze_fs fails, freeze_super() wakes up
      the tasks waiting for the filesystem to become unfrozen.
      
      When s_frozen isn't SB_UNFROZEN in __generic_file_aio_write(),
      the function sleeps until FITHAW ioctl wakes up s_wait_unfrozen.
      
      However, if ->freeze_fs fails, s_frozen is set to SB_UNFROZEN and then
      freeze_super() returns an error number. In this case, FITHAW ioctl returns
      EINVAL because s_frozen is already SB_UNFROZEN. There is no way to wake up
      s_wait_unfrozen, so __generic_file_aio_write() sleeps infinitely.
      Signed-off-by: NKazuya Mio <k-mio@sx.jp.nec.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e1616300
  15. 07 1月, 2012 3 次提交
  16. 04 1月, 2012 3 次提交
    • A
      vfs: fix the rest of sget() races · dabe0dc1
      Al Viro 提交于
      unfortunately, just checking MS_BORN after having grabbed ->s_umount in
      sget() is not enough; places that pick superblock from a list and
      grab s_umount shared need the same check in addition to checking for
      ->s_root; otherwise three-way race between failing mount, sget() and
      such list-walker can leave us with list-walker coming *second*, when
      temporary active ref grabbed by sget() (to be dropped when sget()
      notices that original mount has failed by checking MS_BORN) has
      lead to deactivate_locked_super() from failing ->mount() *not* doing
      ->kill_sb() and just releasing ->s_umount.  Once sget() gets through
      and notices that MS_BORN had never been set it will drop the active
      ref and fs will be shut down and kicked out of all lists, but it's
      too late for something like sync_supers().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      dabe0dc1
    • A
      vfs: convert fs_supers to hlist · a5166169
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a5166169
    • A
      trim fs/internal.h · f47ec3f2
      Al Viro 提交于
      some stuff in there can actually become static; some belongs to pnode.h
      as it's a private interface between namespace.c and pnode.c...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f47ec3f2
  17. 02 11月, 2011 1 次提交
  18. 01 11月, 2011 1 次提交
  19. 21 7月, 2011 3 次提交
    • D
      vfs: increase shrinker batch size · 8ab47664
      Dave Chinner 提交于
      Now that the per-sb shrinker is responsible for shrinking 2 or more
      caches, increase the batch size to keep econmies of scale for
      shrinking each cache.  Increase the shrinker batch size to 1024
      objects.
      
      To allow for a large increase in batch size, add a conditional
      reschedule to prune_icache_sb() so that we don't hold the LRU spin
      lock for too long. This mirrors the behaviour of the
      __shrink_dcache_sb(), and allows us to increase the batch size
      without needing to worry about problems caused by long lock hold
      times.
      
      To ensure that filesystems using the per-sb shrinker callouts don't
      cause problems, document that the object freeing method must
      reschedule appropriately inside loops.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8ab47664
    • D
      superblock: add filesystem shrinker operations · 0e1fdafd
      Dave Chinner 提交于
      Now we have a per-superblock shrinker implementation, we can add a
      filesystem specific callout to it to allow filesystem internal
      caches to be shrunk by the superblock shrinker.
      
      Rather than perpetuate the multipurpose shrinker callback API (i.e.
      nr_to_scan == 0 meaning "tell me how many objects freeable in the
      cache), two operations will be added. The first will return the
      number of objects that are freeable, the second is the actual
      shrinker call.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0e1fdafd
    • D
      superblock: introduce per-sb cache shrinker infrastructure · b0d40c92
      Dave Chinner 提交于
      With context based shrinkers, we can implement a per-superblock
      shrinker that shrinks the caches attached to the superblock. We
      currently have global shrinkers for the inode and dentry caches that
      split up into per-superblock operations via a coarse proportioning
      method that does not batch very well.  The global shrinkers also
      have a dependency - dentries pin inodes - so we have to be very
      careful about how we register the global shrinkers so that the
      implicit call order is always correct.
      
      With a per-sb shrinker callout, we can encode this dependency
      directly into the per-sb shrinker, hence avoiding the need for
      strictly ordering shrinker registrations. We also have no need for
      any proportioning code for the shrinker subsystem already provides
      this functionality across all shrinkers. Allowing the shrinker to
      operate on a single superblock at a time means that we do less
      superblock list traversals and locking and reclaim should batch more
      effectively. This should result in less CPU overhead for reclaim and
      potentially faster reclaim of items from each filesystem.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b0d40c92
  20. 20 7月, 2011 5 次提交
  21. 12 7月, 2011 1 次提交
    • J
      fixlet: Remove fs_excl from struct task. · 4aede84b
      Justin TerAvest 提交于
      fs_excl is a poor man's priority inheritance for filesystems to hint to
      the block layer that an operation is important. It was never clearly
      specified, not widely adopted, and will not prevent starvation in many
      cases (like across cgroups).
      
      fs_excl was introduced with the time sliced CFQ IO scheduler, to
      indicate when a process held FS exclusive resources and thus needed
      a boost.
      
      It doesn't cover all file systems, and it was never fully complete.
      Lets kill it.
      Signed-off-by: NJustin TerAvest <teravest@google.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      4aede84b
  22. 04 6月, 2011 1 次提交
    • A
      more conservative S_NOSEC handling · 9e1f1de0
      Al Viro 提交于
      Caching "we have already removed suid/caps" was overenthusiastic as merged.
      On network filesystems we might have had suid/caps set on another client,
      silently picked by this client on revalidate, all of that *without* clearing
      the S_NOSEC flag.
      
      AFAICS, the only reasonably sane way to deal with that is
      	* new superblock flag; unless set, S_NOSEC is not going to be set.
      	* local block filesystems set it in their ->mount() (more accurately,
      mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
      local block ones clear it)
      	* if any network filesystem (or a cluster one) wants to use S_NOSEC,
      it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
      inode attribute changes are picked from other clients.
      
      It's not an earth-shattering hole (anybody that can set suid on another client
      will almost certainly be able to write to the file before doing that anyway),
      but it's a bug that needs fixing.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9e1f1de0
  23. 27 5月, 2011 1 次提交
    • D
      mm/fs: add hooks to support cleancache · c515e1fd
      Dan Magenheimer 提交于
      This fourth patch of eight in this cleancache series provides the
      core hooks in VFS for: initializing cleancache per filesystem;
      capturing clean pages reclaimed by page cache; attempting to get
      pages from cleancache before filesystem read; and ensuring coherency
      between pagecache, disk, and cleancache.  Note that the placement
      of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
      change was required due to a patchset in 2.6.39.
      
      All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
      a check of a boolean global if CONFIG_CLEANCACHE is set but no
      cleancache "backend" has claimed cleancache_ops.
      
      Details and a FAQ can be found in Documentation/vm/cleancache.txt
      
      [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      Signed-off-by: NDan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: NJeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andreas Dilger <adilger@sun.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      c515e1fd
  24. 19 5月, 2011 1 次提交
  25. 18 3月, 2011 1 次提交
    • A
      vfs: split off vfsmount-related parts of vfs_kern_mount() · 9d412a43
      Al Viro 提交于
      new function: mount_fs().  Does all work done by vfs_kern_mount()
      except the allocation and filling of vfsmount; returns root dentry
      or ERR_PTR().
      
      vfs_kern_mount() switched to using it and taken to fs/namespace.c,
      along with its wrappers.
      
      alloc_vfsmnt()/free_vfsmnt() made static.
      
      functions in namespace.c slightly reordered.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9d412a43
  26. 17 3月, 2011 2 次提交