1. 07 1月, 2011 3 次提交
    • N
      fs: scale mntget/mntput · b3e19d92
      Nick Piggin 提交于
      The problem that this patch aims to fix is vfsmount refcounting scalability.
      We need to take a reference on the vfsmount for every successful path lookup,
      which often go to the same mount point.
      
      The fundamental difficulty is that a "simple" reference count can never be made
      scalable, because any time a reference is dropped, we must check whether that
      was the last reference. To do that requires communication with all other CPUs
      that may have taken a reference count.
      
      We can make refcounts more scalable in a couple of ways, involving keeping
      distributed counters, and checking for the global-zero condition less
      frequently.
      
      - check the global sum once every interval (this will delay zero detection
        for some interval, so it's probably a showstopper for vfsmounts).
      
      - keep a local count and only taking the global sum when local reaches 0 (this
        is difficult for vfsmounts, because we can't hold preempt off for the life of
        a reference, so a counter would need to be per-thread or tied strongly to a
        particular CPU which requires more locking).
      
      - keep a local difference of increments and decrements, which allows us to sum
        the total difference and hence find the refcount when summing all CPUs. Then,
        keep a single integer "long" refcount for slow and long lasting references,
        and only take the global sum of local counters when the long refcount is 0.
      
      This last scheme is what I implemented here. Attached mounts and process root
      and working directory references are "long" references, and everything else is
      a short reference.
      
      This allows scalable vfsmount references during path walking over mounted
      subtrees and unattached (lazy umounted) mounts with processes still running
      in them.
      
      This results in one fewer atomic op in the fastpath: mntget is now just a
      per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
      and non-atomic decrement in the common case. However code is otherwise bigger
      and heavier, so single threaded performance is basically a wash.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b3e19d92
    • N
      fs: rename vfsmount counter helpers · c6653a83
      Nick Piggin 提交于
      Suggested by Andreas, mnt_ prefix is clearer namespace, follows kernel
      conventions better, and is easier for tab complete. I introduced these
      names so I'll admit they were not good choices.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      c6653a83
    • N
      fs: dcache remove d_mounted · 5f57cbcc
      Nick Piggin 提交于
      Rather than keep a d_mounted count in the dentry, set a dentry flag instead.
      The flag can be cleared by checking the hash table to see if there are any
      mounts left, which is not time critical because it is performed at detach time.
      
      The mounted state of a dentry is only used to speculatively take a look in the
      mount hash table if it is set -- before following the mount, vfsmount lock is
      taken and mount re-checked without races.
      
      This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I
      might use later (and some configs have larger than 32-bit spinlocks which might
      make use of the hole).
      
      Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>:
      In autofs4, when expring direct (or offset) mounts we need to ensure that we
      block user path walks into the autofs mount, which is covered by another mount.
      To do this we clear the mounted status so that follows stop before walking into
      the mount and are essentially blocked until the expire is completed. The
      automount daemon still finds the correct dentry for the umount due to the
      follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as
      an inode operation for direct and offset mounts only and is called following
      the lookup that stopped at the covered mount.
      
      At the end of the expire the covering mount probably has gone away so the
      mounted status need not be restored. But we need to check this and only restore
      the mounted status if the expire failed.
      
      XXX: autofs may not work right if we have other mounts go over the top of it?
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      5f57cbcc
  2. 18 11月, 2010 1 次提交
  3. 26 10月, 2010 1 次提交
  4. 05 10月, 2010 1 次提交
    • J
      BKL: Remove BKL from do_new_mount() · 6841c050
      Jan Blunck 提交于
      After pushing down the BKL to the get_sb/fill_super operations of the
      filesystems that still make usage of the BKL it is safe to remove it from
      do_new_mount().
      
      I've read through all the code formerly covered by the BKL inside
      do_kern_mount() and have satisfied myself that it doesn't need the BKL
      any more.
      Signed-off-by: NJan Blunck <jblunck@infradead.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      6841c050
  5. 08 9月, 2010 1 次提交
    • V
      VFS: Sanity check mount flags passed to change_mnt_propagation() · 7a2e8a8f
      Valerie Aurora 提交于
      Sanity check the flags passed to change_mnt_propagation().  Exactly
      one flag should be set.  Return EINVAL otherwise.
      
      Userspace can pass in arbitrary combinations of MS_* flags to mount().
      do_change_type() is called if any of MS_SHARED, MS_PRIVATE, MS_SLAVE,
      or MS_UNBINDABLE is set.  do_change_type() clears MS_REC and then
      calls change_mnt_propagation() with the rest of the user-supplied
      flags.  change_mnt_propagation() clearly assumes only one flag is set
      but do_change_type() does not check that this is true.  For example,
      mount() with flags MS_SHARED | MS_RDONLY does not actually make the
      mount shared or read-only but does clear MNT_UNBINDABLE.
      Signed-off-by: NValerie Aurora <vaurora@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a2e8a8f
  6. 18 8月, 2010 1 次提交
    • N
      fs: brlock vfsmount_lock · 99b7db7b
      Nick Piggin 提交于
      fs: brlock vfsmount_lock
      
      Use a brlock for the vfsmount lock. It must be taken for write whenever
      modifying the mount hash or associated fields, and may be taken for read when
      performing mount hash lookups.
      
      A new lock is added for the mnt-id allocator, so it doesn't need to take
      the heavy vfsmount write-lock.
      
      The number of atomics should remain the same for fastpath rlock cases, though
      code would be slightly slower due to per-cpu access. Scalability is not not be
      much improved in common cases yet, due to other locks (ie. dcache_lock) getting
      in the way. However path lookups crossing mountpoints should be one case where
      scalability is improved (currently requiring the global lock).
      
      The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
      Altix system (high latency to remote nodes), a simple umount microbenchmark
      (mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
      took 6.8s, afterwards took 7.1s, about 5% slower.
      
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      99b7db7b
  7. 11 8月, 2010 2 次提交
  8. 10 8月, 2010 1 次提交
    • A
      Fix sget() race with failing mount · 7a4dec53
      Al Viro 提交于
      If sget() finds a matching superblock being set up, it'll
      grab an active reference to it and grab s_umount.  That's
      fine - we'll wait for completion of foofs_get_sb() that way.
      However, if said foofs_get_sb() fails we'll end up holding
      the halfway-created superblock.  deactivate_locked_super()
      called by foofs_get_sb() will just unlock the sucker since
      we are holding another active reference to it.
      
      What we need is a way to tell if superblock has been successfully
      set up.  Unfortunately, neither ->s_root nor the check for
      MS_ACTIVE quite fit.  Cheap and easy way, suitable for backport:
      new flag set by the (only) caller of ->get_sb().  If that flag
      isn't present by the time sget() grabbed s_umount on preexisting
      superblock it has found, it's seeing a stillborn and should
      just bury it with deactivate_locked_super() (and repeat the search).
      
      Longer term we want to set that flag in ->get_sb() instances (and
      check for it to distinguish between "sget() found us a live sb"
      and "sget() has allocated an sb, we need to set it up" in there,
      instead of checking ->s_root as we do now).
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org
      7a4dec53
  9. 28 7月, 2010 2 次提交
  10. 15 5月, 2010 1 次提交
    • A
      Fix the regression created by "set S_DEAD on unlink()..." commit · d83c49f3
      Al Viro 提交于
      1) i_flags simply doesn't work for mount/unlink race prevention;
      we may have many links to file and rm on one of those obviously
      shouldn't prevent bind on top of another later on.  To fix it
      right way we need to mark _dentry_ as unsuitable for mounting
      upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
      i_mutex on the inode in question.  Set it (with dont_mount(dentry))
      in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
      in namespace.c that used to check for S_DEAD.  Setting S_DEAD
      is still needed in places where we used to set it (for directories
      getting killed), since we rely on it for readdir/rmdir race
      prevention.
      
      2) rename()/mount() protection has another bogosity - we unhash
      the target before we'd checked that it's not a mountpoint.  Fixed.
      
      3) ancient bogosity in pivot_root() - we locked i_mutex on the
      right directory, but checked S_DEAD on the different (and wrong)
      one.  Noticed and fixed.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d83c49f3
  11. 12 4月, 2010 6 次提交
  12. 04 3月, 2010 7 次提交
  13. 17 1月, 2010 4 次提交
  14. 18 12月, 2009 1 次提交
    • L
      Revert "fix mismerge with Trond's stuff (create_mnt_ns() export is gone now)" · a2770d86
      Linus Torvalds 提交于
      This reverts commit e9496ff4. Quoth Al:
      
       "it's dependent on a lot of other stuff not currently in mainline
        and badly broken with current fs/namespace.c.  Sorry, badly
        out-of-order cherry-pick from old queue.
      
        PS: there's a large pending series reworking the refcounting and
        lifetime rules for vfsmounts that will, among other things, allow to
        rip a subtree away _without_ dissolving connections in it, to be
        garbage-collected when all active references are gone.  It's
        considerably saner wrt "is the subtree busy" logics, but it's nowhere
        near being ready for merge at the moment; this changeset is one of the
        things becoming possible with that sucker, but it certainly shouldn't
        have been picked during this cycle.  My apologies..."
      Noticed-by: NEric Paris <eparis@redhat.com>
      Requested-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2770d86
  15. 17 12月, 2009 1 次提交
  16. 12 10月, 2009 1 次提交
  17. 24 9月, 2009 1 次提交
    • V
      fs: fix overflow in sys_mount() for in-kernel calls · eca6f534
      Vegard Nossum 提交于
      sys_mount() reads/copies a whole page for its "type" parameter.  When
      do_mount_root() passes a kernel address that points to an object which is
      smaller than a whole page, copy_mount_options() will happily go past this
      memory object, possibly dereferencing "wild" pointers that could be in any
      state (hence the kmemcheck warning, which shows that parts of the next
      page are not even allocated).
      
      (The likelihood of something going wrong here is pretty low -- first of
      all this only applies to kernel calls to sys_mount(), which are mostly
      found in the boot code.  Secondly, I guess if the page was not mapped,
      exact_copy_from_user() _would_ in fact handle it correctly because of its
      access_ok(), etc.  checks.)
      
      But it is much nicer to avoid the dubious reads altogether, by stopping as
      soon as we find a NUL byte.  Is there a good reason why we can't do
      something like this, using the already existing strndup_from_user()?
      
      [akpm@linux-foundation.org: make copy_mount_string() static]
      [AV: fix compat mount breakage, which involves undoing akpm's change above]
      Reported-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NVegard Nossum <vegard.nossum@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Nal <al@dizzy.pdmi.ras.ru>
      eca6f534
  18. 08 8月, 2009 1 次提交
  19. 09 7月, 2009 1 次提交
  20. 24 6月, 2009 2 次提交
  21. 23 6月, 2009 1 次提交
    • T
      VFS: Add VFS helper functions for setting up private namespaces · cf8d2c11
      Trond Myklebust 提交于
      The purpose of this patch is to improve the remote mount path lookup
      support for distributed filesystems such as the NFSv4 client.
      
      When given a mount command of the form "mount server:/foo/bar /mnt", the
      NFSv4 client is required to look up the filehandle for "server:/", and
      then look up each component of the remote mount path "foo/bar" in order
      to find the directory that is actually going to be mounted on /mnt.
      Following that remote mount path may involve following symlinks,
      crossing server-side mount points and even following referrals to
      filesystem volumes on other servers.
      
      Since the standard VFS path lookup code already supports walking paths
      that contain all these features (using in-kernel automounts for
      following referrals) we would like to be able to reuse that rather than
      duplicate the full path traversal functionality in the NFSv4 client code.
      
      This patch therefore defines a VFS helper function create_mnt_ns(), that
      sets up a temporary filesystem namespace and attaches a root filesystem to
      it. It exports the create_mnt_ns() and put_mnt_ns() function for use by
      filesystem modules.
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf8d2c11