1. 18 3月, 2011 4 次提交
    • A
      change the locking order for namespace_sem · b12cea91
      Al Viro 提交于
      Have it nested inside ->i_mutex.  Instead of using follow_down()
      under namespace_sem, followed by grabbing i_mutex and checking that
      mountpoint to be is not dead, do the following:
      	grab i_mutex
      	check that it's not dead
      	grab namespace_sem
      	see if anything is mounted there
      	if not, we've won
      	otherwise
      		drop locks
      		put_path on what we had
      		replace with what's mounted
      		retry everything with new mountpoint to be
      
      New helper (lock_mount()) does that.  do_add_mount(), do_move_mount(),
      do_loopback() and pivot_root() switched to it; in case of the last
      two that eliminates a race we used to have - original code didn't
      do follow_down().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b12cea91
    • A
      fix deadlock in pivot_root() · 27cb1572
      Al Viro 提交于
      Don't hold vfsmount_lock over the loop traversing ->mnt_parent;
      do check_mnt(new.mnt) under namespace_sem instead; combined with
      namespace_sem held over all that code it'll guarantee the stability
      of ->mnt_parent chain all the way to the root.
      
      Doing check_mnt() outside of namespace_sem in case of pivot_root()
      is wrong anyway.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      27cb1572
    • A
      vfs: split off vfsmount-related parts of vfs_kern_mount() · 9d412a43
      Al Viro 提交于
      new function: mount_fs().  Does all work done by vfs_kern_mount()
      except the allocation and filling of vfsmount; returns root dentry
      or ERR_PTR().
      
      vfs_kern_mount() switched to using it and taken to fs/namespace.c,
      along with its wrappers.
      
      alloc_vfsmnt()/free_vfsmnt() made static.
      
      functions in namespace.c slightly reordered.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9d412a43
    • A
      kill simple_set_mnt() · 474a00ee
      Al Viro 提交于
      not needed anymore, since all users (->get_sb() instances) are gone.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      474a00ee
  2. 17 3月, 2011 1 次提交
  3. 15 3月, 2011 1 次提交
  4. 04 3月, 2011 1 次提交
    • E
      LSM: Pass -o remount options to the LSM · ff36fe2c
      Eric Paris 提交于
      The VFS mount code passes the mount options to the LSM.  The LSM will remove
      options it understands from the data and the VFS will then pass the remaining
      options onto the underlying filesystem.  This is how options like the
      SELinux context= work.  The problem comes in that -o remount never calls
      into LSM code.  So if you include an LSM specific option it will get passed
      to the filesystem and will cause the remount to fail.  An example of where
      this is a problem is the 'seclabel' option.  The SELinux LSM hook will
      print this word in /proc/mounts if the filesystem is being labeled using
      xattrs.  If you pass this word on mount it will be silently stripped and
      ignored.  But if you pass this word on remount the LSM never gets called
      and it will be passed to the FS.  The FS doesn't know what seclabel means
      and thus should fail the mount.  For example an ext3 fs mounted over loop
      
      # mount -o loop /tmp/fs /mnt/tmp
      # cat /proc/mounts | grep /mnt/tmp
      /dev/loop0 /mnt/tmp ext3 rw,seclabel,relatime,errors=continue,barrier=0,data=ordered 0 0
      # mount -o remount /mnt/tmp
      mount: /mnt/tmp not mounted already, or bad option
      # dmesg
      EXT3-fs (loop0): error: unrecognized mount option "seclabel" or missing value
      
      This patch passes the remount mount options to an new LSM hook.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Reviewed-by: NJames Morris <jmorris@namei.org>
      ff36fe2c
  5. 24 2月, 2011 1 次提交
  6. 17 1月, 2011 6 次提交
    • A
      tidy up around finish_automount() · b1e75df4
      Al Viro 提交于
      do_add_mount() and mnt_clear_expiry() are not needed outside of
      namespace.c anymore, now that namei has finish_automount() to
      use.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b1e75df4
    • A
      don't drop newmnt on error in do_add_mount() · 15f9a3f3
      Al Viro 提交于
      That gets rid of the kludge in finish_automount() - we need
      to keep refcount on the vfsmount as-is until we evict it from
      expiry list.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      15f9a3f3
    • A
      Take the completion of automount into new helper · 19a167af
      Al Viro 提交于
      ... and shift it from namei.c to namespace.c
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      19a167af
    • A
      VFS: Fix UP compile error in fs/namespace.c · 7e3d0eb0
      Al Viro 提交于
      mnt_longterm is there only on SMP
      Reported-and-tested-by: NJoachim Eastwood <manabian@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e3d0eb0
    • A
      sanitize vfsmount refcounting changes · f03c6599
      Al Viro 提交于
      Instead of splitting refcount between (per-cpu) mnt_count
      and (SMP-only) mnt_longrefs, make all references contribute
      to mnt_count again and keep track of how many are longterm
      ones.
      
      Accounting rules for longterm count:
      	* 1 for each fs_struct.root.mnt
      	* 1 for each fs_struct.pwd.mnt
      	* 1 for having non-NULL ->mnt_ns
      	* decrement to 0 happens only under vfsmount lock exclusive
      
      That allows nice common case for mntput() - since we can't drop the
      final reference until after mnt_longterm has reached 0 due to the rules
      above, mntput() can grab vfsmount lock shared and check mnt_longterm.
      If it turns out to be non-zero (which is the common case), we know
      that this is not the final mntput() and can just blindly decrement
      percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
      do usual decrement-and-check of percpu mnt_count.
      
      For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
      namespace.c uses the latter in places where we don't already hold
      vfsmount lock exclusive and opencodes a few remaining spots where
      we need to manipulate mnt_longterm.
      
      Note that we mostly revert the code outside of fs/namespace.c back
      to what we used to have; in particular, normal code doesn't need
      to care about two kinds of references, etc.  And we get to keep
      the optimization Nick's variant had bought us...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f03c6599
    • A
      fix old umount_tree() breakage · 7b8a53fd
      Al Viro 提交于
      Expiry-related code calls umount_tree() several times with
      the same list to collect vfsmounts to.  Which is fine, except
      that umount_tree() implicitly assumed that the list would
      be empty on each call - it moves the victims over there and
      then iterates through the list kicking them out.  It's *almost*
      idempotent, so everything nearly worked.  However, mnt->ghosts
      handling (and thus expirability checks) had been broken - that
      part was not idempotent...
      
      The fix is trivial - use local temporary list, splice it to
      the the collector list when we are through.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7b8a53fd
  7. 16 1月, 2011 2 次提交
    • D
      Unexport do_add_mount() and add in follow_automount(), not ->d_automount() · ea5b778a
      David Howells 提交于
      Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
      added rather than calling do_add_mount() itself.  follow_automount() will then
      do the addition.
      
      This slightly complicates things as ->d_automount() normally wants to add the
      new vfsmount to an expiration list and start an expiration timer.  The problem
      with that is that the vfsmount will be deleted if it has a refcount of 1 and
      the timer will not repeat if the expiration list is empty.
      
      To this end, we require the vfsmount to be returned from d_automount() with a
      refcount of (at least) 2.  One of these refs will be dropped unconditionally.
      In addition, follow_automount() must get a 3rd ref around the call to
      do_add_mount() lest it eat a ref and return an error, leaving the mount we
      have open to being expired as we would otherwise have only 1 ref on it.
      
      d_automount() should also add the the vfsmount to the expiration list (by
      calling mnt_set_expiry()) and start the expiration timer before returning, if
      this mechanism is to be used.  The vfsmount will be unlinked from the
      expiration list by follow_automount() if do_add_mount() fails.
      
      This patch also fixes the call to do_add_mount() for AFS to propagate the mount
      flags from the parent vfsmount.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ea5b778a
    • D
      Add a dentry op to allow processes to be held during pathwalk transit · cc53ce53
      David Howells 提交于
      Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
      sleep when it tries to transit away from one of that filesystem's directories
      during a pathwalk.  The operation is keyed off a new dentry flag
      (DCACHE_MANAGE_TRANSIT).
      
      The filesystem is allowed to be selective about which processes it holds and
      which it permits to continue on or prohibits from transiting from each flagged
      directory.  This will allow autofs to hold up client processes whilst letting
      its userspace daemon through to maintain the directory or the stuff behind it
      or mounted upon it.
      
      The ->d_manage() dentry operation:
      
      	int (*d_manage)(struct path *path, bool mounting_here);
      
      takes a pointer to the directory about to be transited away from and a flag
      indicating whether the transit is undertaken by do_add_mount() or
      do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
      
      It should return 0 if successful and to let the process continue on its way;
      -EISDIR to prohibit the caller from skipping to overmounted filesystems or
      automounting, and to use this directory; or some other error code to return to
      the user.
      
      ->d_manage() is called with namespace_sem writelocked if mounting_here is true
      and no other locks held, so it may sleep.  However, if mounting_here is true,
      it may not initiate or wait for a mount or unmount upon the parameter
      directory, even if the act is actually performed by userspace.
      
      Within fs/namei.c, follow_managed() is extended to check with d_manage() first
      on each managed directory, before transiting away from it or attempting to
      automount upon it.
      
      follow_down() is renamed follow_down_one() and should only be used where the
      filesystem deliberately intends to avoid management steps (e.g. autofs).
      
      A new follow_down() is added that incorporates the loop done by all other
      callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
      and CIFS do use it, their use is removed by converting them to use
      d_automount()).  The new follow_down() calls d_manage() as appropriate.  It
      also takes an extra parameter to indicate if it is being called from mount code
      (with namespace_sem writelocked) which it passes to d_manage().  follow_down()
      ignores automount points so that it can be used to mount on them.
      
      __follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
      DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
      sleep.  It would be possible to enter d_manage() in rcu-walk mode too, and have
      that determine whether to abort or not itself.  That would allow the autofs
      daemon to continue on in rcu-walk mode.
      
      Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
      required as every tranist from that directory will cause d_manage() to be
      invoked.  It can always be set again when necessary.
      
      ==========================
      WHAT THIS MEANS FOR AUTOFS
      ==========================
      
      Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
      trigger the automounting of indirect mounts, and both of these can be called
      with i_mutex held.
      
      autofs knows that the i_mutex will be held by the caller in lookup(), and so
      can drop it before invoking the daemon - but this isn't so for d_revalidate(),
      since the lock is only held on _some_ of the code paths that call it.  This
      means that autofs can't risk dropping i_mutex from its d_revalidate() function
      before it calls the daemon.
      
      The bug could manifest itself as, for example, a process that's trying to
      validate an automount dentry that gets made to wait because that dentry is
      expired and needs cleaning up:
      
      	mkdir         S ffffffff8014e05a     0 32580  24956
      	Call Trace:
      	 [<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
      	 [<ffffffff80127f7d>] avc_has_perm+0x46/0x58
      	 [<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
      	 [<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
      	 [<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
      	 [<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
      	 [<ffffffff80057a2f>] lookup_create+0x46/0x80
      	 [<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
      
      versus the automount daemon which wants to remove that dentry, but can't
      because the normal process is holding the i_mutex lock:
      
      	automount     D ffffffff8014e05a     0 32581      1              32561
      	Call Trace:
      	 [<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
      	 [<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
      	 [<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
      	 [<ffffffff800e6d55>] do_rmdir+0x77/0xde
      	 [<ffffffff8005d229>] tracesys+0x71/0xe0
      	 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
      
      which means that the system is deadlocked.
      
      This patch allows autofs to hold up normal processes whilst the daemon goes
      ahead and does things to the dentry tree behind the automouter point without
      risking a deadlock as almost no locks are held in d_manage() and none in
      d_automount().
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Was-Acked-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cc53ce53
  8. 07 1月, 2011 3 次提交
    • N
      fs: scale mntget/mntput · b3e19d92
      Nick Piggin 提交于
      The problem that this patch aims to fix is vfsmount refcounting scalability.
      We need to take a reference on the vfsmount for every successful path lookup,
      which often go to the same mount point.
      
      The fundamental difficulty is that a "simple" reference count can never be made
      scalable, because any time a reference is dropped, we must check whether that
      was the last reference. To do that requires communication with all other CPUs
      that may have taken a reference count.
      
      We can make refcounts more scalable in a couple of ways, involving keeping
      distributed counters, and checking for the global-zero condition less
      frequently.
      
      - check the global sum once every interval (this will delay zero detection
        for some interval, so it's probably a showstopper for vfsmounts).
      
      - keep a local count and only taking the global sum when local reaches 0 (this
        is difficult for vfsmounts, because we can't hold preempt off for the life of
        a reference, so a counter would need to be per-thread or tied strongly to a
        particular CPU which requires more locking).
      
      - keep a local difference of increments and decrements, which allows us to sum
        the total difference and hence find the refcount when summing all CPUs. Then,
        keep a single integer "long" refcount for slow and long lasting references,
        and only take the global sum of local counters when the long refcount is 0.
      
      This last scheme is what I implemented here. Attached mounts and process root
      and working directory references are "long" references, and everything else is
      a short reference.
      
      This allows scalable vfsmount references during path walking over mounted
      subtrees and unattached (lazy umounted) mounts with processes still running
      in them.
      
      This results in one fewer atomic op in the fastpath: mntget is now just a
      per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
      and non-atomic decrement in the common case. However code is otherwise bigger
      and heavier, so single threaded performance is basically a wash.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b3e19d92
    • N
      fs: rename vfsmount counter helpers · c6653a83
      Nick Piggin 提交于
      Suggested by Andreas, mnt_ prefix is clearer namespace, follows kernel
      conventions better, and is easier for tab complete. I introduced these
      names so I'll admit they were not good choices.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      c6653a83
    • N
      fs: dcache remove d_mounted · 5f57cbcc
      Nick Piggin 提交于
      Rather than keep a d_mounted count in the dentry, set a dentry flag instead.
      The flag can be cleared by checking the hash table to see if there are any
      mounts left, which is not time critical because it is performed at detach time.
      
      The mounted state of a dentry is only used to speculatively take a look in the
      mount hash table if it is set -- before following the mount, vfsmount lock is
      taken and mount re-checked without races.
      
      This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I
      might use later (and some configs have larger than 32-bit spinlocks which might
      make use of the hole).
      
      Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>:
      In autofs4, when expring direct (or offset) mounts we need to ensure that we
      block user path walks into the autofs mount, which is covered by another mount.
      To do this we clear the mounted status so that follows stop before walking into
      the mount and are essentially blocked until the expire is completed. The
      automount daemon still finds the correct dentry for the umount due to the
      follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as
      an inode operation for direct and offset mounts only and is called following
      the lookup that stopped at the covered mount.
      
      At the end of the expire the covering mount probably has gone away so the
      mounted status need not be restored. But we need to check this and only restore
      the mounted status if the expire failed.
      
      XXX: autofs may not work right if we have other mounts go over the top of it?
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      5f57cbcc
  9. 18 11月, 2010 1 次提交
  10. 26 10月, 2010 1 次提交
  11. 05 10月, 2010 1 次提交
    • J
      BKL: Remove BKL from do_new_mount() · 6841c050
      Jan Blunck 提交于
      After pushing down the BKL to the get_sb/fill_super operations of the
      filesystems that still make usage of the BKL it is safe to remove it from
      do_new_mount().
      
      I've read through all the code formerly covered by the BKL inside
      do_kern_mount() and have satisfied myself that it doesn't need the BKL
      any more.
      Signed-off-by: NJan Blunck <jblunck@infradead.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      6841c050
  12. 08 9月, 2010 1 次提交
    • V
      VFS: Sanity check mount flags passed to change_mnt_propagation() · 7a2e8a8f
      Valerie Aurora 提交于
      Sanity check the flags passed to change_mnt_propagation().  Exactly
      one flag should be set.  Return EINVAL otherwise.
      
      Userspace can pass in arbitrary combinations of MS_* flags to mount().
      do_change_type() is called if any of MS_SHARED, MS_PRIVATE, MS_SLAVE,
      or MS_UNBINDABLE is set.  do_change_type() clears MS_REC and then
      calls change_mnt_propagation() with the rest of the user-supplied
      flags.  change_mnt_propagation() clearly assumes only one flag is set
      but do_change_type() does not check that this is true.  For example,
      mount() with flags MS_SHARED | MS_RDONLY does not actually make the
      mount shared or read-only but does clear MNT_UNBINDABLE.
      Signed-off-by: NValerie Aurora <vaurora@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a2e8a8f
  13. 18 8月, 2010 1 次提交
    • N
      fs: brlock vfsmount_lock · 99b7db7b
      Nick Piggin 提交于
      fs: brlock vfsmount_lock
      
      Use a brlock for the vfsmount lock. It must be taken for write whenever
      modifying the mount hash or associated fields, and may be taken for read when
      performing mount hash lookups.
      
      A new lock is added for the mnt-id allocator, so it doesn't need to take
      the heavy vfsmount write-lock.
      
      The number of atomics should remain the same for fastpath rlock cases, though
      code would be slightly slower due to per-cpu access. Scalability is not not be
      much improved in common cases yet, due to other locks (ie. dcache_lock) getting
      in the way. However path lookups crossing mountpoints should be one case where
      scalability is improved (currently requiring the global lock).
      
      The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
      Altix system (high latency to remote nodes), a simple umount microbenchmark
      (mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
      took 6.8s, afterwards took 7.1s, about 5% slower.
      
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      99b7db7b
  14. 11 8月, 2010 2 次提交
  15. 10 8月, 2010 1 次提交
    • A
      Fix sget() race with failing mount · 7a4dec53
      Al Viro 提交于
      If sget() finds a matching superblock being set up, it'll
      grab an active reference to it and grab s_umount.  That's
      fine - we'll wait for completion of foofs_get_sb() that way.
      However, if said foofs_get_sb() fails we'll end up holding
      the halfway-created superblock.  deactivate_locked_super()
      called by foofs_get_sb() will just unlock the sucker since
      we are holding another active reference to it.
      
      What we need is a way to tell if superblock has been successfully
      set up.  Unfortunately, neither ->s_root nor the check for
      MS_ACTIVE quite fit.  Cheap and easy way, suitable for backport:
      new flag set by the (only) caller of ->get_sb().  If that flag
      isn't present by the time sget() grabbed s_umount on preexisting
      superblock it has found, it's seeing a stillborn and should
      just bury it with deactivate_locked_super() (and repeat the search).
      
      Longer term we want to set that flag in ->get_sb() instances (and
      check for it to distinguish between "sget() found us a live sb"
      and "sget() has allocated an sb, we need to set it up" in there,
      instead of checking ->s_root as we do now).
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org
      7a4dec53
  16. 28 7月, 2010 2 次提交
  17. 15 5月, 2010 1 次提交
    • A
      Fix the regression created by "set S_DEAD on unlink()..." commit · d83c49f3
      Al Viro 提交于
      1) i_flags simply doesn't work for mount/unlink race prevention;
      we may have many links to file and rm on one of those obviously
      shouldn't prevent bind on top of another later on.  To fix it
      right way we need to mark _dentry_ as unsuitable for mounting
      upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
      i_mutex on the inode in question.  Set it (with dont_mount(dentry))
      in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
      in namespace.c that used to check for S_DEAD.  Setting S_DEAD
      is still needed in places where we used to set it (for directories
      getting killed), since we rely on it for readdir/rmdir race
      prevention.
      
      2) rename()/mount() protection has another bogosity - we unhash
      the target before we'd checked that it's not a mountpoint.  Fixed.
      
      3) ancient bogosity in pivot_root() - we locked i_mutex on the
      right directory, but checked S_DEAD on the different (and wrong)
      one.  Noticed and fixed.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d83c49f3
  18. 12 4月, 2010 6 次提交
  19. 04 3月, 2010 4 次提交