1. 24 10月, 2014 1 次提交
  2. 15 10月, 2014 1 次提交
  3. 09 10月, 2014 8 次提交
    • S
      vfs: move getname() from callers to do_mount() · 5e6123f3
      Seunghun Lee 提交于
      It would make more sense to pass char __user * instead of
      char * in callers of do_mount() and do getname() inside do_mount().
      Suggested-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NSeunghun Lee <waydi1@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5e6123f3
    • T
      fs: namespace: suppress 'may be used uninitialized' warnings · b8850d1f
      Tim Gardner 提交于
      The gcc version 4.9.1 compiler complains Even though it isn't possible for
      these variables to not get initialized before they are used.
      
      fs/namespace.c: In function ‘SyS_mount’:
      fs/namespace.c:2720:8: warning: ‘kernel_dev’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
              ^
      fs/namespace.c:2699:8: note: ‘kernel_dev’ was declared here
        char *kernel_dev;
              ^
      fs/namespace.c:2720:8: warning: ‘kernel_type’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
              ^
      fs/namespace.c:2697:8: note: ‘kernel_type’ was declared here
        char *kernel_type;
              ^
      
      Fix the warnings by simplifying copy_mount_string() as suggested by Al Viro.
      
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NTim Gardner <tim.gardner@canonical.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b8850d1f
    • E
      vfs: Add a function to lazily unmount all mounts from any dentry. · 80b5dce8
      Eric W. Biederman 提交于
      The new function detach_mounts comes in two pieces.  The first piece
      is a static inline test of d_mounpoint that returns immediately
      without taking any locks if d_mounpoint is not set.  In the common
      case when mountpoints are absent this allows the vfs to continue
      running with it's same cacheline foot print.
      
      The second piece of detach_mounts __detach_mounts actually does the
      work and it assumes that a mountpoint is present so it is slow and
      takes namespace_sem for write, and then locks the mount hash (aka
      mount_lock) after a struct mountpoint has been found.
      
      With those two locks held each entry on the list of mounts on a
      mountpoint is selected and lazily unmounted until all of the mount
      have been lazily unmounted.
      
      v7: Wrote a proper change description and removed the changelog
          documenting deleted wrong turns.
      Signed-off-by: NEric W. Biederman <ebiederman@twitter.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      80b5dce8
    • E
      vfs: factor out lookup_mountpoint from new_mountpoint · e2dfa935
      Eric W. Biederman 提交于
      I am shortly going to add a new user of struct mountpoint that
      needs to look up existing entries but does not want to create
      a struct mountpoint if one does not exist.  Therefore to keep
      the code simple and easy to read split out lookup_mountpoint
      from new_mountpoint.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e2dfa935
    • E
      vfs: Keep a list of mounts on a mount point · 0a5eb7c8
      Eric W. Biederman 提交于
      To spot any possible problems call BUG if a mountpoint
      is put when it's list of mounts is not empty.
      
      AV: use hlist instead of list_head
      Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: NEric W. Biederman <ebiederman@twitter.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0a5eb7c8
    • E
      vfs: Don't allow overwriting mounts in the current mount namespace · 7af1364f
      Eric W. Biederman 提交于
      In preparation for allowing mountpoints to be renamed and unlinked
      in remote filesystems and in other mount namespaces test if on a dentry
      there is a mount in the local mount namespace before allowing it to
      be renamed or unlinked.
      
      The primary motivation here are old versions of fusermount unmount
      which is not safe if the a path can be renamed or unlinked while it is
      verifying the mount is safe to unmount.  More recent versions are simpler
      and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
      in a directory owned by an arbitrary user.
      
      Miklos Szeredi <miklos@szeredi.hu> reports this is approach is good
      enough to remove concerns about new kernels mixed with old versions
      of fusermount.
      
      A secondary motivation for restrictions here is that it removing empty
      directories that have non-empty mount points on them appears to
      violate the rule that rmdir can not remove empty directories.  As
      Linus Torvalds pointed out this is useful for programs (like git) that
      test if a directory is empty with rmdir.
      
      Therefore this patch arranges to enforce the existing mount point
      semantics for local mount namespace.
      
      v2: Rewrote the test to be a drop in replacement for d_mountpoint
      v3: Use bool instead of int as the return type of is_local_mountpoint
      Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7af1364f
    • A
      delayed mntput · 9ea459e1
      Al Viro 提交于
      On final mntput() we want fs shutdown to happen before return to
      userland; however, the only case where we want it happen right
      there (i.e. where task_work_add won't do) is MNT_INTERNAL victim.
      Those have to be fully synchronous - failure halfway through module
      init might count on having vfsmount killed right there.  Fortunately,
      final mntput on MNT_INTERNAL vfsmounts happens on shallow stack.
      So we handle those synchronously and do an analog of delayed fput
      logics for everything else.
      
      As the result, we are guaranteed that fs shutdown will always happen
      on shallow stack.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      9ea459e1
    • A
      fs: Add a missing permission check to do_umount · a1480dcc
      Andy Lutomirski 提交于
      Accessing do_remount_sb should require global CAP_SYS_ADMIN, but
      only one of the two call sites was appropriately protected.
      
      Fixes CVE-2014-7975.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      a1480dcc
  4. 31 8月, 2014 2 次提交
    • A
      fix EBUSY on umount() from MNT_SHRINKABLE · 81b6b061
      Al Viro 提交于
      We need the parents of victims alive until namespace_unlock() gets to
      dput() of the (ex-)mountpoints.  However, that screws up the "is it
      busy" checks in case when we have shrinkable mounts that need to be
      killed.  Solution: go ahead and decrement refcounts of parents right
      in umount_tree(), increment them again just before dropping rwsem in
      namespace_unlock() (and let the loop in the end of namespace_unlock()
      finally drop those references for good, as we do now).  Parents can't
      get freed until we drop rwsem - at least one reference is kept until
      then, both in case when parent is among the victims and when it is
      not.  So they'll still be around when we get to namespace_unlock().
      
      Cc: stable@vger.kernel.org # 3.12+
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      81b6b061
    • A
      get rid of propagate_umount() mistakenly treating slaves as busy. · 88b368f2
      Al Viro 提交于
      The check in __propagate_umount() ("has somebody explicitly mounted
      something on that slave?") is done *before* taking the already doomed
      victims out of the child lists.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      88b368f2
  5. 12 8月, 2014 1 次提交
    • A
      fix copy_tree() regression · 12a5b529
      Al Viro 提交于
      Since 3.14 we had copy_tree() get the shadowing wrong - if we had one
      vfsmount shadowing another (i.e. if A is a slave of B, C is mounted
      on A/foo, then D got mounted on B/foo creating D' on A/foo shadowed
      by C), copy_tree() of A would make a copy of D' shadow the the copy of
      C, not the other way around.
      
      It's easy to fix, fortunately - just make sure that mount follows
      the one that shadows it in mnt_child as well as in mnt_hash, and when
      copy_tree() decides to attach a new mount, check if the last child
      it has added to the same parent should be shadowing the new one.
      And if it should, just use the same logics commit_tree() has - put the
      new mount into the hash and children lists right after the one that
      should shadow it.
      
      Cc: stable@vger.kernel.org [3.14 and later]
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      12a5b529
  6. 08 8月, 2014 3 次提交
    • A
      death to mnt_pinned · 3064c356
      Al Viro 提交于
      Rather than playing silly buggers with vfsmount refcounts, just have
      acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
      and replace it with said clone.  Then attach the pin to original
      vfsmount.  Voila - the clone will be alive until the file gets closed,
      making sure that underlying superblock remains active, etc., and
      we can drop the original vfsmount, so that it's not kept busy.
      If the file lives until the final mntput of the original vfsmount,
      we'll notice that there's an fs_pin (one in bsd_acct_struct that
      holds that file) and mnt_pin_kill() will take it out.  Since
      ->kill() is synchronous, we won't proceed past that point until
      these files are closed (and private clones of our vfsmount are
      gone), so we get the same ordering warranties we used to get.
      
      mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
      it never became usable outside of kernel/acct.c (and racy wrt
      umount even there).
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3064c356
    • A
      make fs/{namespace,super}.c forget about acct.h · 8fa1f1c2
      Al Viro 提交于
      These externs belong in fs/internal.h.  Rename (they are not acct-specific
      anymore) and move them over there.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8fa1f1c2
    • A
      acct: get rid of acct_list · 215752fc
      Al Viro 提交于
      Put these suckers on per-vfsmount and per-superblock lists instead.
      Note: right now it's still acct_lock for everything, but that's
      going to change.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      215752fc
  7. 07 8月, 2014 1 次提交
    • K
      list: fix order of arguments for hlist_add_after(_rcu) · 1d023284
      Ken Helias 提交于
      All other add functions for lists have the new item as first argument
      and the position where it is added as second argument.  This was changed
      for no good reason in this function and makes using it unnecessary
      confusing.
      
      The name was changed to hlist_add_behind() to cause unconverted code to
      generate a compile error instead of using the wrong parameter order.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKen Helias <kenhelias@firemail.de>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[intel driver bits]
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d023284
  8. 01 8月, 2014 4 次提交
    • E
      mnt: Change the default remount atime from relatime to the existing value · ffbc6f0e
      Eric W. Biederman 提交于
      Since March 2009 the kernel has treated the state that if no
      MS_..ATIME flags are passed then the kernel defaults to relatime.
      
      Defaulting to relatime instead of the existing atime state during a
      remount is silly, and causes problems in practice for people who don't
      specify any MS_...ATIME flags and to get the default filesystem atime
      setting.  Those users may encounter a permission error because the
      default atime setting does not work.
      
      A default that does not work and causes permission problems is
      ridiculous, so preserve the existing value to have a default
      atime setting that is always guaranteed to work.
      
      Using the default atime setting in this way is particularly
      interesting for applications built to run in restricted userspace
      environments without /proc mounted, as the existing atime mount
      options of a filesystem can not be read from /proc/mounts.
      
      In practice this fixes user space that uses the default atime
      setting on remount that are broken by the permission checks
      keeping less privileged users from changing more privileged users
      atime settings.
      
      Cc: stable@vger.kernel.org
      Acked-by: NSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      ffbc6f0e
    • E
      mnt: Correct permission checks in do_remount · 9566d674
      Eric W. Biederman 提交于
      While invesgiating the issue where in "mount --bind -oremount,ro ..."
      would result in later "mount --bind -oremount,rw" succeeding even if
      the mount started off locked I realized that there are several
      additional mount flags that should be locked and are not.
      
      In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
      flags in addition to MNT_READONLY should all be locked.  These
      flags are all per superblock, can all be changed with MS_BIND,
      and should not be changable if set by a more privileged user.
      
      The following additions to the current logic are added in this patch.
      - nosuid may not be clearable by a less privileged user.
      - nodev  may not be clearable by a less privielged user.
      - noexec may not be clearable by a less privileged user.
      - atime flags may not be changeable by a less privileged user.
      
      The logic with atime is that always setting atime on access is a
      global policy and backup software and auditing software could break if
      atime bits are not updated (when they are configured to be updated),
      and serious performance degradation could result (DOS attack) if atime
      updates happen when they have been explicitly disabled.  Therefore an
      unprivileged user should not be able to mess with the atime bits set
      by a more privileged user.
      
      The additional restrictions are implemented with the addition of
      MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
      mnt flags.
      
      Taken together these changes and the fixes for MNT_LOCK_READONLY
      should make it safe for an unprivileged user to create a user
      namespace and to call "mount --bind -o remount,... ..." without
      the danger of mount flags being changed maliciously.
      
      Cc: stable@vger.kernel.org
      Acked-by: NSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      9566d674
    • E
      mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount · 07b64558
      Eric W. Biederman 提交于
      There are no races as locked mount flags are guaranteed to never change.
      
      Moving the test into do_remount makes it more visible, and ensures all
      filesystem remounts pass the MNT_LOCK_READONLY permission check.  This
      second case is not an issue today as filesystem remounts are guarded
      by capable(CAP_DAC_ADMIN) and thus will always fail in less privileged
      mount namespaces, but it could become an issue in the future.
      
      Cc: stable@vger.kernel.org
      Acked-by: NSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      07b64558
    • E
      mnt: Only change user settable mount flags in remount · a6138db8
      Eric W. Biederman 提交于
      Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
      read-only bind mount read-only in a user namespace the
      MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
      to the remount a read-only mount read-write.
      
      Correct this by replacing the mask of mount flags to preserve
      with a mask of mount flags that may be changed, and preserve
      all others.   This ensures that any future bugs with this mask and
      remount will fail in an easy to detect way where new mount flags
      simply won't change.
      
      Cc: stable@vger.kernel.org
      Acked-by: NSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      a6138db8
  9. 30 7月, 2014 1 次提交
    • E
      namespaces: Use task_lock and not rcu to protect nsproxy · 728dba3a
      Eric W. Biederman 提交于
      The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
      a sufficiently expensive system call that people have complained.
      
      Upon inspect nsproxy no longer needs rcu protection for remote reads.
      remote reads are rare.  So optimize for same process reads and write
      by switching using rask_lock instead.
      
      This yields a simpler to understand lock, and a faster setns system call.
      
      In particular this fixes a performance regression observed
      by Rafael David Tinoco <rafael.tinoco@canonical.com>.
      
      This is effectively a revert of Pavel Emelyanov's commit
      cf7b708c Make access to task's nsproxy lighter
      from 2007.  The race this originialy fixed no longer exists as
      do_notify_parent uses task_active_pid_ns(parent) instead of
      parent->nsproxy.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      728dba3a
  10. 02 4月, 2014 4 次提交
    • D
      VFS: Make delayed_free() call free_vfsmnt() · 8ffcb32e
      David Howells 提交于
      Make delayed_free() call free_vfsmnt() so that we don't have two functions
      doing the same job.  This requires the calls to mnt_free_id() in free_vfsmnt()
      to be moved into the callers of that function.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8ffcb32e
    • A
      mark struct file that had write access grabbed by open() · 83f936c7
      Al Viro 提交于
      new flag in ->f_mode - FMODE_WRITER.  Set by do_dentry_open() in case
      when it has grabbed write access, checked by __fput() to decide whether
      it wants to drop the sucker.  Allows to stop bothering with mnt_clone_write()
      in alloc_file(), along with fewer special_file() checks.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      83f936c7
    • A
      reduce m_start() cost... · c7999c36
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c7999c36
    • A
      smarter propagate_mnt() · f2ebb3a9
      Al Viro 提交于
      The current mainline has copies propagated to *all* nodes, then
      tears down the copies we made for nodes that do not contain
      counterparts of the desired mountpoint.  That sets the right
      propagation graph for the copies (at teardown time we move
      the slaves of removed node to a surviving peer or directly
      to master), but we end up paying a fairly steep price in
      useless allocations.  It's fairly easy to create a situation
      where N calls of mount(2) create exactly N bindings, with
      O(N^2) vfsmounts allocated and freed in process.
      
      Fortunately, it is possible to avoid those allocations/freeings.
      The trick is to create copies in the right order and find which
      one would've eventually become a master with the current algorithm.
      It turns out to be possible in O(nodes getting propagation) time
      and with no extra allocations at all.
      
      One part is that we need to make sure that eventual master will be
      created before its slaves, so we need to walk the propagation
      tree in a different order - by peer groups.  And iterate through
      the peers before dealing with the next group.
      
      Another thing is finding the (earlier) copy that will be a master
      of one we are about to create; to do that we are (temporary) marking
      the masters of mountpoints we are attaching the copies to.
      
      Either we are in a peer of the last mountpoint we'd dealt with,
      or we have the following situation: we are attaching to mountpoint M,
      the last copy S_0 had been attached to M_0 and there are sequences
      S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
      S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
      such that M is getting propagation from M_{k}.  It means that the master
      of M_{k} will be among the sequence of masters of M.  On the
      other hand, the nearest marked node in that sequence will either
      be the master of M_{k} or the master of M_{k-1} (the latter -
      in the case if M_{k-1} is a slave of something M gets propagation
      from, but in a wrong peer group).
      
      So we go through the sequence of masters of M until we find
      a marked one (P).  Let N be the one before it.  Then we go through
      the sequence of masters of S_0 until we find one (say, S) mounted
      on a node D that has P as master and check if D is a peer of N.
      If it is, S will be the master of new copy, if not - the master of S
      will be.
      
      That's it for the hard part; the rest is fairly simple.  Iterator
      is in next_group(), handling of one prospective mountpoint is
      propagate_one().
      
      It seems to survive all tests and gives a noticably better performance
      than the current mainline for setups that are seriously using shared
      subtrees.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f2ebb3a9
  11. 31 3月, 2014 4 次提交
  12. 30 11月, 2013 1 次提交
    • T
      sysfs, kernfs: prepare mount path for kernfs · 4b93dc9b
      Tejun Heo 提交于
      We're in the process of separating out core sysfs functionality into
      kernfs which will deal with sysfs_dirents directly.  This patch
      rearranges mount path so that the kernfs and sysfs parts are separate.
      
      * As sysfs_super_info won't be visible outside kernfs proper,
        kernfs_super_ns() is added to allow kernfs users to access a
        super_block's namespace tag.
      
      * Generic mount operation is separated out into kernfs_mount_ns().
        sysfs_mount() now just performs sysfs-specific permission check,
        acquires namespace tag, and invokes kernfs_mount_ns().
      
      * Generic superblock release is separated out into kernfs_kill_sb()
        which can be used directly as file_system_type->kill_sb().  As sysfs
        needs to put the namespace tag, sysfs_kill_sb() wraps
        kernfs_kill_sb() with ns tag put.
      
      * sysfs_dir_cachep init and sysfs_inode_init() are separated out into
        kernfs_init().  kernfs_init() uses only small amount of memory and
        trying to handle and propagate kernfs_init() failure doesn't make
        much sense.  Use SLAB_PANIC for sysfs_dir_cachep and make
        sysfs_inode_init() panic on failure.
      
        After this change, kernfs_init() should be called before
        sysfs_init(), fs/namespace.c::mnt_init() modified accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4b93dc9b
  13. 27 11月, 2013 1 次提交
    • E
      vfs: Fix a regression in mounting proc · 41301ae7
      Eric W. Biederman 提交于
      Gao feng <gaofeng@cn.fujitsu.com> reported that commit
      e51db735
      userns: Better restrictions on when proc and sysfs can be mounted
      caused a regression on mounting a new instance of proc in a mount
      namespace created with user namespace privileges, when binfmt_misc
      is mounted on /proc/sys/fs/binfmt_misc.
      
      This is an unintended regression caused by the absolutely bogus empty
      directory check in fs_fully_visible.  The check fs_fully_visible replaced
      didn't even bother to attempt to verify proc was fully visible and
      hiding proc files with any kind of mount is rare.  So for now fix
      the userspace regression by allowing directory with nlink == 1
      as /proc/sys/fs/binfmt_misc has.
      
      I will have a better patch but it is not stable material, or
      last minute kernel material.  So it will have to wait.
      
      Cc: stable@vger.kernel.org
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NGao feng <gaofeng@cn.fujitsu.com>
      Tested-by: NGao feng <gaofeng@cn.fujitsu.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      41301ae7
  14. 09 11月, 2013 1 次提交
    • A
      RCU'd vfsmounts · 48a066e7
      Al Viro 提交于
      * RCU-delayed freeing of vfsmounts
      * vfsmount_lock replaced with a seqlock (mount_lock)
      * sequence number from mount_lock is stored in nameidata->m_seq and
      used when we exit RCU mode
      * new vfsmount flag - MNT_SYNC_UMOUNT.  Set by umount_tree() when its
      caller knows that vfsmount will have no surviving references.
      * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
      and doing pending mntput().
      * new helper: legitimize_mnt(mnt, seq).  Checks the mount_lock sequence
      number against seq, then grabs reference to mnt.  Then it rechecks mount_lock
      again to close the race and either returns success or drops the reference it
      has acquired.  The subtle point is that in case of MNT_SYNC_UMOUNT we can
      simply decrement the refcount and sod off - aforementioned synchronize_rcu()
      makes sure that final mntput() won't come until we leave RCU mode.  We need
      that, since we don't want to end up with some lazy pathwalk racing with
      umount() and stealing the final mntput() from it - caller of umount() may
      expect it to return only once the fs is shut down and we don't want to break
      that.  In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
      full-blown mntput() in case of mount_lock sequence number mismatch happening
      just as we'd grabbed the reference, but in those cases we won't be stealing
      the final mntput() from anything that would care.
      * mntput_no_expire() doesn't lock anything on the fast path now.  Incidentally,
      SMP and UP cases are handled the same way - no ifdefs there.
      * normal pathname resolution does *not* do any writes to mount_lock.  It does,
      of course, bump the refcounts of vfsmount and dentry in the very end, but that's
      it.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      48a066e7
  15. 25 10月, 2013 7 次提交