1. 16 Nov, 2015 1 commit
    • J
      locks: Allow disabling mandatory locking at compile time · 9e8925b6
      Jeff Layton committed
      Mandatory locking appears to be almost unused and buggy and there
      appears to be no real interest in doing anything with it.  Since effectively
      no one uses the code and since the code is buggy let's allow it to be
      disabled at compile time.  I would just suggest removing the code but
      undoubtedly that will break some piece of userspace code somewhere.
      
      For the distributions that don't care about this piece of code
      this gives a nice starting point to make mandatory locking go away.
      
      Cc: Benjamin Coddington <bcodding@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jeff Layton <jeff.layton@primarydata.com>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
  2. 24 Jul, 2015 1 commit
    • E
      mnt: In detach_mounts detach the appropriate unmounted mount · fe78fcc8
      Eric W. Biederman committed
      The handling in detach_mounts of unmounted but connected mounts is
      buggy and can lead to an infinite loop.
      
      Correct the handling of unmounted mounts in detach_mounts.  When the
      mountpoint of an unmounted but connected mount is connected to a
      dentry, and that dentry is deleted we need to disconnect that mount
      from the parent mount and the deleted dentry.
      
      Nothing changes for the unmounted and connected children.  They can be
      safely ignored.
      
      Cc: stable@vger.kernel.org
      Fixes: ce07d891 mnt: Honor MNT_LOCKED when detaching mounts
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  3. 23 Jul, 2015 1 commit
    • E
      mnt: Clarify and correct the disconnect logic in umount_tree · f2d0a123
      Eric W. Biederman committed
      rmdir mntpoint will result in an infinite loop when there is
      a mount locked on the mountpoint in another mount namespace.
      
      This is because the logic to test to see if a mount should
      be disconnected in umount_tree is buggy.
      
      Move the logic to decide if a mount should remain connected to
      its mountpoint into its own function disconnect_mount so that
      clarity of expression instead of terseness of expression becomes
      a virtue.
      
      When the conditions where it is invalid to leave a mount connected
      are first ruled out, the logic for deciding if a mount should
      be disconnected becomes much clearer and simpler.
      
      Fixes: e0c9c0af mnt: Update detach_mounts to leave mounts connected
      Fixes: ce07d891 mnt: Honor MNT_LOCKED when detaching mounts
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  4. 10 Jul, 2015 1 commit
    • E
      mnt: fs_fully_visible enforce noexec and nosuid if !SB_I_NOEXEC · 77b1a97d
      Eric W. Biederman committed
      The filesystems proc and sysfs do not have executable files today,
      and portions of userspace break if we enforce consistency of the
      nosuid and noexec flags between previous mounts and new mounts of
      proc and sysfs.
      
      Add the code to enforce consistency of the nosuid and noexec flags,
      and use the presence of SB_I_NOEXEC to signal that there is no need to
      bother.
      
      This results in a completely userspace invisible change that makes it
      clear fs_fully_visible can only skip the enforcement of noexec and
      nosuid because it is known the filesystems in question do not support
      executables.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  5. 01 Jul, 2015 3 commits
    • E
      mnt: Update fs_fully_visible to test for permanently empty directories · 7236c85e
      Eric W. Biederman committed
      fs_fully_visible attempts to make fresh mounts of proc and sysfs give
      the mounter no more access to proc and sysfs than they could have
      obtained by creating a bind mount.  One aspect of proc and sysfs that makes
      this particularly tricky is that there are other filesystems that
      typically mount on top of proc and sysfs.  As those filesystems are
      mounted on empty directories in practice it is safe to ignore them.
      However testing to ensure filesystems are mounted on empty directories
      has not been something the in kernel data structures have supported so
      the current test for an empty directory which checks to see
      if nlink <= 2 is a bit lacking.
      
      proc and sysfs have recently been modified to use the new empty_dir
      infrastructure to create all of their dedicated mount points.  Instead
      of testing for S_ISDIR(inode->i_mode) && i_nlink <= 2 to see if a
      directory is empty, test for is_empty_dir_inode(inode).  That small
      change guarantees mounts found on proc and sysfs really are safe to
      ignore, because the directories are not only empty but nothing can
      ever be added to them.  This guarantees there is nothing to worry
      about when mounting proc and sysfs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
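The difference between the two emptiness tests described in 7236c85e can be sketched in userspace C. This is only an illustration under stated assumptions: `struct sketch_inode`, `old_empty_test` and `new_empty_test` are hypothetical names, and the `permanently_empty` field merely stands in for whatever is_empty_dir_inode() reports in the kernel.

```c
#include <stdbool.h>

/* Hypothetical, simplified inode holding only what this sketch needs. */
struct sketch_inode {
	bool is_dir;
	unsigned int nlink;
	bool permanently_empty; /* stand-in for the result of is_empty_dir_inode() */
};

/* Old heuristic: a directory with a link count of at most 2 is assumed empty. */
bool old_empty_test(const struct sketch_inode *inode)
{
	return inode->is_dir && inode->nlink <= 2;
}

/* New test: only directories marked as permanently empty qualify. */
bool new_empty_test(const struct sketch_inode *inode)
{
	return inode->permanently_empty;
}
```

A directory that holds only regular files can still report a link count of 2 on many filesystems, so the old heuristic would accept it while the stricter test would not.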
    • E
      vfs: Ignore unlocked mounts in fs_fully_visible · ceeb0e5d
      Eric W. Biederman committed
      Limit the mounts fs_fully_visible considers to locked mounts.
      Unlocked mounts can always be unmounted so considering them adds hassle
      but no security benefit.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • Y
      fs: use seq_open_private() for proc_mounts · ede1bf0d
      Yann Droneaud committed
      A patchset to remove support for passing pre-allocated struct seq_file to
      seq_open().  Such a feature is undocumented and prone to error.
      
      In particular, if seq_release() is used in release handler, it will
      kfree() a pointer which was not allocated by seq_open().
      
      So this patchset drops support for pre-allocated struct seq_file: it's
      only of use in proc_namespace.c and can be easily replaced by using
      seq_open_private()/seq_release_private().
      
      Additionally, it documents the use of file->private_data to hold pointer
      to struct seq_file by seq_open().
      
      This patch (of 3):
      
      Since the patch described below, from v2.6.15-rc1, seq_open() could use a
      struct seq_file already allocated by the caller if the pointer to the
      structure is stored in file->private_data before calling the function.
      
          Commit 1abe77b0
          Author: Al Viro <viro@zeniv.linux.org.uk>
          Date:   Mon Nov 7 17:15:34 2005 -0500
      
              [PATCH] allow callers of seq_open do allocation themselves
      
              Allow caller of seq_open() to kmalloc() seq_file + whatever else they
              want and set ->private_data to it.  seq_open() will then abstain from
              doing allocation itself.
      
      Such behavior is only used by mounts_open_common().
      
      In order to drop support for such an uncommon feature, proc_mounts is
      converted to use seq_open_private(), which takes care of allocating the
      proc_mounts structure, making it available through ->private in struct
      seq_file.
      
      Conversely, proc_mounts is converted to use seq_release_private(), in
      order to release the private structure allocated by seq_open_private().
      
      Then, ->private is used directly instead of the proc_mounts() macro to
      access the proc_mounts structure.
      
      Link: http://lkml.kernel.org/r/cover.1433193673.git.ydroneaud@opteya.com
      Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 04 Jun, 2015 1 commit
    • E
      mnt: Modify fs_fully_visible to deal with locked ro nodev and atime · 8c6cf9cc
      Eric W. Biederman committed
      Ignore an existing mount if the locked readonly, nodev or atime
      attributes are less permissive than the desired attributes
      of the new mount.
      
      On success ensure the new mount locks all of the same readonly, nodev and
      atime attributes as the old mount.
      
      The nosuid and noexec attributes are not checked here as this change
      is destined for stable and enforcing those attributes causes a
      regression in lxc and libvirt-lxc where those applications will not
      start and there are no known executables on sysfs or proc and no known
      way to create executables without code modifications.
      
      Cc: stable@vger.kernel.org
      Fixes: e51db735 ("userns: Better restrictions on when proc and sysfs can be mounted")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
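The rule in 8c6cf9cc (skip an existing mount whose locked attributes are stricter than what the new mount asks for, then carry the lock bits over) can be sketched as follows. The flag names and values here are hypothetical and chosen for illustration, not the kernel's definitions, and the atime handling is left out to keep the sketch short.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define SK_READONLY       0x01u
#define SK_NODEV          0x02u
#define SK_LOCK_READONLY  0x10u
#define SK_LOCK_NODEV     0x20u

/*
 * True if an existing mount with old_flags may vouch for a new mount that
 * requests new_flags: every locked restriction must also be requested.
 */
bool existing_mount_counts(unsigned int old_flags, unsigned int new_flags)
{
	if ((old_flags & SK_LOCK_READONLY) && !(new_flags & SK_READONLY))
		return false;
	if ((old_flags & SK_LOCK_NODEV) && !(new_flags & SK_NODEV))
		return false;
	return true;
}

/* On success, the new mount keeps the same lock bits as the old mount. */
unsigned int inherit_locks(unsigned int old_flags, unsigned int new_flags)
{
	return new_flags | (old_flags & (SK_LOCK_READONLY | SK_LOCK_NODEV));
}
```

With these assumptions, a locked read-only mount is ignored when the caller requests a read-write mount, and counts when the caller also requests read-only.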
  7. 14 May, 2015 1 commit
    • E
      mnt: Refactor the logic for mounting sysfs and proc in a user namespace · 1b852bce
      Eric W. Biederman committed
      Fresh mounts of proc and sysfs are a very special case that works very
      much like a bind mount.  Unfortunately the current structure can not
      preserve the MNT_LOCK... mount flags.  Therefore refactor the logic
      into a form that can be modified to preserve those lock bits.
      
      Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
      of the filesystem be fully visible in the current mount namespace,
      before the filesystem may be mounted.
      
      Move the logic for calling fs_fully_visible from proc and sysfs into
      fs/namespace.c where it has greater access to mount namespace state.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  8. 11 May, 2015 1 commit
    • A
      new helper: __legitimize_mnt() · 294d71ff
      Al Viro committed
      same as legitimize_mnt(), except that it does *not* drop and regain
      rcu_read_lock; return values are
      0  =>  grabbed a reference, we are fine
      1  =>  failed, just go away
      -1 =>  failed, go away and mntput(bastard) when outside of rcu_read_lock
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
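The three return values listed in 294d71ff suggest a caller pattern along these lines. This is a hedged userspace sketch: the enum names, `struct legit_outcome` and `interpret_legitimize` are hypothetical, and only the numeric values and their meanings come from the commit message.

```c
#include <stdbool.h>

/* Numeric values as described in the commit message; names are made up. */
enum {
	LEGIT_OK = 0,        /* grabbed a reference, we are fine */
	LEGIT_FAIL = 1,      /* failed, just go away */
	LEGIT_FAIL_PUT = -1  /* failed; mntput() once outside rcu_read_lock */
};

struct legit_outcome {
	bool have_ref;      /* the caller now owns a reference */
	bool deferred_put;  /* the caller must drop a reference after leaving RCU */
};

/* Translate a return code into the actions the caller still has to take. */
struct legit_outcome interpret_legitimize(int res)
{
	struct legit_outcome out = { false, false };

	if (res == LEGIT_OK)
		out.have_ref = true;
	else if (res == LEGIT_FAIL_PUT)
		out.deferred_put = true;
	return out;
}
```

The point of the -1 case is that the cleanup cannot happen where the failure is detected, so the caller has to remember it until the RCU read-side section has ended.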
  9. 10 May, 2015 1 commit
  10. 10 Apr, 2015 6 commits
    • E
      mnt: Update detach_mounts to leave mounts connected · e0c9c0af
      Eric W. Biederman committed
      Now that it is possible to lazily unmount an entire mount tree and
      leave the individual mounts connected to each other add a new flag
      UMOUNT_CONNECTED to umount_tree to force this behavior and use
      this flag in detach_mounts.
      
      This closes a bug where the deletion of a file or directory could
      trigger an unmount and reveal data under a mount point.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: Fix the error check in __detach_mounts · f53e5797
      Eric W. Biederman committed
      lookup_mountpoint can return either NULL or an error value.
      Update the test in __detach_mounts to test for an error value
      to avoid pathological cases causing NULL pointer dereferences.
      
      The callers of __detach_mounts should prevent it from ever being
      called on an unlinked dentry but don't take any chances.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
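The hazard f53e5797 addresses, a lookup that may return NULL or an encoded error, can be illustrated in userspace. The helpers below are simplified re-creations of the kernel's ERR_PTR(), IS_ERR() and IS_ERR_OR_NULL(); `lookup_stub`, `null_only_check_passes` and `result_is_usable` are hypothetical and do not reproduce the actual patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's pointer-error helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

static int dummy_object;

/* Hypothetical lookup: mode 0 = found, 1 = not found (NULL), 2 = error. */
static void *lookup_stub(int mode)
{
	if (mode == 1)
		return NULL;
	if (mode == 2)
		return ERR_PTR(-ENOMEM);
	return &dummy_object;
}

/* A NULL-only test lets an encoded error through as if it were valid. */
int null_only_check_passes(int mode)
{
	return lookup_stub(mode) != NULL;
}

/* Returns 1 only when the result is safe to dereference. */
int result_is_usable(int mode)
{
	return !IS_ERR_OR_NULL(lookup_stub(mode));
}
```

In this sketch the error case slips past the NULL-only test but is rejected by the combined test, which is the class of mistake the commit message describes.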
    • E
      mnt: Honor MNT_LOCKED when detaching mounts · ce07d891
      Eric W. Biederman committed
      Modify umount(MNT_DETACH) to keep mounts in the hash table that are
      locked to their parent mounts, when the parent is lazily unmounted.
      
      In mntput_no_expire detach the children from the hash table, depending
      on mnt_pin_kill in cleanup_mnt to decrement the mnt_count of the children.
      
      In __detach_mounts if there are any mounts that have been unmounted
      but still are on the list of mounts of a mountpoint, remove their
      children from the mount hash table and add those children to the unmounted
      list so they won't linger potentially indefinitely waiting for their
      final mntput, now that the mounts serve no purpose.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: Factor umount_mnt from umount_tree · 6a46c573
      Eric W. Biederman committed
      For future use factor out a function umount_mnt from umount_tree.
      This function unhashes a mount and remembers where the mount
      was mounted so that eventually when the code makes it to a
      sleeping context the mountpoint can be dput.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: Factor out unhash_mnt from detach_mnt and umount_tree · 7bdb11de
      Eric W. Biederman committed
      Create a function unhash_mnt that contains the common code between
      detach_mnt and umount_tree, and use unhash_mnt in place of the common
      code.  This adds an unnecessary list_del_init(mnt->mnt_child) into
      umount_tree but given that mnt_child is already empty this extra
      line is a noop.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: Fail collect_mounts when applied to unmounted mounts · cd4a4017
      Eric W. Biederman committed
      The only users of collect_mounts are in audit_tree.c
      
      In audit_trim_trees and audit_add_tree_rule the path passed into
      collect_mounts is generated from kern_path passed an audit_tree
      pathname which is guaranteed to be an absolute path.   In those cases
      collect_mounts is obviously intended to work on mounted paths and
      if a race results in paths that are unmounted when collect_mounts
      is called it is reasonable to fail early.
      
      The paths passed into audit_tag_tree don't have the absolute path
      check.  But are used to play with fsnotify and otherwise interact with
      the audit_trees, so again operating only on mounted paths appears
      reasonable.
      
      Avoid having to worry about what happens when we try and audit
      unmounted filesystems by restricting collect_mounts to mounts
      that appear in the mount tree.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  11. 03 Apr, 2015 7 commits
    • E
      mnt: On an unmount propagate clearing of MNT_LOCKED · 5d88457e
      Eric W. Biederman committed
      A prerequisite of calling umount_tree is that the point where the tree
      is mounted at is valid to unmount.
      
      If we are propagating the effect of the unmount clear MNT_LOCKED in
      every instance where the same filesystem is mounted on the same
      mountpoint in the mount tree, as we know (by virtue of the fact
      that umount_tree was called) that it is safe to reveal what
      is at that mountpoint.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: Delay removal from the mount hash. · 411a938b
      Eric W. Biederman committed
      - Modify __lookup_mnt_hash_last to ignore mounts that have MNT_UMOUNT set.
      - Don't remove mounts from the mount hash table in propagate_umount
      - Don't remove mounts from the mount hash table in umount_tree before
        the entire list of mounts to be umounted is selected.
      - Remove mounts from the mount hash table as the last thing that
        happens in the case where a mount has a parent in umount_tree.
        Mounts without parents are not hashed (by definition).
      
      This paves the way for delaying removal from the mount hash table even
      farther and fixing the MNT_LOCKED vs MNT_DETACH issue.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: Add MNT_UMOUNT flag · 590ce4bc
      Eric W. Biederman committed
      In some instances it is necessary to know if the unmounting
      process has begun on a mount.  Add MNT_UMOUNT to make that reliably
      testable.
      
      This fix gets used in fixing locked mounts in MNT_DETACH
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: In umount_tree reuse mnt_list instead of mnt_hash · c003b26f
      Eric W. Biederman committed
      umount_tree builds a list of mounts that need to be unmounted.
      Utilize mnt_list for this purpose instead of mnt_hash.  This begins to
      allow keeping a mount on the mnt_hash after it is unmounted, which is
      necessary for a properly functioning MNT_LOCKED implementation.
      
      The fact that mnt_list is an ordinary list, making list_move available,
      is a nice bonus.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: Don't propagate umounts in __detach_mounts · 8318e667
      Eric W. Biederman committed
      Invoking mount propagation from __detach_mounts is inefficient and
      wrong.
      
      It is inefficient because __detach_mounts already walks the list of
      mounts where something needs to be done, and mount propagation
      walks some subset of those mounts again.
      
      It is actively wrong because if the dentry that is passed to
      __detach_mounts is not part of the path to a mount that mount should
      not be affected.
      
      change_mnt_propagation(p,MS_PRIVATE) modifies the mount propagation
      tree of a master mount so its slaves are connected to another master
      if possible.  Which means even removing a mount from the middle of a
      mount tree with __detach_mounts will not deprive any mount of propagated
      mount events.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: Improve the umount_tree flags · e819f152
      Eric W. Biederman committed
      - Remove the unneeded declaration from pnode.h
      - Mark umount_tree static as it has no callers outside of namespace.c
      - Define an enumeration of umount_tree's flags.
      - Pass umount_tree's flags in by name
      
      This removes the magic numbers 0, 1 and 2 making the code a little
      clearer and makes it possible for there to be lazy unmounts that don't
      propagate.  Which is what __detach_mounts actually wants for example.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
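Replacing magic numbers with a named flag enumeration, as e819f152 describes, might look roughly like the sketch below. The enumerator names and values are hypothetical (the commit message does not list them); the sketch only shows why independent bits make "lazy and non-propagating" expressible.

```c
#include <stdbool.h>

/* Hypothetical names and values; the commit message does not spell them out. */
enum umount_how {
	UMOUNT_HOW_SYNC      = 1 << 0, /* unmount synchronously rather than lazily */
	UMOUNT_HOW_PROPAGATE = 1 << 1  /* also propagate the unmount to peers */
};

/* True if the caller asked for the unmount to be propagated. */
bool wants_propagation(unsigned int how)
{
	return how & UMOUNT_HOW_PROPAGATE;
}

/* True if the unmount is lazy, meaning the sync bit is not set. */
bool is_lazy(unsigned int how)
{
	return !(how & UMOUNT_HOW_SYNC);
}
```

Passing 0 under these assumptions requests a lazy unmount that does not propagate, the combination the commit message says __detach_mounts wants.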
    • E
      mnt: Use hlist_move_list in namespace_unlock · a3b3c562
      Eric W. Biederman committed
      Small cleanup to make the code more readable and maintainable.
      Signed-off-by: Eric Biederman <ebiederm@xmission.com>
  12. 23 Feb, 2015 1 commit
    • D
      VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) · e36cb0b8
      David Howells committed
      Convert the following where appropriate:
      
       (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).
      
       (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).
      
       (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
           complicated than it appears as some calls should be converted to
           d_can_lookup() instead.  The difference is whether the directory in
           question is a real dir with a ->lookup op or whether it's a fake dir with
           a ->d_automount op.
      
      In some circumstances, we can subsume checks for dentry->d_inode not being
      NULL into this, provided the code isn't in a filesystem that expects
      d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
      use d_inode() rather than d_backing_inode() to get the inode pointer).
      
      Note that the dentry type field may be set to something other than
      DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
      manages the fall-through from a negative dentry to a lower layer.  In such a
      case, the dentry type of the negative union dentry is set to the same as the
      type of the lower dentry.
      
      However, if you know d_inode is not NULL at the call site, then you can use
      the d_is_xxx() functions even in a filesystem.
      
      There is one further complication: a 0,0 chardev dentry may be labelled
      DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
      intended for special directory entry types that don't have attached inodes.
      
      The following perl+coccinelle script was used:
      
      use strict;
      
      my ($fd, @callers);
      open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
          die "Can't grep for S_ISDIR and co. callers";
      @callers = <$fd>;
      close($fd);
      unless (@callers) {
          print "No matches\n";
          exit(0);
      }
      
      my @cocci = (
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISLNK(E->d_inode->i_mode)',
          '+ d_is_symlink(E)',
          '',
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISDIR(E->d_inode->i_mode)',
          '+ d_is_dir(E)',
          '',
          '@@',
          'expression E;',
          '@@',
          '',
          '- S_ISREG(E->d_inode->i_mode)',
          '+ d_is_reg(E)' );
      
      my $coccifile = "tmp.sp.cocci";
      open($fd, ">$coccifile") || die $coccifile;
      print($fd "$_\n") || die $coccifile foreach (@cocci);
      close($fd);
      
      foreach my $file (@callers) {
          chomp $file;
          print "Processing ", $file, "\n";
          system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
      	die "spatch failed";
      }
      
      [AV: overlayfs parts skipped]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  13. 14 Feb, 2015 1 commit
  14. 26 Jan, 2015 1 commit
  15. 19 Dec, 2014 1 commit
    • E
      mnt: Fix a memory stomp in umount · c297abfd
      Eric W. Biederman committed
      While reviewing the code of umount_tree I realized that when we append
      to a preexisting unmounted list we do not change pprev of the former
      first item in the list.
      
      Which means later in namespace_unlock hlist_del_init(&mnt->mnt_hash) on
      the former first item of the list will stomp unmounted.first leaving
      it set to some random mount point which we are likely to free soon.
      
      This isn't likely to hit, but if it does I don't know how anyone could
      track it down.
      
      [ This happened because we don't have all the same operations for
        hlist's as we do for normal doubly-linked lists. In particular,
        list_splice() is easy on our standard doubly-linked lists, while
        hlist_splice() doesn't exist and needs both start/end entries of the
        hlist.  And commit 38129a13 incorrectly open-coded that missing
        hlist_splice().
      
        We should think about making these kinds of "mindless" conversions
        easier to get right by adding the missing hlist helpers   - Linus ]
      
      Fixes: 38129a13 switch mnt_hash to hlist
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
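The stomp described in c297abfd can be reproduced with a stripped-down userspace re-creation of hlist (the kernel's real helpers live in include/linux/list.h). `splice_front` and `head_survives_delete` are hypothetical names for this sketch; the `fix_pprev` switch selects between the buggy open-coded splice and a corrected one.

```c
#include <stddef.h>

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

static void hlist_del_init(struct hlist_node *n)
{
	if (!n->pprev)
		return;
	*n->pprev = n->next;	/* writes through whatever pprev points at */
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
}

/* Put the chain first..last in front of dst's existing entries. */
void splice_front(struct hlist_node *first, struct hlist_node *last,
		  struct hlist_head *dst, int fix_pprev)
{
	last->next = dst->first;
	if (dst->first && fix_pprev)
		dst->first->pprev = &last->next;	/* the step the bug omitted */
	dst->first = first;
	first->pprev = &dst->first;
}

/* Returns 1 if the head still points at the spliced node after the delete. */
int head_survives_delete(int fix_pprev)
{
	struct hlist_head unmounted = { NULL };
	struct hlist_node old_first, fresh = { NULL, NULL };

	hlist_add_head(&old_first, &unmounted);
	splice_front(&fresh, &fresh, &unmounted, fix_pprev);
	hlist_del_init(&old_first);	/* stomps unmounted.first if pprev is stale */
	return unmounted.first == &fresh;
}
```

Without the pprev update, deleting the former first entry writes through a stale pointer to the list head, so the head no longer points at the spliced-in node.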
  16. 11 Dec, 2014 1 commit
    • A
      take the targets of /proc/*/ns/* symlinks to separate fs · e149ed2b
      Al Viro committed
      New pseudo-filesystem: nsfs.  Targets of /proc/*/ns/* live there now.
      It's not mountable (not even registered, so it's not in /proc/filesystems,
      etc.).  Files on it *are* bindable - we explicitly permit that in do_loopback().
      
      This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
      get_proc_ns() is a macro now (it's simply returning ->i_private; would
      have been an inline, if not for header ordering headache).
      proc_ns_inode() is an ex-parrot.  The interface used in procfs is
      ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).
      
      Dentries and inodes are never hashed; a non-counting reference to dentry
      is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
      if present.  See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
      of that mechanism.
      
      As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
      it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
      from ns_get_path().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  17. 05 Dec, 2014 6 commits
  18. 03 Dec, 2014 5 commits
    • E
      mnt: Clear mnt_expire during pivot_root · 4fed655c
      Eric W. Biederman committed
      When inspecting the pivot_root and the current mount expiry logic I
      realized that pivot_root fails to clear mnt_expire like mount move does.
      
      Add the missing line in case someone does the interesting feat of
      moving an expirable submount.  This gives a strong guarantee that root
      of the filesystem tree will never expire.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: Carefully set CL_UNPRIVILEGED in clone_mnt · 381cacb1
      Eric W. Biederman committed
      old->mnt_expiry should be ignored unless CL_EXPIRE is set.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers. · 8486a788
      Eric W. Biederman committed
      Clear MNT_LOCKED in the callers of copy_tree except copy_mnt_ns, and
      collect_mounts.  In copy_mnt_ns it is necessary to create an exact
      copy of a mount tree, so not clearing MNT_LOCKED is important.
      Similarly collect_mounts is used to take a snapshot of the mount tree
      for audit logging purposes and auditing using a faithful copy of the
      tree is important.
      
      This becomes particularly significant when we start setting MNT_LOCKED
      on rootfs to prevent it from being unmounted.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      umount: Do not allow unmounting rootfs. · da362b09
      Eric W. Biederman committed
      Andrew Vagin <avagin@parallels.com> writes:
      
      > #define _GNU_SOURCE
      > #include <sys/types.h>
      > #include <sys/stat.h>
      > #include <fcntl.h>
      > #include <sched.h>
      > #include <unistd.h>
      > #include <sys/mount.h>
      >
      > int main(int argc, char **argv)
      > {
      > 	int fd;
      >
      > 	fd = open("/proc/self/ns/mnt", O_RDONLY);
      > 	if (fd < 0)
      > 		return 1;
      > 	while (1) {
      > 		if (umount2("/", MNT_DETACH) ||
      > 		    setns(fd, CLONE_NEWNS))
      > 			break;
      > 	}
      >
      > 	return 0;
      > }
      >
      > root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter
      > root@ubuntu:/home/avagin# strace ./nsenter
      > execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0
      > ...
      > open("/proc/self/ns/mnt", O_RDONLY)     = 3
      > umount("/", MNT_DETACH)                 = 0
      > setns(3, 131072)                        = 0
      > umount("/", MNT_DETACH
      >
      causes:
      
      > [  260.548301] ------------[ cut here ]------------
      > [  260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372!
      > [  260.552068] invalid opcode: 0000 [#1] SMP
      > [  260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy
      > [  260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu
      > [  260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      > [  260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000
      > [  260.552068] RIP: 0010:[<ffffffff811e9483>]  [<ffffffff811e9483>] propagate_umount+0x123/0x130
      > [  260.552068] RSP: 0018:ffff880074825e98  EFLAGS: 00010246
      > [  260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190
      > [  260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0
      > [  260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0
      > [  260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140
      > [  260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000
      > [  260.552068] FS:  00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
      > [  260.552068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > [  260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0
      > [  260.552068] Stack:
      > [  260.552068]  0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8
      > [  260.552068]  ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160
      > [  260.552068]  ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000
      > [  260.552068] Call Trace:
      > [  260.552068]  [<ffffffff811dcfac>] umount_tree+0x20c/0x260
      > [  260.552068]  [<ffffffff811dd12b>] do_umount+0x12b/0x300
      > [  260.552068]  [<ffffffff811cc642>] ? final_putname+0x22/0x50
      > [  260.552068]  [<ffffffff811cc849>] ? putname+0x29/0x40
      > [  260.552068]  [<ffffffff811dd88c>] SyS_umount+0xdc/0x100
      > [  260.552068]  [<ffffffff8172aeff>] tracesys+0xe1/0xe6
      > [  260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
      > [  260.552068] RIP  [<ffffffff811e9483>] propagate_umount+0x123/0x130
      > [  260.552068]  RSP <ffff880074825e98>
      > [  260.611451] ---[ end trace 11c33d85f1d4c652 ]--
      
      Which in practice is totally uninteresting.  Only the global root user can
      do it, and it is just a stupid thing to do.
      
      However that is no excuse to allow a silly way to oops the kernel.
      
      We can avoid this silly problem by setting MNT_LOCKED on the rootfs
      mount point and thus avoid needing any special cases in the unmount
      code.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      umount: Disallow unprivileged mount force · b2f5d4dc
      Eric W. Biederman committed
      Forced unmount affects not just the mount namespace but the underlying
      superblock as well.  Restrict forced unmount to the global root user
      for now.  Otherwise it becomes possible for a user in a less privileged
      mount namespace to force the shutdown of a superblock of a filesystem
      in a more privileged mount namespace, allowing a DOS attack on root.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>