1. 07 7月, 2017 1 次提交
    • P
      mm: update callers to use HASH_ZERO flag · 3d375d78
      Pavel Tatashin 提交于
      Update dcache, inode, pid, mountpoint, and mount hash tables to use
      HASH_ZERO, and remove initialization after allocations.  In case of
      places where HASH_EARLY was used such as in __pv_init_lock_hash the
      zeroed hash table was already assumed, because memblock zeroes the
      memory.
      
      CPU: SPARC M6, Memory: 7T
      Before fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 11.798s
      
      After fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 3.198s
      
      CPU: Intel Xeon E5-2630, Memory: 2.2T:
      Before fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.245s
      
      After fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.244s
      
      Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NBabu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d375d78
  2. 15 6月, 2017 1 次提交
  3. 23 5月, 2017 2 次提交
    • E
      mnt: In propgate_umount handle visiting mounts in any order · 99b19d16
      Eric W. Biederman 提交于
      While investigating some poor umount performance I realized that in
      the case of overlapping mount trees where some of the mounts are locked
      the code has been failing to unmount all of the mounts it should
      have been unmounting.
      
      This failure to unmount all of the necessary
      mounts can be reproduced with:
      
      $ cat locked_mounts_test.sh
      
      mount -t tmpfs test-base /mnt
      mount --make-shared /mnt
      mkdir -p /mnt/b
      
      mount -t tmpfs test1 /mnt/b
      mount --make-shared /mnt/b
      mkdir -p /mnt/b/10
      
      mount -t tmpfs test2 /mnt/b/10
      mount --make-shared /mnt/b/10
      mkdir -p /mnt/b/10/20
      
      mount --rbind /mnt/b /mnt/b/10/20
      
      unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
      sleep 1
      umount -l /mnt/b
      wait %%
      
      $ unshare -Urm ./locked_mounts_test.sh
      
      This failure is corrected by removing the prepass that marks mounts
      that may be umounted.
      
      A first pass is added that umounts mounts if possible and if not sets
      mount mark if they could be unmounted if they weren't locked and adds
      them to a list to umount possibilities.  This first pass reconsiders
      the mounts parent if it is on the list of umount possibilities, ensuring
      that information of umoutability will pass from child to mount parent.
      
      A second pass then walks through all mounts that are umounted and processes
      their children unmounting them or marking them for reparenting.
      
      A last pass cleans up the state on the mounts that could not be umounted
      and if applicable reparents them to their first parent that remained
      mounted.
      
      While a bit longer than the old code this code is much more robust
      as it allows information to flow up from the leaves and down
      from the trunk making the order in which mounts are encountered
      in the umount propgation tree irrelevant.
      
      Cc: stable@vger.kernel.org
      Fixes: 0c56fe31 ("mnt: Don't propagate unmounts to locked mounts")
      Reviewed-by: NAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      99b19d16
    • E
      mnt: In umount propagation reparent in a separate pass · 570487d3
      Eric W. Biederman 提交于
      It was observed that in some pathlogical cases that the current code
      does not unmount everything it should.  After investigation it
      was determined that the issue is that mnt_change_mntpoint can
      can change which mounts are available to be unmounted during mount
      propagation which is wrong.
      
      The trivial reproducer is:
      $ cat ./pathological.sh
      
      mount -t tmpfs test-base /mnt
      cd /mnt
      mkdir 1 2 1/1
      mount --bind 1 1
      mount --make-shared 1
      mount --bind 1 2
      mount --bind 1/1 1/1
      mount --bind 1/1 1/1
      echo
      grep test-base /proc/self/mountinfo
      umount 1/1
      echo
      grep test-base /proc/self/mountinfo
      
      $ unshare -Urm ./pathological.sh
      
      The expected output looks like:
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      The output without the fix looks like:
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
      47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
      
      That last mount in the output was in the propgation tree to be unmounted but
      was missed because the mnt_change_mountpoint changed it's parent before the walk
      through the mount propagation tree observed it.
      
      Cc: stable@vger.kernel.org
      Fixes: 1064f874 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
      Acked-by: NAndrei Vagin <avagin@virtuozzo.com>
      Reviewed-by: NRam Pai <linuxram@us.ibm.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      570487d3
  4. 22 4月, 2017 1 次提交
  5. 10 4月, 2017 2 次提交
    • J
      fsnotify: Free fsnotify_mark_connector when there is no mark attached · 08991e83
      Jan Kara 提交于
      Currently we free fsnotify_mark_connector structure only when inode /
      vfsmount is getting freed. This can however impose noticeable memory
      overhead when marks get attached to inodes only temporarily. So free the
      connector structure once the last mark is detached from the object.
      Since notification infrastructure can be working with the connector
      under the protection of fsnotify_mark_srcu, we have to be careful and
      free the fsnotify_mark_connector only after SRCU period passes.
      Reviewed-by: NMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      08991e83
    • J
      fsnotify: Move mark list head from object into dedicated structure · 9dd813c1
      Jan Kara 提交于
      Currently notification marks are attached to object (inode or vfsmnt) by
      a hlist_head in the object. The list is also protected by a spinlock in
      the object. So while there is any mark attached to the list of marks,
      the object must be pinned in memory (and thus e.g. last iput() deleting
      inode cannot happen). Also for list iteration in fsnotify() to work, we
      must hold fsnotify_mark_srcu lock so that mark itself and
      mark->obj_list.next cannot get freed. Thus we are required to wait for
      response to fanotify events from userspace process with
      fsnotify_mark_srcu lock held. That causes issues when userspace process
      is buggy and does not reply to some event - basically the whole
      notification subsystem gets eventually stuck.
      
      So to be able to drop fsnotify_mark_srcu lock while waiting for
      response, we have to pin the mark in memory and make sure it stays in
      the object list (as removing the mark waiting for response could lead to
      lost notification events for groups later in the list). However we don't
      want inode reclaim to block on such mark as that would lead to system
      just locking up elsewhere.
      
      This commit is the first in the series that paves way towards solving
      these conflicting lifetime needs. Instead of anchoring the list of marks
      directly in the object, we anchor it in a dedicated structure
      (fsnotify_mark_connector) and just point to that structure from the
      object. The following commits will also add spinlock protecting the list
      and object pointer to the structure.
      Reviewed-by: NMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      9dd813c1
  6. 02 3月, 2017 2 次提交
  7. 03 2月, 2017 1 次提交
    • E
      mnt: Tuck mounts under others instead of creating shadow/side mounts. · 1064f874
      Eric W. Biederman 提交于
      Ever since mount propagation was introduced in cases where a mount in
      propagated to parent mount mountpoint pair that is already in use the
      code has placed the new mount behind the old mount in the mount hash
      table.
      
      This implementation detail is problematic as it allows creating
      arbitrary length mount hash chains.
      
      Furthermore it invalidates the constraint maintained elsewhere in the
      mount code that a parent mount and a mountpoint pair will have exactly
      one mount upon them.  Making it hard to deal with and to talk about
      this special case in the mount code.
      
      Modify mount propagation to notice when there is already a mount at
      the parent mount and mountpoint where a new mount is propagating to
      and place that preexisting mount on top of the new mount.
      
      Modify unmount propagation to notice when a mount that is being
      unmounted has another mount on top of it (and no other children), and
      to replace the unmounted mount with the mount on top of it.
      
      Move the MNT_UMUONT test from __lookup_mnt_last into
      __propagate_umount as that is the only call of __lookup_mnt_last where
      MNT_UMOUNT may be set on any mount visible in the mount hash table.
      
      These modifications allow:
       - __lookup_mnt_last to be removed.
       - attach_shadows to be renamed __attach_mnt and its shadow
         handling to be removed.
       - commit_tree to be simplified
       - copy_tree to be simplified
      
      The result is an easier to understand tree of mounts that does not
      allow creation of arbitrary length hash chains in the mount hash table.
      
      The result is also a very slight userspace visible difference in semantics.
      The following two cases now behave identically, where before order
      mattered:
      
      case 1: (explicit user action)
      	B is a slave of A
      	mount something on A/a , it will propagate to B/a
      	and than mount something on B/a
      
      case 2: (tucked mount)
      	B is a slave of A
      	mount something on B/a
      	and than mount something on A/a
      
      Histroically umount A/a would fail in case 1 and succeed in case 2.
      Now umount A/a succeeds in both configurations.
      
      This very small change in semantics appears if anything to be a bug
      fix to me and my survey of userspace leads me to believe that no programs
      will notice or care of this subtle semantic change.
      
      v2: Updated to mnt_change_mountpoint to not call dput or mntput
      and instead to decrement the counts directly.  It is guaranteed
      that there will be other references when mnt_change_mountpoint is
      called so this is safe.
      
      v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt
          As the locking in fs/namespace.c changed between v2 and v3.
      
      v4: Reworked the logic in propagate_mount_busy and __propagate_umount
          that detects when a mount completely covers another mount.
      
      v5: Removed unnecessary tests whose result is alwasy true in
          find_topper and attach_recursive_mnt.
      
      v6: Document the user space visible semantic difference.
      
      Cc: stable@vger.kernel.org
      Fixes: b90fa9ae ("[PATCH] shared mount handling: bind and rbind")
      Tested-by: NAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      1064f874
  8. 01 2月, 2017 1 次提交
    • E
      fs: Better permission checking for submounts · 93faccbb
      Eric W. Biederman 提交于
      To support unprivileged users mounting filesystems two permission
      checks have to be performed: a test to see if the user allowed to
      create a mount in the mount namespace, and a test to see if
      the user is allowed to access the specified filesystem.
      
      The automount case is special in that mounting the original filesystem
      grants permission to mount the sub-filesystems, to any user who
      happens to stumble across the their mountpoint and satisfies the
      ordinary filesystem permission checks.
      
      Attempting to handle the automount case by using override_creds
      almost works.  It preserves the idea that permission to mount
      the original filesystem is permission to mount the sub-filesystem.
      Unfortunately using override_creds messes up the filesystems
      ordinary permission checks.
      
      Solve this by being explicit that a mount is a submount by introducing
      vfs_submount, and using it where appropriate.
      
      vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
      sget and friends know that a mount is a submount so they can take appropriate
      action.
      
      sget and sget_userns are modified to not perform any permission checks
      on submounts.
      
      follow_automount is modified to stop using override_creds as that
      has proven problemantic.
      
      do_mount is modified to always remove the new MS_SUBMOUNT flag so
      that we know userspace will never by able to specify it.
      
      autofs4 is modified to stop using current_real_cred that was put in
      there to handle the previous version of submount permission checking.
      
      cifs is modified to pass the mountpoint all of the way down to vfs_submount.
      
      debugfs is modified to pass the mountpoint all of the way down to
      trace_automount by adding a new parameter.  To make this change easier
      a new typedef debugfs_automount_t is introduced to capture the type of
      the debugfs automount function.
      
      Cc: stable@vger.kernel.org
      Fixes: 069d5ac9 ("autofs:  Fix automounts by using current_real_cred()->uid")
      Fixes: aeaa4a79 ("fs: Call d_automount with the filesystems creds")
      Reviewed-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      Reviewed-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      93faccbb
  9. 10 1月, 2017 1 次提交
    • E
      mnt: Protect the mountpoint hashtable with mount_lock · 3895dbf8
      Eric W. Biederman 提交于
      Protecting the mountpoint hashtable with namespace_sem was sufficient
      until a call to umount_mnt was added to mntput_no_expire.  At which
      point it became possible for multiple calls of put_mountpoint on
      the same hash chain to happen on the same time.
      
      Kristen Johansen <kjlx@templeofstupid.com> reported:
      > This can cause a panic when simultaneous callers of put_mountpoint
      > attempt to free the same mountpoint.  This occurs because some callers
      > hold the mount_hash_lock, while others hold the namespace lock.  Some
      > even hold both.
      >
      > In this submitter's case, the panic manifested itself as a GP fault in
      > put_mountpoint() when it called hlist_del() and attempted to dereference
      > a m_hash.pprev that had been poisioned by another thread.
      
      Al Viro observed that the simple fix is to switch from using the namespace_sem
      to the mount_lock to protect the mountpoint hash table.
      
      I have taken Al's suggested patch moved put_mountpoint in pivot_root
      (instead of taking mount_lock an additional time), and have replaced
      new_mountpoint with get_mountpoint a function that does the hash table
      lookup and addition under the mount_lock.   The introduction of get_mounptoint
      ensures that only the mount_lock is needed to manipulate the mountpoint
      hashtable.
      
      d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
      already set.  This allows get_mountpoint to use the setting of
      DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
      happens exactly once.
      
      Cc: stable@vger.kernel.org
      Fixes: ce07d891 ("mnt: Honor MNT_LOCKED when detaching mounts")
      Reported-by: NKrister Johansen <kjlx@templeofstupid.com>
      Suggested-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Acked-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      3895dbf8
  10. 17 12月, 2016 3 次提交
  11. 06 12月, 2016 2 次提交
  12. 04 12月, 2016 1 次提交
  13. 11 10月, 2016 1 次提交
    • E
      latent_entropy: Mark functions with __latent_entropy · 0766f788
      Emese Revfy 提交于
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
      
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at unpredictable
      times, or they have variable loops, each of which provide some level of
      latent entropy.
      Signed-off-by: NEmese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      0766f788
  14. 01 10月, 2016 1 次提交
    • E
      mnt: Add a per mount namespace limit on the number of mounts · d2921684
      Eric W. Biederman 提交于
      CAI Qian <caiqian@redhat.com> pointed out that the semantics
      of shared subtrees make it possible to create an exponentially
      increasing number of mounts in a mount namespace.
      
          mkdir /tmp/1 /tmp/2
          mount --make-rshared /
          for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done
      
      Will create create 2^20 or 1048576 mounts, which is a practical problem
      as some people have managed to hit this by accident.
      
      As such CVE-2016-6213 was assigned.
      
      Ian Kent <raven@themaw.net> described the situation for autofs users
      as follows:
      
      > The number of mounts for direct mount maps is usually not very large because of
      > the way they are implemented, large direct mount maps can have performance
      > problems. There can be anywhere from a few (likely case a few hundred) to less
      > than 10000, plus mounts that have been triggered and not yet expired.
      >
      > Indirect mounts have one autofs mount at the root plus the number of mounts that
      > have been triggered and not yet expired.
      >
      > The number of autofs indirect map entries can range from a few to the common
      > case of several thousand and in rare cases up to between 30000 and 50000. I've
      > not heard of people with maps larger than 50000 entries.
      >
      > The larger the number of map entries the greater the possibility for a large
      > number of active mounts so it's not hard to expect cases of a 1000 or somewhat
      > more active mounts.
      
      So I am setting the default number of mounts allowed per mount
      namespace at 100,000.  This is more than enough for any use case I
      know of, but small enough to quickly stop an exponential increase
      in mounts.  Which should be perfect to catch misconfigurations and
      malfunctioning programs.
      
      For anyone who needs a higher limit this can be changed by writing
      to the new /proc/sys/fs/mount-max sysctl.
      Tested-by: NCAI Qian <caiqian@redhat.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      d2921684
  15. 23 9月, 2016 2 次提交
  16. 16 9月, 2016 1 次提交
    • M
      locks: fix file locking on overlayfs · c568d683
      Miklos Szeredi 提交于
      This patch allows flock, posix locks, ofd locks and leases to work
      correctly on overlayfs.
      
      Instead of using the underlying inode for storing lock context use the
      overlay inode.  This allows locks to be persistent across copy-up.
      
      This is done by introducing locks_inode() helper and using it instead of
      file_inode() to get the inode in locking code.  For non-overlayfs the two
      are equivalent, except for an extra pointer dereference in locks_inode().
      
      Since lock operations are in "struct file_operations" we must also make
      sure not to call underlying filesystem's lock operations.  Introcude a
      super block flag MS_NOREMOTELOCK to this effect.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Acked-by: NJeff Layton <jlayton@poochiereds.net>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      c568d683
  17. 31 8月, 2016 1 次提交
  18. 01 7月, 2016 1 次提交
    • A
      namespace: update event counter when umounting a deleted dentry · e06b933e
      Andrey Ulanov 提交于
      - m_start() in fs/namespace.c expects that ns->event is incremented each
        time a mount added or removed from ns->list.
      - umount_tree() removes items from the list but does not increment event
        counter, expecting that it's done before the function is called.
      - There are some codepaths that call umount_tree() without updating
        "event" counter. e.g. from __detach_mounts().
      - When this happens m_start may reuse a cached mount structure that no
        longer belongs to ns->list (i.e. use after free which usually leads
        to infinite loop).
      
      This change fixes the above problem by incrementing global event counter
      before invoking umount_tree().
      
      Change-Id: I622c8e84dcb9fb63542372c5dbf0178ee86bb589
      Cc: stable@vger.kernel.org
      Signed-off-by: NAndrey Ulanov <andreyu@google.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e06b933e
  19. 24 6月, 2016 5 次提交
    • A
      fs: Treat foreign mounts as nosuid · 380cf5ba
      Andy Lutomirski 提交于
      If a process gets access to a mount from a different user
      namespace, that process should not be able to take advantage of
      setuid files or selinux entrypoints from that filesystem.  Prevent
      this by treating mounts from other mount namespaces and those not
      owned by current_user_ns() or an ancestor as nosuid.
      
      This will make it safer to allow more complex filesystems to be
      mounted in non-root user namespaces.
      
      This does not remove the need for MNT_LOCK_NOSUID.  The setuid,
      setgid, and file capability bits can no longer be abused if code in
      a user namespace were to clear nosuid on an untrusted filesystem,
      but this patch, by itself, is insufficient to protect the system
      from abuse of files that, when execed, would increase MAC privilege.
      
      As a more concrete explanation, any task that can manipulate a
      vfsmount associated with a given user namespace already has
      capabilities in that namespace and all of its descendents.  If they
      can cause a malicious setuid, setgid, or file-caps executable to
      appear in that mount, then that executable will only allow them to
      elevate privileges in exactly the set of namespaces in which they
      are already privileges.
      
      On the other hand, if they can cause a malicious executable to
      appear with a dangerous MAC label, running it could change the
      caller's security context in a way that should not have been
      possible, even inside the namespace in which the task is confined.
      
      As a hardening measure, this would have made CVE-2014-5207 much
      more difficult to exploit.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NSeth Forshee <seth.forshee@canonical.com>
      Acked-by: NJames Morris <james.l.morris@oracle.com>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      380cf5ba
    • E
      userns: Remove implicit MNT_NODEV fragility. · 67690f93
      Eric W. Biederman 提交于
      Replace the implict setting of MNT_NODEV on mounts that happen with
      just user namespace permissions with an implicit setting of SB_I_NODEV
      in s_iflags.  The visibility of the implicit MNT_NODEV has caused
      problems in the past.
      
      With this change the fragile case where an implicit MNT_NODEV needs to
      be preserved in do_remount is removed.  Using SB_I_NODEV is much less
      fragile as s_iflags are set during the original mount and never
      changed.
      
      In do_new_mount with the implicit setting of MNT_NODEV gone, the only
      code that can affect mnt_flags is fs_fully_visible so simplify the if
      statement and reduce the indentation of the code to make that clear.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      67690f93
    • E
      mnt: Simplify mount_too_revealing · a1935c17
      Eric W. Biederman 提交于
      Verify all filesystems that we check in mount_too_revealing set
      SB_I_NOEXEC and SB_I_NODEV in sb->s_iflags.  That is true for today
      and it should remain true in the future.
      
      Remove the now unnecessary checks from mnt_already_visibile that
      ensure MNT_LOCK_NOSUID, MNT_LOCK_NOEXEC, and MNT_LOCK_NODEV are
      preserved.  Making the code shorter and easier to read.
      
      Relying on SB_I_NOEXEC and SB_I_NODEV instead of the user visible
      MNT_NOSUID, MNT_NOEXEC, and MNT_NODEV ensures the many current
      systems where proc and sysfs are mounted with "nosuid, nodev, noexec"
      and several slightly buggy container applications don't bother to
      set those flags continue to work.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      a1935c17
    • E
      mnt: Move the FS_USERNS_MOUNT check into sget_userns · a001e74c
      Eric W. Biederman 提交于
      Allowing a filesystem to be mounted by other than root in the initial
      user namespace is a filesystem property not a mount namespace property
      and as such should be checked in filesystem specific code.  Move the
      FS_USERNS_MOUNT test into super.c:sget_userns().
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      a001e74c
    • E
      mnt: Refactor fs_fully_visible into mount_too_revealing · 8654df4e
      Eric W. Biederman 提交于
      Replace the call of fs_fully_visible in do_new_mount from before the
      new superblock is allocated with a call of mount_too_revealing after
      the superblock is allocated.   This winds up being a much better location
      for maintainability of the code.
      
      The first change this enables is the replacement of FS_USERNS_VISIBLE
      with SB_I_USERNS_VISIBLE.  Moving the flag from struct filesystem_type
      to sb_iflags on the superblock.
      
      Unfortunately mount_too_revealing fundamentally needs to touch
      mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate
      times.  If the mnt_flags did not need to be touched the code
      could be easily moved into the filesystem specific mount code.
      Acked-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      8654df4e
  20. 15 6月, 2016 1 次提交
    • E
      mnt: Account for MS_RDONLY in fs_fully_visible · 695e9df0
      Eric W. Biederman 提交于
      In rare cases it is possible for s_flags & MS_RDONLY to be set but
      MNT_READONLY to be clear.  This starting combination can cause
      fs_fully_visible to fail to ensure that the new mount is readonly.
      Therefore force MNT_LOCK_READONLY in the new mount if MS_RDONLY
      is set on the source filesystem of the mount.
      
      In general both MS_RDONLY and MNT_READONLY are set at the same for
      mounts so I don't expect any programs to care.  Nor do I expect
      MS_RDONLY to be set on proc or sysfs in the initial user namespace,
      which further decreases the likelyhood of problems.
      
      Which means this change should only affect system configurations by
      paranoid sysadmins who should welcome the additional protection
      as it keeps people from wriggling out of their policies.
      
      Cc: stable@vger.kernel.org
      Fixes: 8c6cf9cc ("mnt: Modify fs_fully_visible to deal with locked ro nodev and atime")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      695e9df0
  21. 07 6月, 2016 2 次提交
  22. 23 1月, 2016 1 次提交
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  23. 04 1月, 2016 1 次提交
  24. 07 12月, 2015 1 次提交
  25. 16 11月, 2015 2 次提交
  26. 24 7月, 2015 1 次提交
    • E
      mnt: In detach_mounts detach the appropriate unmounted mount · fe78fcc8
      Eric W. Biederman 提交于
      The handling of in detach_mounts of unmounted but connected mounts is
      buggy and can lead to an infinite loop.
      
      Correct the handling of unmounted mounts in detach_mount.  When the
      mountpoint of an unmounted but connected mount is connected to a
      dentry, and that dentry is deleted we need to disconnect that mount
      from the parent mount and the deleted dentry.
      
      Nothing changes for the unmounted and connected children.  They can be
      safely ignored.
      
      Cc: stable@vger.kernel.org
      Fixes: ce07d891 mnt: Honor MNT_LOCKED when detaching mounts
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      fe78fcc8
  27. 23 7月, 2015 1 次提交
    • E
      mnt: Clarify and correct the disconnect logic in umount_tree · f2d0a123
      Eric W. Biederman 提交于
      rmdir mntpoint will result in an infinite loop when there is
      a mount locked on the mountpoint in another mount namespace.
      
      This is because the logic to test to see if a mount should
      be disconnected in umount_tree is buggy.
      
      Move the logic to decide if a mount should remain connected to
      it's mountpoint into it's own function disconnect_mount so that
      clarity of expression instead of terseness of expression becomes
      a virtue.
      
      When the conditions where it is invalid to leave a mount connected
      are first ruled out, the logic for deciding if a mount should
      be disconnected becomes much clearer and simpler.
      
      Fixes: e0c9c0af mnt: Update detach_mounts to leave mounts connected
      Fixes: ce07d891 mnt: Honor MNT_LOCKED when detaching mounts
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      f2d0a123