1. 14 3月, 2020 18 次提交
    • A
      pick_link(): check for WALK_TRAILING, not LOOKUP_PARENT · b1a81972
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b1a81972
    • A
      namei: invert the meaning of WALK_FOLLOW · 8c4efe22
      Al Viro 提交于
      old flags & WALK_FOLLOW <=> new !(flags & WALK_TRAILING)
      That's what that flag had really been used for.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8c4efe22
    • A
      sanitize handling of nd->last_type, kill LAST_BIND · b4c03536
      Al Viro 提交于
      ->last_type values are set in 3 places: path_init() (sets to LAST_ROOT),
      link_path_walk (LAST_NORM/DOT/DOTDOT) and pick_link (LAST_BIND).
      
      The are checked in walk_component(), lookup_last() and do_last().
      They also get copied to the caller by filename_parentat().  In the last
      3 cases the value is what we had at the return from link_path_walk().
      In case of walk_component() it's either directly downstream from
      assignment in link_path_walk() or, when called by lookup_last(), the
      value we have at the return from link_path_walk().
      
      The value at the entry into link_path_walk() can survive to return only
      if the pathname contains nothing but slashes.  Note that pick_link()
      never returns such - pure jumps are handled directly.  So for the calls
      of link_path_walk() for trailing symlinks it does not matter what value
      had been there at the entry; the value at the return won't depend upon it.
      
      There are 3 call chains that might have pick_link() storing LAST_BIND:
      
      1) pick_link() from step_into() from walk_component() from
      link_path_walk().  In that case we will either be parsing the next
      component immediately after return into link_path_walk(), which will
      overwrite the ->last_type before anyone has a chance to look at it,
      or we'll fail, in which case nobody will be looking at ->last_type at all.
      
      2) pick_link() from step_into() from walk_component() from lookup_last().
      The value is never looked at due to the above; it won't affect the value
      seen at return from any link_path_walk().
      
      3) pick_link() from step_into() from do_last().  Ditto.
      
      In other words, assignemnt in pick_link() is pointless, and so is
      LAST_BIND itself; nothing ever looks at that value.  Kill it off.
      And make link_path_walk() _always_ assign ->last_type - in the only
      case when the value at the entry might survive to the return that value
      is always LAST_ROOT, inherited from path_init().  Move that assignment
      from path_init() into the beginning of link_path_walk(), to consolidate
      the things.
      
      Historical note: LAST_BIND used to be used for the kludge with trailing
      pure jump symlinks (extra iteration through the top-level loop).
      No point keeping it anymore...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b4c03536
    • A
      finally fold get_link() into pick_link() · ad6cc4c3
      Al Viro 提交于
      kill nd->link_inode, while we are at it
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ad6cc4c3
    • A
      merging pick_link() with get_link(), part 6 · 06708adb
      Al Viro 提交于
      move the only remaining call of get_link() into pick_link()
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      06708adb
    • A
      merging pick_link() with get_link(), part 5 · b0417d2c
      Al Viro 提交于
      move get_link() call into step_into().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b0417d2c
    • A
      merging pick_link() with get_link(), part 4 · 92d27016
      Al Viro 提交于
      Move the call of get_link() into walk_component().  Change the
      calling conventions for walk_component() to returning the link
      body to follow (if any).
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      92d27016
    • A
      merging pick_link() with get_link(), part 3 · 40fcf5a9
      Al Viro 提交于
      After a pure jump ("/" or procfs-style symlink) we don't need to
      hold the link anymore.  link_path_walk() dropped it if such case
      had been detected, lookup_last/do_last() (i.e. old trailing_symlink())
      left it on the stack - it ended up calling terminate_walk() shortly
      anyway, which would've purged the entire stack.
      
      Do it in get_link() itself instead.  Simpler logics that way...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      40fcf5a9
    • A
      merging pick_link() with get_link(), part 2 · 1ccac622
      Al Viro 提交于
      Fold trailing_symlink() into lookup_last() and do_last(), change
      the calling conventions of those two.  Rules change:
      	success, we are done => NULL instead of 0
      	error	=> ERR_PTR(-E...) instead of -E...
      	got a symlink to follow => return the path to be followed instead of 1
      
      The loops calling those (in path_lookupat() and path_openat()) adjusted.
      
      A subtle change of control flow here: originally a pure-jump trailing
      symlink ("/" or procfs one) would've passed through the upper level
      loop once more, with "" for path to traverse.  That would've brought
      us back to the lookup_last/do_last entry and we would've hit LAST_BIND
      case (LAST_BIND left from get_link() called by trailing_symlink())
      and pretty much skip to the point right after where we'd left the
      sucker back when we picked that trailing symlink.
      
      Now we don't bother with that extra pass through the upper level
      loop - if get_link() says "I've just done a pure jump, nothing
      else to do", we just treat that as non-symlink case.
      
      Boilerplate added on that step will go away shortly - it'll migrate
      into walk_component() and then to step_into(), collapsing into the
      change of calling conventions for those.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1ccac622
    • A
      merging pick_link() with get_link(), part 1 · 43679723
      Al Viro 提交于
      Move restoring LOOKUP_PARENT and zeroing nd->stack.name[0] past
      the call of get_link() (nothing _currently_ uses them in there).
      That allows to moved the call of may_follow_link() into get_link()
      as well, since now the presence of LOOKUP_PARENT distinguishes
      the callers from each other (link_path_walk() has it, trailing_symlink()
      doesn't).
      
      Preparations for folding trailing_symlink() into callers (lookup_last()
      and do_last()) and changing the calling conventions of those.  Next
      stage after that will have get_link() call migrate into walk_component(),
      then - into step_into().  It's tricky enough to warrant doing that
      in stages, unfortunately...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      43679723
    • A
      a9dc1494
    • A
      LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() · 161aff1d
      Al Viro 提交于
      New LOOKUP flag, telling path_lookupat() to act as path_mountpointat().
      IOW, traverse mounts at the final point and skip revalidation of the
      location where it ends up.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      161aff1d
    • A
      fold handle_mounts() into step_into() · cbae4d12
      Al Viro 提交于
      The following is true:
      	* calls of handle_mounts() and step_into() are always
      paired in sequences like
      	err = handle_mounts(nd, dentry, &path, &inode, &seq);
      	if (unlikely(err < 0))
      		return err;
      	err = step_into(nd, &path, flags, inode, seq);
      	* in all such sequences path is uninitialized before and
      unused after this pair of calls
      	* in all such sequences inode and seq are unused afterwards.
      
      So the call of handle_mounts() can be shifted inside step_into(),
      turning 'path' into a local variable in the combined function.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cbae4d12
    • A
      new step_into() flag: WALK_NOFOLLOW · aca2903e
      Al Viro 提交于
      Tells step_into() not to follow symlinks, regardless of LOOKUP_FOLLOW.
      Allows to switch handle_lookup_down() to of step_into(), getting
      all follow_managed() and step_into() calls paired.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      aca2903e
    • A
      step_into() callers: dismiss the symlink earlier · 56676ec3
      Al Viro 提交于
      We need to dismiss a symlink when we are done traversing it;
      currently that's done when we call step_into() for its last
      component.  For the cases when we do not call step_into()
      for that component (i.e. when it's . or ..) we do the same
      symlink dismissal after the call of handle_dots().
      
      What we need to guarantee is that the symlink won't be dismissed
      while we are still using nd->last.name - it's pointing into the
      body of said symlink.  step_into() is sufficiently late - by
      the time it's called we'd already obtained the dentry, so the
      name we'd been looking up is no longer needed.  However, it
      turns out to be cleaner to have that ("we are done with that
      component now, can dismiss the link") done explicitly - in the
      callers of step_into().
      
      In handle_dots() case we won't be using the component string
      at all, so for . and .. the corresponding point is actually
      _before_ the call of handle_dots(), not after it.
      
      Fix a minor irregularity in do_last(), while we are at it -
      if trailing symlink ended with . or .. we forgot to dismiss
      it.  Not a problem, since nameidata is about to be done with
      (neither . nor .. can be a trailing symlink, so this is the
      last iteration through the loop) and terminate_walk() will
      clean the stack anyway, but let's keep it more regular.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      56676ec3
    • A
      lookup_fast(): take mount traversal into callers · 20e34357
      Al Viro 提交于
      Current calling conventions: -E... on error, 0 on cache miss,
      result of handle_mounts(nd, dentry, path, inode, seqp) on
      success.  Turn that into returning ERR_PTR(-E...), NULL and dentry
      resp.; deal with handle_mounts() in the callers.  The thing
      is, they already do that in cache miss handling case, so we
      just need to supply dentry to them and unify the mount traversal
      in those cases.  Fewer arguments that way, and we get closer
      to merging handle_mounts() and step_into().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      20e34357
    • A
      teach handle_mounts() to handle RCU mode · c153007b
      Al Viro 提交于
      ... and make the callers of __follow_mount_rcu() use handle_mounts().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c153007b
    • A
      lookup_fast(): consolidate the RCU success case · b023e172
      Al Viro 提交于
      1) in case of __follow_mount_rcu() failure, lookup_fast() proceeds
      to call unlazy_child() and, should it succeed, handle_mounts().
      Note that we have status > 0 (or we wouldn't be calling
      __follow_mount_rcu() at all), so all stuff conditional upon
      non-positive status won't be even touched.
      
      Consolidate just that sequence after the call of __follow_mount_rcu().
      
      2) calling d_is_negative() and keeping its result is pointless -
      we either don't get past checking ->d_seq (and don't use the results of
      d_is_negative() at all), or we are guaranteed that ->d_inode and
      type bits of ->d_flags had been consistent at the time of d_is_negative()
      call.  IOW, we could only get to the use of its result if it's
      equal to !inode.  The same ->d_seq check guarantees that after that point
      this CPU won't observe ->d_flags values older than ->d_inode update.
      So 'negative' variable is completely pointless these days.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b023e172
  2. 13 3月, 2020 3 次提交
  3. 28 2月, 2020 6 次提交
    • A
      atomic_open(): saner calling conventions (return dentry on success) · 239eb983
      Al Viro 提交于
      Currently it either returns -E... or puts (nd->path.mnt,dentry)
      into *path and returns 0.  Make it return ERR_PTR(-E...) or
      dentry; adjust the caller.  Fewer arguments and it's easier
      to keep track of *path contents that way.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      239eb983
    • A
      handle_mounts(): start building a sane wrapper for follow_managed() · bd7c4b50
      Al Viro 提交于
      All callers of follow_managed() follow it on success with the same steps -
      d_backing_inode(path->dentry) is calculated and stored into some struct inode *
      variable and, in all but one case, an unsigned variable (nd->seq to be) is
      zeroed.  The single exception is lookup_fast() and there zeroing is correct
      thing to do - not doing it is a pointless microoptimization.
      
      	Add a wrapper for follow_managed() that would do that combination.
      It's mostly a vehicle for code massage - it will be changing quite a bit,
      and the current calling conventions are by no means final.  Right now it
      takes path, nameidata and (as out params) inode and seq, similar to
      __follow_mount_rcu().  Which will soon get folded into it...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      bd7c4b50
    • A
      make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW · 31d1726d
      Al Viro 提交于
      O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
      As it is, we might or might not have LOOKUP_FOLLOW in op->intent
      in that case - that depends upon having O_NOFOLLOW in open flags.
      It doesn't matter, since we won't be checking it in that case -
      do_last() bails out earlier.
      
      However, making sure it's not set (i.e. acting as if we had an explicit
      O_NOFOLLOW) makes the behaviour more explicit and allows to reorder the
      check for O_CREAT | O_EXCL in do_last() with the call of step_into()
      immediately following it.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      31d1726d
    • A
      follow_automount() doesn't need the entire nameidata · 1c9f5e06
      Al Viro 提交于
      Only the address of ->total_link_count and the flags.
      And fix an off-by-one is ELOOP detection - make it
      consistent with symlink following, where we check if
      the pre-increment value has reached 40, rather than
      check the post-increment one.
      
      [kudos to Christian Brauner for spotted braino]
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1c9f5e06
    • A
      follow_automount(): get rid of dead^Wstillborn code · 25e195aa
      Al Viro 提交于
      1) no instances of ->d_automount() have ever made use of the "return
      ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's
      a rudiment of plans that got superseded before the thing went into
      the tree.  Despite the comment in follow_automount(), autofs has
      never done that.
      
      2) if there's no ->d_automount() in dentry_operations, filesystems
      should not set DCACHE_NEED_AUTOMOUNT in the first place.  None have
      ever done so...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      25e195aa
    • A
      fix automount/automount race properly · 26df6034
      Al Viro 提交于
      Protection against automount/automount races (two threads hitting the same
      referral point at the same time) is based upon do_add_mount() prevention of
      identical overmounts - trying to overmount the root of mounted tree with
      the same tree fails with -EBUSY.  It's unreliable (the other thread might've
      mounted something on top of the automount it has triggered) *and* causes
      no end of headache for follow_automount() and its caller, since
      finish_automount() behaves like do_new_mount() - if the mountpoint to be is
      overmounted, it mounts on top what's overmounting it.  It's not only wrong
      (we want to go into what's overmounting the automount point and quietly
      discard what we planned to mount there), it introduces the possibility of
      original parent mount getting dropped.  That's what 8aef1884 (VFS: Fix
      vfsmount overput on simultaneous automount) deals with, but it can't do
      anything about the reliability of conflict detection - if something had
      been overmounted the other thread's automount (e.g. that other thread
      having stepped into automount in mount(2)), we don't get that -EBUSY and
      the result is
      	 referral point under automounted NFS under explicit overmount
      under another copy of automounted NFS
      
      What we need is finish_automount() *NOT* digging into overmounts - if it
      finds one, it should just quietly discard the thing it was asked to mount.
      And don't bother with actually crossing into the results of finish_automount() -
      the same loop that calls follow_automount() will do that just fine on the
      next iteration.
      
      IOW, instead of calling lock_mount() have finish_automount() do it manually,
      _without_ the "move into overmount and retry" part.  And leave crossing into
      the results to the caller of follow_automount(), which simplifies it a lot.
      
      Moral: if you end up with a lot of glue working around the calling conventions
      of something, perhaps these calling conventions are simply wrong...
      
      Fixes: 8aef1884 (VFS: Fix vfsmount overput on simultaneous automount)
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      26df6034
  4. 02 2月, 2020 1 次提交
    • A
      vfs: fix do_last() regression · 6404674a
      Al Viro 提交于
      Brown paperbag time: fetching ->i_uid/->i_mode really should've been
      done from nd->inode.  I even suggested that, but the reason for that has
      slipped through the cracks and I went for dir->d_inode instead - made
      for more "obvious" patch.
      
      Analysis:
      
       - at the entry into do_last() and all the way to step_into(): dir (aka
         nd->path.dentry) is known not to have been freed; so's nd->inode and
         it's equal to dir->d_inode unless we are already doomed to -ECHILD.
         inode of the file to get opened is not known.
      
       - after step_into(): inode of the file to get opened is known; dir
         might be pointing to freed memory/be negative/etc.
      
       - at the call of may_create_in_sticky(): guaranteed to be out of RCU
         mode; inode of the file to get opened is known and pinned; dir might
         be garbage.
      
      The last was the reason for the original patch.  Except that at the
      do_last() entry we can be in RCU mode and it is possible that
      nd->path.dentry->d_inode has already changed under us.
      
      In that case we are going to fail with -ECHILD, but we need to be
      careful; nd->inode is pointing to valid struct inode and it's the same
      as nd->path.dentry->d_inode in "won't fail with -ECHILD" case, so we
      should use that.
      Reported-by: N"Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com>
      Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com
      Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@kernel.org
      Fixes: d0cb5018 ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late")
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6404674a
  5. 26 1月, 2020 1 次提交
  6. 15 1月, 2020 2 次提交
    • A
      fix autofs regression caused by follow_managed() changes · 508c8772
      Al Viro 提交于
      we need to reload ->d_flags after the call of ->d_manage() - the thing
      might've been called with dentry still negative and have the damn thing
      turned positive while we'd waited.
      
      Fixes: d41efb52 "fs/namei.c: pull positivity check into follow_managed()"
      Reported-by: NIan Kent <raven@themaw.net>
      Tested-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      508c8772
    • A
      reimplement path_mountpoint() with less magic · c64cd6e3
      Al Viro 提交于
      ... and get rid of a bunch of bugs in it.  Background:
      the reason for path_mountpoint() is that umount() really doesn't
      want attempts to revalidate the root of what it's trying to umount.
      The thing we want to avoid actually happen from complete_walk();
      solution was to do something parallel to normal path_lookupat()
      and it both went overboard and got the boilerplate subtly
      (and not so subtly) wrong.
      
      A better solution is to do pretty much what the normal path_lookupat()
      does, but instead of complete_walk() do unlazy_walk().  All it takes
      to avoid that ->d_weak_revalidate() call...  mountpoint_last() goes
      away, along with everything it got wrong, and so does the magic around
      LOOKUP_NO_REVAL.
      
      Another source of bugs is that when we traverse mounts at the final
      location (and we need to do that - umount . expects to get whatever's
      overmounting ., if any, out of the lookup) we really ought to take
      care of ->d_manage() - as it is, manual umount of autofs automount
      in progress can lead to unpleasant surprises for the daemon.  Easily
      solved by using handle_lookup_down() instead of follow_mount().
      Tested-by: NIan Kent <raven@themaw.net>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c64cd6e3
  7. 09 12月, 2019 9 次提交
    • A
      namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution · ab87f9a5
      Aleksa Sarai 提交于
      Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution
      (in the case of LOOKUP_BENEATH the resolution will still fail if ".."
      resolution would resolve a path outside of the root -- while
      LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are
      still disallowed entirely[*].
      
      As Jann explains[1,2], the need for this patch (and the original no-".."
      restriction) is explained by observing there is a fairly easy-to-exploit
      race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and
      LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be
      used to "skip over" nd->root and thus escape to the filesystem above
      nd->root.
      
        thread1 [attacker]:
          for (;;)
            renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
        thread2 [victim]:
          for (;;)
            openat2(dirb, "b/c/../../etc/shadow",
                    { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );
      
      With fairly significant regularity, thread2 will resolve to
      "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
      (though somewhat more privileged) attack using MS_MOVE.
      
      With this patch, such cases will be detected *during* ".." resolution
      and will return -EAGAIN for userspace to decide to either retry or abort
      the lookup. It should be noted that ".." is the weak point of chroot(2)
      -- walking *into* a subdirectory tautologically cannot result in you
      walking *outside* nd->root (except through a bind-mount or magic-link).
      There is also no other way for a directory's parent to change (which is
      the primary worry with ".." resolution here) other than a rename or
      MS_MOVE.
      
      The primary reason for deferring to userspace with -EAGAIN is that an
      in-kernel retry loop (or doing a path_is_under() check after re-taking
      the relevant seqlocks) can become unreasonably expensive on machines
      with lots of VFS activity (nfsd can cause lots of rename_lock updates).
      Thus it should be up to userspace how many times they wish to retry the
      lookup -- the selftests for this attack indicate that there is a ~35%
      chance of the lookup succeeding on the first try even with an attacker
      thrashing rename_lock.
      
      A variant of the above attack is included in the selftests for
      openat2(2) later in this patch series. I've run this test on several
      machines for several days and no instances of a breakout were detected.
      While this is not concrete proof that this is safe, when combined with
      the above argument it should lend some trustworthiness to this
      construction.
      
      [*] It may be acceptable in the future to do a path_is_under() check for
          magic-links after they are resolved. However this seems unlikely to
          be a feature that people *really* need -- it can be added later if
          it turns out a lot of people want it.
      
      [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      [2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NJann Horn <jannh@google.com>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ab87f9a5
    • A
      namei: LOOKUP_IN_ROOT: chroot-like scoped resolution · 8db52c7e
      Aleksa Sarai 提交于
      /* Background. */
      Container runtimes or other administrative management processes will
      often interact with root filesystems while in the host mount namespace,
      because the cost of doing a chroot(2) on every operation is too
      prohibitive (especially in Go, which cannot safely use vfork). However,
      a malicious program can trick the management process into doing
      operations on files outside of the root filesystem through careful
      crafting of symlinks.
      
      Most programs that need this feature have attempted to make this process
      safe, by doing all of the path resolution in userspace (with symlinks
      being scoped to the root of the malicious root filesystem).
      Unfortunately, this method is prone to foot-guns and usually such
      implementations have subtle security bugs.
      
      Thus, what userspace needs is a way to resolve a path as though it were
      in a chroot(2) -- with all absolute symlinks being resolved relative to
      the dirfd root (and ".." components being stuck under the dirfd root).
      It is much simpler and more straight-forward to provide this
      functionality in-kernel (because it can be done far more cheaply and
      correctly).
      
      More classical applications that also have this problem (which have
      their own potentially buggy userspace path sanitisation code) include
      web servers, archive extraction tools, network file servers, and so on.
      
      /* Userspace API. */
      LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_IN_ROOT applies to all components of the path.
      
      With LOOKUP_IN_ROOT, any path component which attempts to cross the
      starting point of the pathname lookup (the dirfd passed to openat) will
      remain at the starting point. Thus, all absolute paths and symlinks will
      be scoped within the starting point.
      
      There is a slight change in behaviour regarding pathnames -- if the
      pathname is absolute then the dirfd is still used as the root of
      resolution of LOOKUP_IN_ROOT is specified (this is to avoid obvious
      foot-guns, at the cost of a minor API inconsistency).
      
      As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
      LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
      will be lifted in a future patch, but requires more work to ensure that
      permitting ".." is done safely.
      
      Magic-link jumps are also blocked, because they can beam the path lookup
      across the starting point. It would be possible to detect and block
      only the "bad" crossings with path_is_under() checks, but it's unclear
      whether it makes sense to permit magic-links at all. However, userspace
      is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
      magic-link crossing is entirely disabled.
      
      /* Testing. */
      LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8db52c7e
    • A
      namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution · adb21d2b
      Aleksa Sarai 提交于
      /* Background. */
      There are many circumstances when userspace wants to resolve a path and
      ensure that it doesn't go outside of a particular root directory during
      resolution. Obvious examples include archive extraction tools, as well as
      other security-conscious userspace programs. FreeBSD spun out O_BENEATH
      from their Capsicum project[1,2], so it also seems reasonable to
      implement similar functionality for Linux.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
      variation on David Drysdale's O_BENEATH patchset[4], which in turn was
      based on the Capsicum project[5]).
      
      /* Userspace API. */
      LOOKUP_BENEATH will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_BENEATH applies to all components of the path.
      
      With LOOKUP_BENEATH, any path component which attempts to "escape" the
      starting point of the filesystem lookup (the dirfd passed to openat)
      will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.
      
      Due to a security concern brought up by Jann[6], any ".." path
      components are also blocked. This restriction will be lifted in a future
      patch, but requires more work to ensure that permitting ".." is done
      safely.
      
      Magic-link jumps are also blocked, because they can beam the path lookup
      across the starting point. It would be possible to detect and block
      only the "bad" crossings with path_is_under() checks, but it's unclear
      whether it makes sense to permit magic-links at all. However, userspace
      is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
      magic-link crossing is entirely disabled.
      
      /* Testing. */
      LOOKUP_BENEATH is tested as part of the openat2(2) selftests.
      
      [1]: https://reviews.freebsd.org/D2808
      [2]: https://reviews.freebsd.org/D17547
      [3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      [6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NDavid Drysdale <drysdale@google.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      adb21d2b
    • A
      namei: LOOKUP_NO_XDEV: block mountpoint crossing · 72ba2929
      Aleksa Sarai 提交于
      /* Background. */
      The need to contain path operations within a mountpoint has been a
      long-standing usecase that userspace has historically implemented
      manually with liberal usage of stat(). find, rsync, tar and
      many other programs implement these semantics -- but it'd be much
      simpler to have a fool-proof way of refusing to open a path if it
      crosses a mountpoint.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
      variation on David Drysdale's O_BENEATH patchset[2], which in turn was
      based on the Capsicum project[3]).
      
      /* Userspace API. */
      LOOKUP_NO_XDEV will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_XDEV applies to all components of the path.
      
      With LOOKUP_NO_XDEV, any path component which crosses a mount-point
      during path resolution (including "..") will yield an -EXDEV. Absolute
      paths, absolute symlinks, and magic-links will only yield an -EXDEV if
      the jump involved changing mount-points.
      
      /* Testing. */
      LOOKUP_NO_XDEV is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NDavid Drysdale <drysdale@google.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      72ba2929
    • A
      namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution · 4b99d499
      Aleksa Sarai 提交于
      /* Background. */
      There has always been a special class of symlink-like objects in procfs
      (and a few other pseudo-filesystems) which allow for non-lexical
      resolution of paths using nd_jump_link(). These "magic-links" do not
      follow traditional mount namespace boundaries, and have been used
      consistently in container escape attacks because they can be used to
      trick unsuspecting privileged processes into resolving unexpected paths.
      
      It is also non-trivial for userspace to unambiguously avoid resolving
      magic-links, because they do not have a reliable indication that they
      are a magic-link (in order to verify them you'd have to manually open
      the path given by readlink(2) and then verify that the two file
      descriptors reference the same underlying file, which is plagued with
      possible race conditions or supplementary attack scenarios).
      
      It would therefore be very helpful for userspace to be able to avoid
      these symlinks easily, thus hopefully removing a tool from attackers'
      toolboxes.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
      variation on David Drysdale's O_BENEATH patchset[2], which in turn was
      based on the Capsicum project[3]).
      
      /* Userspace API. */
      LOOKUP_NO_MAGICLINKS will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_MAGICLINKS applies to all components of the path.
      
      With LOOKUP_NO_MAGICLINKS, any magic-link path component encountered
      during path resolution will yield -ELOOP. The handling of ~LOOKUP_FOLLOW
      for a trailing magic-link is identical to LOOKUP_NO_SYMLINKS.
      
      LOOKUP_NO_SYMLINKS implies LOOKUP_NO_MAGICLINKS.
      
      /* Testing. */
      LOOKUP_NO_MAGICLINKS is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NDavid Drysdale <drysdale@google.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4b99d499
    • A
      namei: LOOKUP_NO_SYMLINKS: block symlink resolution · 27812141
      Aleksa Sarai 提交于
      /* Background. */
      Userspace cannot easily resolve a path without resolving symlinks, and
      would have to manually resolve each path component with O_PATH and
      O_NOFOLLOW. This is clearly inefficient, and can be fairly easy to screw
      up (resulting in possible security bugs). Linus has mentioned that Git
      has a particular need for this kind of flag[1]. It also resolves a
      fairly long-standing perceived deficiency in O_NOFOLLOw -- that it only
      blocks the opening of trailing symlinks.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[2] (which was a
      variation on David Drysdale's O_BENEATH patchset[3], which in turn was
      based on the Capsicum project[4]).
      
      /* Userspace API. */
      LOOKUP_NO_SYMLINKS will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_SYMLINKS applies to all components of the path.
      
      With LOOKUP_NO_SYMLINKS, any symlink path component encountered during
      path resolution will yield -ELOOP. If the trailing component is a
      symlink (and no other components were symlinks), then O_PATH|O_NOFOLLOW
      will not error out and will instead provide a handle to the trailing
      symlink -- without resolving it.
      
      /* Testing. */
      LOOKUP_NO_SYMLINKS is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/CA+55aFyOKM7DW7+0sdDFKdZFXgptb5r1id9=Wvhd8AgSP7qjwQ@mail.gmail.com/
      [2]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [3]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [4]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      27812141
    • A
      namei: allow set_root() to produce errors · 740a1678
      Aleksa Sarai 提交于
      For LOOKUP_BENEATH and LOOKUP_IN_ROOT it is necessary to ensure that
      set_root() is never called, and thus (for hardening purposes) it should
      return an error rather than permit a breakout from the root. In
      addition, move all of the repetitive set_root() calls to nd_jump_root().
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      740a1678
    • A
      namei: allow nd_jump_link() to produce errors · 1bc82070
      Aleksa Sarai 提交于
      In preparation for LOOKUP_NO_MAGICLINKS, it's necessary to add the
      ability for nd_jump_link() to return an error which the corresponding
      get_link() caller must propogate back up to the VFS.
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1bc82070
    • A
      namei: only return -ECHILD from follow_dotdot_rcu() · 2b98149c
      Aleksa Sarai 提交于
      It's over-zealous to return hard errors under RCU-walk here, given that
      a REF-walk will be triggered for all other cases handling ".." under
      RCU.
      
      The original purpose of this check was to ensure that if a rename occurs
      such that a directory is moved outside of the bind-mount which the
      resolution started in, it would be detected and blocked to avoid being
      able to mess with paths outside of the bind-mount. However, triggering a
      new REF-walk is just as effective a solution.
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2b98149c