1. 28 5月, 2020 25 次提交
    • J
      io_uring: avoid ring quiesce for fixed file set unregister and update · 43df7c78
      Jens Axboe 提交于
      to #26323588
      
      We currently fully quiesce the ring before an unregister or update of
      the fixed fileset. This is very expensive, and we can be a bit smarter
      about this.
      
      Add a percpu refcount for the file tables as a whole. Grab a percpu ref
      when we use a registered file, and put it on completion. This is cheap
      to do. Upon removal of a file from a set, switch the ref count to atomic
      mode. When we hit zero ref on the completion side, then we know we can
      drop the previously registered files. When the old files have been
      dropped, switch the ref back to percpu mode for normal operation.
      
      Since there's a period between doing the update and the kernel being
      done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the
      same action. The application knows the update has completed when it gets
      the CQE for it. Between doing the update and receiving this completion,
      the application must continue to use the unregistered fd if submitting
      IO on this particular file.
      
      This takes the runtime of test/file-register from liburing from 14s to
      about 0.7s.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      43df7c78
    • R
      io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT · 3fc2e820
      Roman Gushchin 提交于
      to #26323588
      
      commit 214828962dead0c698f92b60ef97ce3c5fc2c8fe upstream.
      
      Percpu reference counters should now be initialized with the
      PERCPU_REF_ALLOW_REINIT in order to allow switching them to the
      percpu mode from the atomic mode. This is exactly what
      percpu_ref_reinit() called from __io_uring_register() is supposed to
      do. So let's initialize percpu refcounters with the
      PERCU_REF_ALLOW_REINIT flag.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3fc2e820
    • R
      percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT · 31fafee5
      Roman Gushchin 提交于
      to #26323588
      
      commit 7d9ab9b6adffd9c474c1274acb5f6208f9a09cf3 upstream.
      
      Release percpu memory after finishing the switch to the atomic mode
      if only PERCPU_REF_ALLOW_REINIT isn't set.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      31fafee5
    • R
      percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag · 02f53f5e
      Roman Gushchin 提交于
      to #26323588
      
      commit 09ed79d6d75f06cc963a78f25463251b0a758dc7 upstream.
      
      In most cases percpu reference counters are not switched to the
      percpu mode after they reach the atomic mode. Some obvious exceptions
      are reference counters which are initialized into the atomic
      mode (using PERCPU_REF_INIT_ATOMIC and PERCPU_REF_INIT_DEAD flags),
      and there are few other exceptions.
      
      But in most cases there is no way back, and once the reference counter
      is switched to the atomic mode, there is no reason to wait for
      percpu_ref_exit() to release the percpu memory. Of course, the size
      of a single counter is not so big, but because it can pin the whole
      percpu block in memory, the memory footprint can be noticeable
      (e.g. on my 32 CPUs machine a percpu block is 8Mb large).
      
      To make releasing of the percpu memory as early as possible, let's
      introduce the PERCPU_REF_ALLOW_REINIT flag with the following semantics:
      it has to be set in order to switch a percpu reference counter to the
      percpu mode after the initialization. PERCPU_REF_INIT_ATOMIC and
      PERCPU_REF_INIT_DEAD flags will implicitly assume PERCPU_REF_ALLOW_REINIT.
      
      This patch doesn't introduce any functional change to avoid any
      regressions. It will be done later in the patchset after adjusting
      all call sites, which are reviving percpu counters.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      02f53f5e
    • J
      io_uring: add support for IORING_OP_CLOSE · 7a194789
      Jens Axboe 提交于
      to #26323588
      
      commit b5dba59e0cf7e2cc4d3b3b1ac5fe81ddf21959eb upstream.
      
      This works just like close(2), unsurprisingly. We remove the file
      descriptor and post the completion inline, then offload the actual
      (potential) last file put to async context.
      
      Mark the async part of this work as uncancellable, as we really must
      guarantee that the latter part of the close is run.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      7a194789
    • T
      binder: fix use-after-free due to ksys_close() during fdget() · b0775049
      Todd Kjos 提交于
      to #26323588
      
      Cherry-pick from commit 80cd795630d6526ba729a089a435bf74a57af927 upstream.
      
      44d8047f1d8 ("binder: use standard functions to allocate fds")
      exposed a pre-existing issue in the binder driver.
      
      fdget() is used in ksys_ioctl() as a performance optimization.
      One of the rules associated with fdget() is that ksys_close() must
      not be called between the fdget() and the fdput(). There is a case
      where this requirement is not met in the binder driver which results
      in the reference count dropping to 0 when the device is still in
      use. This can result in use-after-free or other issues.
      
      If userpace has passed a file-descriptor for the binder driver using
      a BINDER_TYPE_FDA object, then kys_close() is called on it when
      handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
      the assumptions for using fdget().
      
      The problem is fixed by deferring the close using task_work_add(). A
      new variant of __close_fd() was created that returns a struct file
      with a reference. The fput() is deferred instead of using ksys_close().
      
      Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds")
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NTodd Kjos <tkjos@google.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b0775049
    • J
      io-wq: add support for uncancellable work · 9b71b0f1
      Jens Axboe 提交于
      to #26323588
      
      commit 0c9d5ccd26a004f59333c06fbbb98f9cb1eed93d upstream.
      
      Not all work can be cancelled, some of it we may need to guarantee
      that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
      on work that must not be cancelled. Note that the caller work function
      must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
      IO_WQ_WORK_CANCEL.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      9b71b0f1
    • J
      io_uring: add support for IORING_OP_OPENAT · dd7ff1b1
      Jens Axboe 提交于
      to #26323588
      
      commit 15b71abe7b52df214785dde0de9f581cc0216d17 upstream.
      
      This works just like openat(2), except it can be performed async. For
      the normal case of a non-blocking path lookup this will complete
      inline. If we have to do IO to perform the open, it'll be done from
      async context.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      dd7ff1b1
    • J
      fs: make build_open_flags() available internally · e06eb6e5
      Jens Axboe 提交于
      to #26323588
      
      commit 35cb6d54c1d5daf1d1ed585ef5ce4557e7ab284c upstream.
      
      This is a prep patch for supporting non-blocking open from io_uring.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e06eb6e5
    • A
      selftests: add openat2(2) selftests · 5f7d3758
      Aleksa Sarai 提交于
      to #26323588
      
      commit b28a10aedcd4d175470171a32f4f20b0a60a612b upstream.
      
      Test all of the various openat2(2) flags. A small stress-test of a
      symlink-rename attack is included to show that the protections against
      ".."-based attacks are sufficient.
      
      The main things these self-tests are enforcing are:
      
        * The struct+usize ABI for openat2(2) and copy_struct_from_user() to
          ensure that upgrades will be handled gracefully (in addition,
          ensuring that misaligned structures are also handled correctly).
      
        * The -EINVAL checks for openat2(2) are all correctly handled to avoid
          userspace passing unknown or conflicting flag sets (most
          importantly, ensuring that invalid flag combinations are checked).
      
        * All of the RESOLVE_* semantics (including errno values) are
          correctly handled with various combinations of paths and flags.
      
        * RESOLVE_IN_ROOT correctly protects against the symlink rename(2)
          attack that has been responsible for several CVEs (and likely will
          be responsible for several more).
      
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5f7d3758
    • A
      open: introduce openat2(2) syscall · 5b9369e5
      Aleksa Sarai 提交于
      to #26323588
      
      commit fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179 upstream.
      
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVsSuggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5b9369e5
    • A
      lib: introduce copy_struct_from_user() helper · 915ddf4c
      Aleksa Sarai 提交于
      to #26323588
      
      commit f5a1a536fa14895ccff4e94e6a5af90901ce86aa upstream.
      
      A common pattern for syscall extensions is increasing the size of a
      struct passed from userspace, such that the zero-value of the new fields
      result in the old kernel behaviour (allowing for a mix of userspace and
      kernel vintages to operate on one another in most cases).
      
      While this interface exists for communication in both directions, only
      one interface is straightforward to have reasonable semantics for
      (userspace passing a struct to the kernel). For kernel returns to
      userspace, what the correct semantics are (whether there should be an
      error if userspace is unaware of a new extension) is very
      syscall-dependent and thus probably cannot be unified between syscalls
      (a good example of this problem is [1]).
      
      Previously there was no common lib/ function that implemented
      the necessary extension-checking semantics (and different syscalls
      implemented them slightly differently or incompletely[2]). Future
      patches replace common uses of this pattern to make use of
      copy_struct_from_user().
      
      Some in-kernel selftests that insure that the handling of alignment and
      various byte patterns are all handled identically to memchr_inv() usage.
      
      [1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
           robustify sched_read_attr() ABI logic and code")
      
      [2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
           similar checks to copy_struct_from_user() while rt_sigprocmask(2)
           always rejects differently-sized struct arguments.
      Suggested-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/20191001011055.19283-2-cyphar@cyphar.comSigned-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      915ddf4c
    • A
      namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution · 414ddd93
      Aleksa Sarai 提交于
      to #26323588
      
      commit ab87f9a56c8ee9fa6856cb13d8f2905db913baae upstream.
      
      Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution
      (in the case of LOOKUP_BENEATH the resolution will still fail if ".."
      resolution would resolve a path outside of the root -- while
      LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are
      still disallowed entirely[*].
      
      As Jann explains[1,2], the need for this patch (and the original no-".."
      restriction) is explained by observing there is a fairly easy-to-exploit
      race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and
      LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be
      used to "skip over" nd->root and thus escape to the filesystem above
      nd->root.
      
        thread1 [attacker]:
          for (;;)
            renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
        thread2 [victim]:
          for (;;)
            openat2(dirb, "b/c/../../etc/shadow",
                    { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );
      
      With fairly significant regularity, thread2 will resolve to
      "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
      (though somewhat more privileged) attack using MS_MOVE.
      
      With this patch, such cases will be detected *during* ".." resolution
      and will return -EAGAIN for userspace to decide to either retry or abort
      the lookup. It should be noted that ".." is the weak point of chroot(2)
      -- walking *into* a subdirectory tautologically cannot result in you
      walking *outside* nd->root (except through a bind-mount or magic-link).
      There is also no other way for a directory's parent to change (which is
      the primary worry with ".." resolution here) other than a rename or
      MS_MOVE.
      
      The primary reason for deferring to userspace with -EAGAIN is that an
      in-kernel retry loop (or doing a path_is_under() check after re-taking
      the relevant seqlocks) can become unreasonably expensive on machines
      with lots of VFS activity (nfsd can cause lots of rename_lock updates).
      Thus it should be up to userspace how many times they wish to retry the
      lookup -- the selftests for this attack indicate that there is a ~35%
      chance of the lookup succeeding on the first try even with an attacker
      thrashing rename_lock.
      
      A variant of the above attack is included in the selftests for
      openat2(2) later in this patch series. I've run this test on several
      machines for several days and no instances of a breakout were detected.
      While this is not concrete proof that this is safe, when combined with
      the above argument it should lend some trustworthiness to this
      construction.
      
      [*] It may be acceptable in the future to do a path_is_under() check for
          magic-links after they are resolved. However this seems unlikely to
          be a feature that people *really* need -- it can be added later if
          it turns out a lot of people want it.
      
      [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      [2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NJann Horn <jannh@google.com>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      414ddd93
    • A
      namei: LOOKUP_IN_ROOT: chroot-like scoped resolution · 685c60ef
      Aleksa Sarai 提交于
      to #26323588
      
      commit 8db52c7e7ee1bd861b6096fcafc0fe7d0f24a994 upstream.
      
      /* Background. */
      Container runtimes or other administrative management processes will
      often interact with root filesystems while in the host mount namespace,
      because the cost of doing a chroot(2) on every operation is too
      prohibitive (especially in Go, which cannot safely use vfork). However,
      a malicious program can trick the management process into doing
      operations on files outside of the root filesystem through careful
      crafting of symlinks.
      
      Most programs that need this feature have attempted to make this process
      safe, by doing all of the path resolution in userspace (with symlinks
      being scoped to the root of the malicious root filesystem).
      Unfortunately, this method is prone to foot-guns and usually such
      implementations have subtle security bugs.
      
      Thus, what userspace needs is a way to resolve a path as though it were
      in a chroot(2) -- with all absolute symlinks being resolved relative to
      the dirfd root (and ".." components being stuck under the dirfd root).
      It is much simpler and more straight-forward to provide this
      functionality in-kernel (because it can be done far more cheaply and
      correctly).
      
      More classical applications that also have this problem (which have
      their own potentially buggy userspace path sanitisation code) include
      web servers, archive extraction tools, network file servers, and so on.
      
      /* Userspace API. */
      LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_IN_ROOT applies to all components of the path.
      
      With LOOKUP_IN_ROOT, any path component which attempts to cross the
      starting point of the pathname lookup (the dirfd passed to openat) will
      remain at the starting point. Thus, all absolute paths and symlinks will
      be scoped within the starting point.
      
      There is a slight change in behaviour regarding pathnames -- if the
      pathname is absolute then the dirfd is still used as the root of
      resolution of LOOKUP_IN_ROOT is specified (this is to avoid obvious
      foot-guns, at the cost of a minor API inconsistency).
      
      As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
      LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
      will be lifted in a future patch, but requires more work to ensure that
      permitting ".." is done safely.
      
      Magic-link jumps are also blocked, because they can beam the path lookup
      across the starting point. It would be possible to detect and block
      only the "bad" crossings with path_is_under() checks, but it's unclear
      whether it makes sense to permit magic-links at all. However, userspace
      is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
      magic-link crossing is entirely disabled.
      
      /* Testing. */
      LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      685c60ef
    • A
      namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution · 31084d4b
      Aleksa Sarai 提交于
      to #26323588
      
      commit adb21d2b526f7f196b2f3fdca97d80ba05dd14a0 upstream.
      
      /* Background. */
      There are many circumstances when userspace wants to resolve a path and
      ensure that it doesn't go outside of a particular root directory during
      resolution. Obvious examples include archive extraction tools, as well as
      other security-conscious userspace programs. FreeBSD spun out O_BENEATH
      from their Capsicum project[1,2], so it also seems reasonable to
      implement similar functionality for Linux.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
      variation on David Drysdale's O_BENEATH patchset[4], which in turn was
      based on the Capsicum project[5]).
      
      /* Userspace API. */
      LOOKUP_BENEATH will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_BENEATH applies to all components of the path.
      
      With LOOKUP_BENEATH, any path component which attempts to "escape" the
      starting point of the filesystem lookup (the dirfd passed to openat)
      will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.
      
      Due to a security concern brought up by Jann[6], any ".." path
      components are also blocked. This restriction will be lifted in a future
      patch, but requires more work to ensure that permitting ".." is done
      safely.
      
      Magic-link jumps are also blocked, because they can beam the path lookup
      across the starting point. It would be possible to detect and block
      only the "bad" crossings with path_is_under() checks, but it's unclear
      whether it makes sense to permit magic-links at all. However, userspace
      is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
      magic-link crossing is entirely disabled.
      
      /* Testing. */
      LOOKUP_BENEATH is tested as part of the openat2(2) selftests.
      
      [1]: https://reviews.freebsd.org/D2808
      [2]: https://reviews.freebsd.org/D17547
      [3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      [6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NDavid Drysdale <drysdale@google.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      31084d4b
    • A
      fs/namei.c: keep track of nd->root refcount status · 0a73d163
      Al Viro 提交于
      to #26323588
      
      commit 84a2bd39405ffd5fa6d6d77e408c5b9210da98de upstream.
      
      The rules for nd->root are messy:
      	* if we have LOOKUP_ROOT, it doesn't contribute to refcounts
      	* if we have LOOKUP_RCU, it doesn't contribute to refcounts
      	* if nd->root.mnt is NULL, it doesn't contribute to refcounts
      	* otherwise it does contribute
      
      terminate_walk() needs to drop the references if they are contributing.
      So everything else should be careful not to confuse it, leading to
      rather convoluted code.
      
      It's easier to keep track of whether we'd grabbed the reference(s)
      explicitly.  Use a new flag for that.  Don't bother with zeroing
      nd->root.mnt on unlazy failures and in terminate_walk - it's not
      needed anymore (terminate_walk() won't care and the next path_init()
      will zero nd->root in !LOOKUP_ROOT case anyway).
      
      Resulting rules for nd->root refcounts are much simpler: they are
      contributing iff LOOKUP_ROOT_GRABBED is set in nd->flags.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0a73d163
    • A
      fs/namei.c: new helper - legitimize_root() · 40107f78
      Al Viro 提交于
      to #26323588
      
      commit ee594bfff389aa9105f713135211c0da736e5698 upstream.
      
      identical logics in unlazy_walk() and unlazy_child()
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      40107f78
    • A
      namei: LOOKUP_NO_XDEV: block mountpoint crossing · 38549ac1
      Aleksa Sarai 提交于
      to #26323588
      
      commit 72ba29297e1439efaa54d9125b866ae9d15df339 upstream.
      
      /* Background. */
      The need to contain path operations within a mountpoint has been a
      long-standing usecase that userspace has historically implemented
      manually with liberal usage of stat(). find, rsync, tar and
      many other programs implement these semantics -- but it'd be much
      simpler to have a fool-proof way of refusing to open a path if it
      crosses a mountpoint.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
      variation on David Drysdale's O_BENEATH patchset[2], which in turn was
      based on the Capsicum project[3]).
      
      /* Userspace API. */
      LOOKUP_NO_XDEV will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_XDEV applies to all components of the path.
      
      With LOOKUP_NO_XDEV, any path component which crosses a mount-point
      during path resolution (including "..") will yield an -EXDEV. Absolute
      paths, absolute symlinks, and magic-links will only yield an -EXDEV if
      the jump involved changing mount-points.
      
      /* Testing. */
      LOOKUP_NO_XDEV is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NDavid Drysdale <drysdale@google.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      38549ac1
    • A
      namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution · 773b3516
      Aleksa Sarai 提交于
      to #26323588
      
      commit 4b99d4996979d582859c5a49072e92de124bf691 upstream.
      
      /* Background. */
      There has always been a special class of symlink-like objects in procfs
      (and a few other pseudo-filesystems) which allow for non-lexical
      resolution of paths using nd_jump_link(). These "magic-links" do not
      follow traditional mount namespace boundaries, and have been used
      consistently in container escape attacks because they can be used to
      trick unsuspecting privileged processes into resolving unexpected paths.
      
      It is also non-trivial for userspace to unambiguously avoid resolving
      magic-links, because they do not have a reliable indication that they
      are a magic-link (in order to verify them you'd have to manually open
      the path given by readlink(2) and then verify that the two file
      descriptors reference the same underlying file, which is plagued with
      possible race conditions or supplementary attack scenarios).
      
      It would therefore be very helpful for userspace to be able to avoid
      these symlinks easily, thus hopefully removing a tool from attackers'
      toolboxes.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
      variation on David Drysdale's O_BENEATH patchset[2], which in turn was
      based on the Capsicum project[3]).
      
      /* Userspace API. */
      LOOKUP_NO_MAGICLINKS will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_MAGICLINKS applies to all components of the path.
      
      With LOOKUP_NO_MAGICLINKS, any magic-link path component encountered
      during path resolution will yield -ELOOP. The handling of ~LOOKUP_FOLLOW
      for a trailing magic-link is identical to LOOKUP_NO_SYMLINKS.
      
      LOOKUP_NO_SYMLINKS implies LOOKUP_NO_MAGICLINKS.
      
      /* Testing. */
      LOOKUP_NO_MAGICLINKS is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NDavid Drysdale <drysdale@google.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      773b3516
    • A
      namei: LOOKUP_NO_SYMLINKS: block symlink resolution · 35e8a4ee
      Aleksa Sarai 提交于
      to #26323588
      
      commit 278121417a72d87fb29dd8c48801f80821e8f75a upstream.
      
      /* Background. */
      Userspace cannot easily resolve a path without resolving symlinks, and
      would have to manually resolve each path component with O_PATH and
      O_NOFOLLOW. This is clearly inefficient, and can be fairly easy to screw
      up (resulting in possible security bugs). Linus has mentioned that Git
      has a particular need for this kind of flag[1]. It also resolves a
      fairly long-standing perceived deficiency in O_NOFOLLOw -- that it only
      blocks the opening of trailing symlinks.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[2] (which was a
      variation on David Drysdale's O_BENEATH patchset[3], which in turn was
      based on the Capsicum project[4]).
      
      /* Userspace API. */
      LOOKUP_NO_SYMLINKS will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_SYMLINKS applies to all components of the path.
      
      With LOOKUP_NO_SYMLINKS, any symlink path component encountered during
      path resolution will yield -ELOOP. If the trailing component is a
      symlink (and no other components were symlinks), then O_PATH|O_NOFOLLOW
      will not error out and will instead provide a handle to the trailing
      symlink -- without resolving it.
      
      /* Testing. */
      LOOKUP_NO_SYMLINKS is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/CA+55aFyOKM7DW7+0sdDFKdZFXgptb5r1id9=Wvhd8AgSP7qjwQ@mail.gmail.com/
      [2]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [3]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [4]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      35e8a4ee
    • A
      namei: allow set_root() to produce errors · 53a5271b
      Aleksa Sarai 提交于
      to #26323588
      
      commit 740a16782750a5b6c7d1609a9c09641ce6753ea6 upstream.
      
      For LOOKUP_BENEATH and LOOKUP_IN_ROOT it is necessary to ensure that
      set_root() is never called, and thus (for hardening purposes) it should
      return an error rather than permit a breakout from the root. In
      addition, move all of the repetitive set_root() calls to nd_jump_root().
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      53a5271b
    • A
      namei: allow nd_jump_link() to produce errors · b63f56c4
      Aleksa Sarai 提交于
      to #26323588
      
      commit 1bc82070fa2763bdca626fa8bde72b35f11e8960 upstream.
      
      In preparation for LOOKUP_NO_MAGICLINKS, it's necessary to add the
      ability for nd_jump_link() to return an error which the corresponding
      get_link() caller must propogate back up to the VFS.
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b63f56c4
    • A
      nsfs: clean-up ns_get_path() signature to return int · 39f702ab
      Aleksa Sarai 提交于
      to #26323588
      
      commit ce623f89872df4253719be71531116751eeab85f upstream.
      
      ns_get_path() and ns_get_path_cb() only ever return either NULL or an
      ERR_PTR. It is far more idiomatic to simply return an integer, and it
      makes all of the callers of ns_get_path() more straightforward to read.
      
      Fixes: e149ed2b ("take the targets of /proc/*/ns/* symlinks to separate fs")
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      39f702ab
    • A
      namei: only return -ECHILD from follow_dotdot_rcu() · 8eef64a2
      Aleksa Sarai 提交于
      to #26323588
      
      commit 2b98149c2377bff12be5dd3ce02ae0506e2dd613 upstream.
      
      It's over-zealous to return hard errors under RCU-walk here, given that
      a REF-walk will be triggered for all other cases handling ".." under
      RCU.
      
      The original purpose of this check was to ensure that if a rename occurs
      such that a directory is moved outside of the bind-mount which the
      resolution started in, it would be detected and blocked to avoid being
      able to mess with paths outside of the bind-mount. However, triggering a
      new REF-walk is just as effective a solution.
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      8eef64a2
    • J
      io_uring: add support for fallocate() · 41076b20
      Jens Axboe 提交于
      to #26323588
      
      commit d63d1b5edb7b832210bfde587ba9e7549fa064eb upstream.
      
      This exposes fallocate(2) through io_uring.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      41076b20
  2. 27 5月, 2020 15 次提交