1. 28 5月, 2020 14 次提交
    • J
      io_uring: add support for IORING_OP_CLOSE · 7a194789
      Jens Axboe 提交于
      to #26323588
      
      commit b5dba59e0cf7e2cc4d3b3b1ac5fe81ddf21959eb upstream.
      
      This works just like close(2), unsurprisingly. We remove the file
      descriptor and post the completion inline, then offload the actual
      (potential) last file put to async context.
      
      Mark the async part of this work as uncancellable, as we really must
      guarantee that the latter part of the close is run.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      7a194789
    • T
      binder: fix use-after-free due to ksys_close() during fdget() · b0775049
      Todd Kjos 提交于
      to #26323588
      
      Cherry-pick from commit 80cd795630d6526ba729a089a435bf74a57af927 upstream.
      
      44d8047f1d8 ("binder: use standard functions to allocate fds")
      exposed a pre-existing issue in the binder driver.
      
      fdget() is used in ksys_ioctl() as a performance optimization.
      One of the rules associated with fdget() is that ksys_close() must
      not be called between the fdget() and the fdput(). There is a case
      where this requirement is not met in the binder driver which results
      in the reference count dropping to 0 when the device is still in
      use. This can result in use-after-free or other issues.
      
      If userpace has passed a file-descriptor for the binder driver using
      a BINDER_TYPE_FDA object, then kys_close() is called on it when
      handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
      the assumptions for using fdget().
      
      The problem is fixed by deferring the close using task_work_add(). A
      new variant of __close_fd() was created that returns a struct file
      with a reference. The fput() is deferred instead of using ksys_close().
      
      Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds")
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NTodd Kjos <tkjos@google.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b0775049
    • J
      io_uring: add support for IORING_OP_OPENAT · dd7ff1b1
      Jens Axboe 提交于
      to #26323588
      
      commit 15b71abe7b52df214785dde0de9f581cc0216d17 upstream.
      
      This works just like openat(2), except it can be performed async. For
      the normal case of a non-blocking path lookup this will complete
      inline. If we have to do IO to perform the open, it'll be done from
      async context.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      dd7ff1b1
    • A
      open: introduce openat2(2) syscall · 5b9369e5
      Aleksa Sarai 提交于
      to #26323588
      
      commit fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179 upstream.
      
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVsSuggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5b9369e5
    • A
      lib: introduce copy_struct_from_user() helper · 915ddf4c
      Aleksa Sarai 提交于
      to #26323588
      
      commit f5a1a536fa14895ccff4e94e6a5af90901ce86aa upstream.
      
      A common pattern for syscall extensions is increasing the size of a
      struct passed from userspace, such that the zero-value of the new fields
      result in the old kernel behaviour (allowing for a mix of userspace and
      kernel vintages to operate on one another in most cases).
      
      While this interface exists for communication in both directions, only
      one interface is straightforward to have reasonable semantics for
      (userspace passing a struct to the kernel). For kernel returns to
      userspace, what the correct semantics are (whether there should be an
      error if userspace is unaware of a new extension) is very
      syscall-dependent and thus probably cannot be unified between syscalls
      (a good example of this problem is [1]).
      
      Previously there was no common lib/ function that implemented
      the necessary extension-checking semantics (and different syscalls
      implemented them slightly differently or incompletely[2]). Future
      patches replace common uses of this pattern to make use of
      copy_struct_from_user().
      
      Some in-kernel selftests that insure that the handling of alignment and
      various byte patterns are all handled identically to memchr_inv() usage.
      
      [1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
           robustify sched_read_attr() ABI logic and code")
      
      [2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
           similar checks to copy_struct_from_user() while rt_sigprocmask(2)
           always rejects differently-sized struct arguments.
      Suggested-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/20191001011055.19283-2-cyphar@cyphar.comSigned-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      915ddf4c
    • A
      namei: LOOKUP_IN_ROOT: chroot-like scoped resolution · 685c60ef
      Aleksa Sarai 提交于
      to #26323588
      
      commit 8db52c7e7ee1bd861b6096fcafc0fe7d0f24a994 upstream.
      
      /* Background. */
      Container runtimes or other administrative management processes will
      often interact with root filesystems while in the host mount namespace,
      because the cost of doing a chroot(2) on every operation is too
      prohibitive (especially in Go, which cannot safely use vfork). However,
      a malicious program can trick the management process into doing
      operations on files outside of the root filesystem through careful
      crafting of symlinks.
      
      Most programs that need this feature have attempted to make this process
      safe, by doing all of the path resolution in userspace (with symlinks
      being scoped to the root of the malicious root filesystem).
      Unfortunately, this method is prone to foot-guns and usually such
      implementations have subtle security bugs.
      
      Thus, what userspace needs is a way to resolve a path as though it were
      in a chroot(2) -- with all absolute symlinks being resolved relative to
      the dirfd root (and ".." components being stuck under the dirfd root).
      It is much simpler and more straight-forward to provide this
      functionality in-kernel (because it can be done far more cheaply and
      correctly).
      
      More classical applications that also have this problem (which have
      their own potentially buggy userspace path sanitisation code) include
      web servers, archive extraction tools, network file servers, and so on.
      
      /* Userspace API. */
      LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_IN_ROOT applies to all components of the path.
      
      With LOOKUP_IN_ROOT, any path component which attempts to cross the
      starting point of the pathname lookup (the dirfd passed to openat) will
      remain at the starting point. Thus, all absolute paths and symlinks will
      be scoped within the starting point.
      
      There is a slight change in behaviour regarding pathnames -- if the
      pathname is absolute then the dirfd is still used as the root of
      resolution of LOOKUP_IN_ROOT is specified (this is to avoid obvious
      foot-guns, at the cost of a minor API inconsistency).
      
      As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
      LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
      will be lifted in a future patch, but requires more work to ensure that
      permitting ".." is done safely.
      
      Magic-link jumps are also blocked, because they can beam the path lookup
      across the starting point. It would be possible to detect and block
      only the "bad" crossings with path_is_under() checks, but it's unclear
      whether it makes sense to permit magic-links at all. However, userspace
      is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
      magic-link crossing is entirely disabled.
      
      /* Testing. */
      LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      685c60ef
    • A
      namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution · 31084d4b
      Aleksa Sarai 提交于
      to #26323588
      
      commit adb21d2b526f7f196b2f3fdca97d80ba05dd14a0 upstream.
      
      /* Background. */
      There are many circumstances when userspace wants to resolve a path and
      ensure that it doesn't go outside of a particular root directory during
      resolution. Obvious examples include archive extraction tools, as well as
      other security-conscious userspace programs. FreeBSD spun out O_BENEATH
      from their Capsicum project[1,2], so it also seems reasonable to
      implement similar functionality for Linux.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
      variation on David Drysdale's O_BENEATH patchset[4], which in turn was
      based on the Capsicum project[5]).
      
      /* Userspace API. */
      LOOKUP_BENEATH will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_BENEATH applies to all components of the path.
      
      With LOOKUP_BENEATH, any path component which attempts to "escape" the
      starting point of the filesystem lookup (the dirfd passed to openat)
      will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.
      
      Due to a security concern brought up by Jann[6], any ".." path
      components are also blocked. This restriction will be lifted in a future
      patch, but requires more work to ensure that permitting ".." is done
      safely.
      
      Magic-link jumps are also blocked, because they can beam the path lookup
      across the starting point. It would be possible to detect and block
      only the "bad" crossings with path_is_under() checks, but it's unclear
      whether it makes sense to permit magic-links at all. However, userspace
      is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
      magic-link crossing is entirely disabled.
      
      /* Testing. */
      LOOKUP_BENEATH is tested as part of the openat2(2) selftests.
      
      [1]: https://reviews.freebsd.org/D2808
      [2]: https://reviews.freebsd.org/D17547
      [3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      [6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NDavid Drysdale <drysdale@google.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      31084d4b
    • A
      fs/namei.c: keep track of nd->root refcount status · 0a73d163
      Al Viro 提交于
      to #26323588
      
      commit 84a2bd39405ffd5fa6d6d77e408c5b9210da98de upstream.
      
      The rules for nd->root are messy:
      	* if we have LOOKUP_ROOT, it doesn't contribute to refcounts
      	* if we have LOOKUP_RCU, it doesn't contribute to refcounts
      	* if nd->root.mnt is NULL, it doesn't contribute to refcounts
      	* otherwise it does contribute
      
      terminate_walk() needs to drop the references if they are contributing.
      So everything else should be careful not to confuse it, leading to
      rather convoluted code.
      
      It's easier to keep track of whether we'd grabbed the reference(s)
      explicitly.  Use a new flag for that.  Don't bother with zeroing
      nd->root.mnt on unlazy failures and in terminate_walk - it's not
      needed anymore (terminate_walk() won't care and the next path_init()
      will zero nd->root in !LOOKUP_ROOT case anyway).
      
      Resulting rules for nd->root refcounts are much simpler: they are
      contributing iff LOOKUP_ROOT_GRABBED is set in nd->flags.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0a73d163
    • A
      namei: LOOKUP_NO_XDEV: block mountpoint crossing · 38549ac1
      Aleksa Sarai 提交于
      to #26323588
      
      commit 72ba29297e1439efaa54d9125b866ae9d15df339 upstream.
      
      /* Background. */
      The need to contain path operations within a mountpoint has been a
      long-standing usecase that userspace has historically implemented
      manually with liberal usage of stat(). find, rsync, tar and
      many other programs implement these semantics -- but it'd be much
      simpler to have a fool-proof way of refusing to open a path if it
      crosses a mountpoint.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
      variation on David Drysdale's O_BENEATH patchset[2], which in turn was
      based on the Capsicum project[3]).
      
      /* Userspace API. */
      LOOKUP_NO_XDEV will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_XDEV applies to all components of the path.
      
      With LOOKUP_NO_XDEV, any path component which crosses a mount-point
      during path resolution (including "..") will yield an -EXDEV. Absolute
      paths, absolute symlinks, and magic-links will only yield an -EXDEV if
      the jump involved changing mount-points.
      
      /* Testing. */
      LOOKUP_NO_XDEV is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NDavid Drysdale <drysdale@google.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      38549ac1
    • A
      namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution · 773b3516
      Aleksa Sarai 提交于
      to #26323588
      
      commit 4b99d4996979d582859c5a49072e92de124bf691 upstream.
      
      /* Background. */
      There has always been a special class of symlink-like objects in procfs
      (and a few other pseudo-filesystems) which allow for non-lexical
      resolution of paths using nd_jump_link(). These "magic-links" do not
      follow traditional mount namespace boundaries, and have been used
      consistently in container escape attacks because they can be used to
      trick unsuspecting privileged processes into resolving unexpected paths.
      
      It is also non-trivial for userspace to unambiguously avoid resolving
      magic-links, because they do not have a reliable indication that they
      are a magic-link (in order to verify them you'd have to manually open
      the path given by readlink(2) and then verify that the two file
      descriptors reference the same underlying file, which is plagued with
      possible race conditions or supplementary attack scenarios).
      
      It would therefore be very helpful for userspace to be able to avoid
      these symlinks easily, thus hopefully removing a tool from attackers'
      toolboxes.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
      variation on David Drysdale's O_BENEATH patchset[2], which in turn was
      based on the Capsicum project[3]).
      
      /* Userspace API. */
      LOOKUP_NO_MAGICLINKS will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_MAGICLINKS applies to all components of the path.
      
      With LOOKUP_NO_MAGICLINKS, any magic-link path component encountered
      during path resolution will yield -ELOOP. The handling of ~LOOKUP_FOLLOW
      for a trailing magic-link is identical to LOOKUP_NO_SYMLINKS.
      
      LOOKUP_NO_SYMLINKS implies LOOKUP_NO_MAGICLINKS.
      
      /* Testing. */
      LOOKUP_NO_MAGICLINKS is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NDavid Drysdale <drysdale@google.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      773b3516
    • A
      namei: LOOKUP_NO_SYMLINKS: block symlink resolution · 35e8a4ee
      Aleksa Sarai 提交于
      to #26323588
      
      commit 278121417a72d87fb29dd8c48801f80821e8f75a upstream.
      
      /* Background. */
      Userspace cannot easily resolve a path without resolving symlinks, and
      would have to manually resolve each path component with O_PATH and
      O_NOFOLLOW. This is clearly inefficient, and can be fairly easy to screw
      up (resulting in possible security bugs). Linus has mentioned that Git
      has a particular need for this kind of flag[1]. It also resolves a
      fairly long-standing perceived deficiency in O_NOFOLLOw -- that it only
      blocks the opening of trailing symlinks.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[2] (which was a
      variation on David Drysdale's O_BENEATH patchset[3], which in turn was
      based on the Capsicum project[4]).
      
      /* Userspace API. */
      LOOKUP_NO_SYMLINKS will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_NO_SYMLINKS applies to all components of the path.
      
      With LOOKUP_NO_SYMLINKS, any symlink path component encountered during
      path resolution will yield -ELOOP. If the trailing component is a
      symlink (and no other components were symlinks), then O_PATH|O_NOFOLLOW
      will not error out and will instead provide a handle to the trailing
      symlink -- without resolving it.
      
      /* Testing. */
      LOOKUP_NO_SYMLINKS is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/CA+55aFyOKM7DW7+0sdDFKdZFXgptb5r1id9=Wvhd8AgSP7qjwQ@mail.gmail.com/
      [2]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [3]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [4]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      35e8a4ee
    • A
      namei: allow nd_jump_link() to produce errors · b63f56c4
      Aleksa Sarai 提交于
      to #26323588
      
      commit 1bc82070fa2763bdca626fa8bde72b35f11e8960 upstream.
      
      In preparation for LOOKUP_NO_MAGICLINKS, it's necessary to add the
      ability for nd_jump_link() to return an error which the corresponding
      get_link() caller must propogate back up to the VFS.
      Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b63f56c4
    • A
      nsfs: clean-up ns_get_path() signature to return int · 39f702ab
      Aleksa Sarai 提交于
      to #26323588
      
      commit ce623f89872df4253719be71531116751eeab85f upstream.
      
      ns_get_path() and ns_get_path_cb() only ever return either NULL or an
      ERR_PTR. It is far more idiomatic to simply return an integer, and it
      makes all of the callers of ns_get_path() more straightforward to read.
      
      Fixes: e149ed2b ("take the targets of /proc/*/ns/* symlinks to separate fs")
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      39f702ab
    • J
      io_uring: add support for fallocate() · 41076b20
      Jens Axboe 提交于
      to #26323588
      
      commit d63d1b5edb7b832210bfde587ba9e7549fa064eb upstream.
      
      This exposes fallocate(2) through io_uring.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      41076b20
  2. 27 5月, 2020 9 次提交
  3. 15 5月, 2020 1 次提交
    • X
      alinux: add tcprt framework to kernel · f84e8fa0
      xuanzhuo 提交于
      to #26353046
      
      TcpRT: Instrument and Diagnostic Analysis System for Service Quality
      of Cloud Databases at Massive Scale in Real-time.
      
      It can also provide information for all request/response services. Such as
      HTTP request.
      
      This is the kernel framework for tcprt, more work needs tcprt module
      support.
      
      TcpRt module should call tcp_unregitsert_rt before rmmod.
      
      TcpRt hooks will be called when sock init, recv data, send data,
      packet acked and socket been destroy. The private data save to
      icsk->icsk_tcp_rt_priv.
      Reviewed-by: NCambda Zhu <cambda@linux.alibaba.com>
      Acked-by: NDust Li <dust.li@linux.alibaba.com>
      Signed-off-by: Nxuanzhuo <xuanzhuo@linux.alibaba.com>
      f84e8fa0
  4. 08 5月, 2020 1 次提交
    • N
      mm: zero remaining unavailable struct pages · f8521831
      Naoya Horiguchi 提交于
      to #26809468
      
      commit 907ec5fca3dc38d37737de826f06f25b063aa08e upstream.
      
      Patch series "mm: Fix for movable_node boot option", v3.
      
      This patch series contains a fix for the movable_node boot option issue
      which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM
      regions into memblock.reserved").
      
      The commit breaks the option because it changed the memory gap range to
      reserved memblock.  So, the node is marked as Normal zone even if the SRAT
      has Hot pluggable affinity.
      
      First and second patch fix the original issue which the commit tried to
      fix, then revert the commit.
      
      This patch (of 3):
      
      There is a kernel panic that is triggered when reading /proc/kpageflags on
      the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
      
        BUG: unable to handle kernel paging request at fffffffffffffffe
        PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
        RIP: 0010:stable_page_flags+0x27/0x3c0
        Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
        RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
        RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
        R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
        R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
        FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
        Call Trace:
         kpageflags_read+0xc7/0x120
         proc_reg_read+0x3c/0x60
         __vfs_read+0x36/0x170
         vfs_read+0x89/0x130
         ksys_pread64+0x71/0x90
         do_syscall_64+0x5b/0x160
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7efc42e75e23
        Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
      
      According to kernel bisection, this problem became visible due to commit
      f7f99100 which changes how struct pages are initialized.
      
      Memblock layout affects the pfn ranges covered by node/zone.  Consider
      that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
      default (no memmap= given) memblock layout is like below:
      
        MEMBLOCK configuration:
         memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x4
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
         memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      If you give memmap=1G!4G (so it just covers memory[0x2]),
      the range [0x100000000-0x13fffffff] is gone:
      
        MEMBLOCK configuration:
         memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x3
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      This causes shrinking node 0's pfn range because it is calculated by the
      address range of memblock.memory.  So some of struct pages in the gap
      range are left uninitialized.
      
      We have a function zero_resv_unavail() which does zeroing the struct pages
      outside memblock.memory, but currently it covers only the reserved
      unavailable range (i.e.  memblock.memory && !memblock.reserved).  This
      patch extends it to cover all unavailable range, which fixes the reported
      issue.
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Tested-by: NOscar Salvador <osalvador@suse.de>
      Tested-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      f8521831
  5. 06 5月, 2020 3 次提交
  6. 30 4月, 2020 5 次提交
  7. 29 4月, 2020 2 次提交
    • A
      new helper: lookup_positive_unlocked() · ec6880e8
      Al Viro 提交于
      fix #27211210
      
      commit 6c2d4798a8d16cf4f3a28c3cd4af4f1dcbbb4d04 upstream.
      
      Most of the callers of lookup_one_len_unlocked() treat negatives are
      ERR_PTR(-ENOENT).  Provide a helper that would do just that.  Note
      that a pinned positive dentry remains positive - it's ->d_inode is
      stable, etc.; a pinned _negative_ dentry can become positive at any
      point as long as you are not holding its parent at least shared.
      So using lookup_one_len_unlocked() needs to be careful;
      lookup_positive_unlocked() is safer and that's what the callers
      end up open-coding anyway.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      ec6880e8
    • A
      fs/namei.c: pull positivity check into follow_managed() · da2c0773
      Al Viro 提交于
      fix #27211210
      
      commit d41efb522e902364ab09c782d511c1bedc388ddd upstream.
      
      There are 4 callers; two proceed to check if result is positive and
      fail with ENOENT if it isn't; one (in handle_lookup_down()) is
      guaranteed to yield positive and one (in lookup_fast()) is _preceded_
      by positivity check.
      
      However, follow_managed() on a negative dentry is a (fairly cheap)
      no-op on anything other than autofs.  And negative autofs dentries
      are never hashed, so lookup_fast() is not going to run into one
      of those.  Moreover, successful follow_managed() on a _positive_
      dentry never yields a negative one (and we significantly rely upon
      that in callers of lookup_fast()).
      
      In other words, we can easily transpose the positivity check and
      the call of follow_managed() in lookup_fast().  And that allows
      to fold the positivity check *into* follow_managed(), simplifying
      life for the code downstream of its calls.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      da2c0773
  8. 28 4月, 2020 1 次提交
    • B
      x86/resctrl: Rename the config option INTEL_RDT to RESCTRL · fce127d5
      Babu Moger 提交于
      to #26613714
      
      commit 6fe07ce35e8ad870ba1cf82e0481e0fc0f526eff upstream.
      
      The resource control feature is supported by both Intel and AMD. So,
      rename CONFIG_INTEL_RDT to the vendor-neutral CONFIG_RESCTRL.
      
      Now CONFIG_RESCTRL will be used for both Intel and AMD to enable
      Resource Control support. Update the texts in config and condition
      accordingly.
      
       [ bp: Simplify Kconfig text. ]
      Signed-off-by: NBabu Moger <babu.moger@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Dmitry Safonov <dima@arista.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <linux-doc@vger.kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: <qianyue.zj@alibaba-inc.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Rian Hunter <rian@alum.mit.edu>
      Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: <xiaochen.shen@intel.com>
      Link: https://lkml.kernel.org/r/20181121202811.4492-9-babu.moger@amd.com
      
      [ Shile: fixed conflict in arch/x86/Kconfig ]
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Tested-by: NWANG Siyuan <Siyuan.Wang@amd.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      fce127d5
  9. 26 4月, 2020 1 次提交
  10. 24 4月, 2020 3 次提交