1. 29 Jun, 2020 7 commits
  2. 24 Jun, 2020 3 commits
  3. 23 Jun, 2020 3 commits
    • Y
      alinux: sched: Add switch for scheduler_tick load tracking · bcaf8afd
      Yihao Wu committed
      to #28739709
      
      When a workload consists of a very large number of short tasks,
      periodic load tracking is unnecessary, because frequent sleeps and
      wake-ups already keep load tracking up to date.
      
      If each of these short tasks runs in its own cgroup, the periodic load
      tracking becomes extremely heavy.
      
      This patch adds a switch to bypass scheduler_tick load tracking. It
      reduces scheduler overhead without sacrificing much load balance
      in this scenario.
      
      Performance Tests:
      
      1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine
      
      	sched overhead(each HT): 0.74% -> 0.48%
      
      	(This test's baseline is from the previous patch)
      
      2) sysbench-threads with 96 threads, running for 5min
      
      	latency_ms 95th: 63.07 -> 54.01
      
      Besides these, no regression is found on our test platform.
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
      bcaf8afd
    • Y
      alinux: sched: Add switch for update_blocked_averages · bb48b716
      Yihao Wu committed
      to #28739709
      
      Unless the workloads are IO-bound, update_blocked_averages doesn't help
      load balancing. This patch adds a switch to bypass update_blocked_averages
      when prior knowledge about the workloads indicates that IO is negligible.
      
      Performance Tests:
      
      1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine
      
      	sched overhead(each HT): 3.78% -> 0.74%
      
      2) cgroup-overhead benchmark in our sched-test suite on a 96-HT Skylake
      
      	overhead: 21.06 -> 18.08
      
      3) unixbench context1 with 96 threads running for 1min
      
      	Score: 15409.40 -> 16821.77
      
      Besides these, individual UnixBench results show some ups and downs, but
      overall UnixBench performance is unchanged.
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
      bb48b716
    • Y
      alinux: mm: thp: add fast_cow switch · 56a432f5
      Yang Shi committed
      task #27327988
      
      The commit ("thp: change CoW semantics for anon-THP") rewrites the THP CoW
      page fault handler to allocate a base page only, but there is a request to
      keep the old behavior available just in case.  So, introduce a new sysfs
      knob, fast_cow, to control the behavior. The default is the new behavior;
      write 0 to the knob to switch to the old behavior.
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      [ caspar: fix checkpatch.pl warnings ]
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      56a432f5
  4. 15 Jun, 2020 2 commits
    • S
      block: Fix use-after-free issue accessing struct io_cq · fba123ba
      Sahitya Tummala committed
      task #28557799
      
      [ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]
      
      There is a potential race between ioc_release_fn() and
      ioc_clear_queue(), shown below, which leads to the kernel
      crash that follows. It can also result in a use-after-free
      issue.
      
      context#1:				context#2:
      ioc_release_fn()			__ioc_clear_queue() gets the same icq
      ->spin_lock(&ioc->lock);		->spin_lock(&ioc->lock);
      ->ioc_destroy_icq(icq);
        ->list_del_init(&icq->q_node);
        ->call_rcu(&icq->__rcu_head,
        	icq_free_icq_rcu);
      ->spin_unlock(&ioc->lock);
      					->ioc_destroy_icq(icq);
      					  ->hlist_del_init(&icq->ioc_node);
      					  This results into below crash as this memory
      					  is now used by icq->__rcu_head in context#1.
      					  There is a chance that icq could be free'd
      					  as well.
      
      22150.386550:   <6> Unable to handle kernel write to read-only memory
      at virtual address ffffffaa8d31ca50
      ...
      Call trace:
      22150.607350:   <2>  ioc_destroy_icq+0x44/0x110
      22150.611202:   <2>  ioc_clear_queue+0xac/0x148
      22150.615056:   <2>  blk_cleanup_queue+0x11c/0x1a0
      22150.619174:   <2>  __scsi_remove_device+0xdc/0x128
      22150.623465:   <2>  scsi_forget_host+0x2c/0x78
      22150.627315:   <2>  scsi_remove_host+0x7c/0x2a0
      22150.631257:   <2>  usb_stor_disconnect+0x74/0xc8
      22150.635371:   <2>  usb_unbind_interface+0xc8/0x278
      22150.639665:   <2>  device_release_driver_internal+0x198/0x250
      22150.644897:   <2>  device_release_driver+0x24/0x30
      22150.649176:   <2>  bus_remove_device+0xec/0x140
      22150.653204:   <2>  device_del+0x270/0x460
      22150.656712:   <2>  usb_disable_device+0x120/0x390
      22150.660918:   <2>  usb_disconnect+0xf4/0x2e0
      22150.664684:   <2>  hub_event+0xd70/0x17e8
      22150.668197:   <2>  process_one_work+0x210/0x480
      22150.672222:   <2>  worker_thread+0x32c/0x4c8
      
      Fix this by adding a new ICQ_DESTROYED flag in ioc_destroy_icq() to
      indicate that this icq has already been marked as destroyed. Also, ensure
      __ioc_clear_queue() accesses icq within rcu_read_lock/unlock so
      that icq doesn't get freed while it is still in use.
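
The flag-based guard described above can be illustrated with a small userspace sketch. The struct `fake_icq`, the flag value, and `fake_destroy_icq()` are hypothetical stand-ins rather than the kernel's definitions. The sketch only shows how a "destroyed" flag turns a second teardown call into a no-op; it leaves out the locking and RCU parts of the fix.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's ICQ_DESTROYED flag bit. */
#define FAKE_ICQ_DESTROYED (1u << 2)

/* Hypothetical, heavily simplified stand-in for struct io_cq. */
struct fake_icq {
    unsigned int flags;
    int unlink_count;   /* how many times the list unlink actually ran */
};

/*
 * Tear down an icq at most once. Returns true if this call did the
 * teardown, false if the icq had already been marked as destroyed.
 * The real code runs under locks; this sketch is single-threaded.
 */
static bool fake_destroy_icq(struct fake_icq *icq)
{
    if (icq->flags & FAKE_ICQ_DESTROYED)
        return false;               /* second caller: nothing to do */

    icq->unlink_count++;            /* stands in for the list removals */
    icq->flags |= FAKE_ICQ_DESTROYED;
    return true;
}
```

With this guard, the second context in the diagram would return early instead of unlinking memory that is already queued for freeing.
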
      Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
      Co-developed-by: Pradeep P V K <ppvk@codeaurora.org>
      Signed-off-by: Pradeep P V K <ppvk@codeaurora.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      fba123ba
    • M
      block: fix an integer overflow in logical block size · 8b05616d
      Mikulas Patocka committed
      task #28557799
      
      commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream.
      
      Logical block size has type unsigned short. Since block sizes are powers
      of two, that means it can be at most 32768. However, there are
      architectures that can run with 64k pages (for example arm64) and on
      these architectures, it may be possible to create block devices with
      64k block size.
      
      For example (run this on an architecture with 64k pages):
      
      Mount will fail with this error because it tries to read the superblock using 2-sector
      access:
        device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
        EXT4-fs (dm-0): unable to read superblock
      
      This patch changes the logical block size from unsigned short to unsigned
      int to avoid the overflow.
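
The overflow can be reproduced in userspace with fixed-width types. This is a minimal sketch, assuming a 16-bit field for the old type and a 32-bit field for the new one; the helper names are made up for illustration and do not appear in the kernel.

```c
#include <stdint.h>

/* Store a block size in a 16-bit field, as the old unsigned short did. */
static uint16_t store_as_u16(unsigned int block_size)
{
    return (uint16_t)block_size;   /* 65536 does not fit and wraps to 0 */
}

/* Store the same value in a 32-bit field, as after the patch. */
static uint32_t store_as_u32(unsigned int block_size)
{
    return (uint32_t)block_size;
}
```

A 64k block size stored through `store_as_u16()` comes back as 0, while 32768 survives, which matches the limit described above.
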
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      8b05616d
  5. 11 Jun, 2020 1 commit
    • J
      alinux: blk-mq: remove QUEUE_FLAG_POLL from default MQ flags · 294d5fb2
      Joseph Qi committed
      fix #28528017
      
      For a virtio-blk device, /sys/block/<device>/queue/io_poll shows 1
      and the user can't disable it. virtio-blk doesn't actually support
      polling yet, so this confuses end users. The root cause is that mq
      initialization sets the QUEUE_FLAG_POLL bit by default.
      
      This fix takes ideas from the following upstream commits:
      6544d229bf43 ("block: enable polling by default if a poll map is initalized")
      6e0de61107f0 ("blk-mq: remove QUEUE_FLAG_POLL from default MQ flags")
      Since we don't want to pull in the HCTX_TYPE_POLL related logic,
      just check mq_ops->poll before setting QUEUE_FLAG_POLL.
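
The check can be sketched outside the kernel as follows. All types and the flag bit here (`fake_mq_ops`, `fake_queue`, `FAKE_QUEUE_FLAG_POLL`) are hypothetical simplifications, not the block layer's real definitions; the sketch shows only the idea of setting the flag when the driver provides a poll callback.

```c
#include <stddef.h>

#define FAKE_QUEUE_FLAG_POLL (1ul << 0)   /* hypothetical bit position */

/* Hypothetical, reduced version of a driver's operations table. */
struct fake_mq_ops {
    int (*poll)(void *hctx);   /* NULL when the driver cannot poll */
};

struct fake_queue {
    unsigned long flags;
    const struct fake_mq_ops *mq_ops;
};

/* Example poll callback for a driver that does support polling. */
static int fake_poll(void *hctx)
{
    (void)hctx;
    return 0;
}

/* Set the poll flag only if the driver actually provides ->poll. */
static void fake_init_queue_flags(struct fake_queue *q)
{
    q->flags = 0;
    if (q->mq_ops && q->mq_ops->poll)
        q->flags |= FAKE_QUEUE_FLAG_POLL;
}
```

A driver without a poll callback, as virtio-blk is described above, ends up with the flag cleared, so io_poll would no longer report 1 for it.
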
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      294d5fb2
  6. 09 Jun, 2020 5 commits
  7. 05 Jun, 2020 2 commits
  8. 04 Jun, 2020 2 commits
  9. 28 May, 2020 15 commits
    • J
      io_uring: make sure accept honor rlimit nofile · 4520967e
      Jens Axboe committed
      to #26323588
      
      commit 09952e3e7826119ddd4357c453d54bcc7ef25156 upstream.
      
      Just like commit 4022e7af86be, this fixes the fact that
      IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks
      current->signal->rlim[] for limits.
      
      Add an extra argument to __sys_accept4_file() that allows us to pass
      in the proper nofile limit, and grab it at request prep time.
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      4520967e
    • J
      io_uring: make sure openat/openat2 honor rlimit nofile · 4d28e850
      Jens Axboe committed
      to #26323588
      
      commit 4022e7af86be2dd62975dedb6b7ea551d108695e upstream.
      
      Dmitry reports that a test case shows that io_uring isn't honoring a
      modified rlimit nofile setting. get_unused_fd_flags() checks the task
      signal->rlim[] for the limits. As this isn't easily inheritable,
      provide a __get_unused_fd_flags() that takes the value instead. Then we
      can grab it when the request is prepared (from the original task), and
      pass that in when we do the async part of the open.
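
The "grab at prep time, use in the async part" idea can be sketched in userspace. The request struct and both helpers are hypothetical names invented for this illustration; the only real API used is getrlimit(2).

```c
#include <sys/resource.h>

/* Hypothetical request that remembers the submitter's nofile limit. */
struct fake_open_req {
    unsigned long nofile;
};

/* Prep step: runs in the submitting task, so its rlimit is the right one. */
static int fake_open_prep(struct fake_open_req *req)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    req->nofile = (unsigned long)rl.rlim_cur;
    return 0;
}

/*
 * Async step: checks a candidate fd number against the saved limit
 * rather than against the limit of whatever thread runs this code.
 */
static int fake_fd_allowed(const struct fake_open_req *req, unsigned long fd)
{
    return fd < req->nofile;
}
```

Passing the saved value along, instead of re-reading the limit later, is what keeps the check tied to the task that submitted the request.
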
      Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
      Tested-by: Dmitry Kadashev <dkadashev@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      4d28e850
    • J
      eventfd: track eventfd_signal() recursion depth · 0c570e16
      Jens Axboe committed
      to #26323588
      
      commit b5e683d5cab8cd433b06ae178621f083cabd4f63 upstream.
      
      eventfd use cases from aio and io_uring can deadlock due to circular
      or recursive calling, when eventfd_signal() tries to grab the waitqueue
      lock. On top of that, it's also possible to construct notification
      chains that are deep enough that we could blow the stack.
      
      Add a percpu counter that tracks the percpu recursion depth, and warn if
      we exceed it. The counter is also exposed so that users of eventfd_signal()
      can do the right thing if it's non-zero in the context where it is
      called.
      
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0c570e16
    • J
      eventpoll: support non-blocking do_epoll_ctl() calls · 73681652
      Jens Axboe committed
      to #26323588
      
      commit 39220e8d4a2aaab045ea03cc16d737e85d0817bf upstream.
      
      Also make it available outside of epoll, along with the helper that
      decides if we need to copy the passed in epoll_event.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      73681652
    • B
      percpu-refcount: Introduce percpu_ref_resurrect() · 2596a4a6
      Bart Van Assche committed
      to #26323588
      
      commit 18c9a6bbe0645a05172a900740b9d2d379d54320 upstream.
      
      This function will be used in a later patch to switch the struct
      request_queue q_usage_counter from killed back to live. In contrast
      to percpu_ref_reinit(), this new function does not require that the
      refcount is zero.
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2596a4a6
    • P
      pcpu_ref: add percpu_ref_tryget_many() · e65cfad0
      Pavel Begunkov committed
      to #26323588
      
      commit 4e5ef02317b12e2ed3d604281ffb6b75261f7612 upstream.
      
      Add percpu_ref_tryget_many(), which works the same way as
      percpu_ref_tryget(), but grabs the specified number of refs.
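
The relationship between the two helpers can be sketched with C11 atomics. This models only a plain atomic counter where zero means the reference is dead; the per-CPU fast path is left out, and `fake_ref` and both functions are hypothetical names, not the kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified reference counter. */
struct fake_ref {
    atomic_ulong count;   /* 0 means the reference is already dead */
};

/* Take nr references at once, unless the count has already hit zero. */
static bool fake_ref_tryget_many(struct fake_ref *ref, unsigned long nr)
{
    unsigned long old = atomic_load(&ref->count);

    do {
        if (old == 0)
            return false;
    } while (!atomic_compare_exchange_weak(&ref->count, &old, old + nr));
    return true;
}

/* The single-reference variant is the nr == 1 case. */
static bool fake_ref_tryget(struct fake_ref *ref)
{
    return fake_ref_tryget_many(ref, 1);
}
```

Taking several references in one step avoids repeating the liveness check and the atomic update once per reference.
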
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e65cfad0
    • J
      mm: make do_madvise() available internally · 724746b1
      Jens Axboe committed
      to #26323588
      
      commit db08ca25253d56f1f76eb4b3fe32a7ac1fbab741 upstream.
      
      This is in preparation for enabling this functionality through io_uring.
      Add a helper that is just exporting what sys_madvise() does, and have the
      system call use it.
      
      No functional changes in this patch.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      724746b1
    • R
      percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT · 31fafee5
      Roman Gushchin committed
      to #26323588
      
      commit 7d9ab9b6adffd9c474c1274acb5f6208f9a09cf3 upstream.
      
      Release percpu memory after finishing the switch to the atomic mode,
      but only if PERCPU_REF_ALLOW_REINIT isn't set.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      31fafee5
    • R
      percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag · 02f53f5e
      Roman Gushchin committed
      to #26323588
      
      commit 09ed79d6d75f06cc963a78f25463251b0a758dc7 upstream.
      
      In most cases percpu reference counters are not switched to the
      percpu mode after they reach the atomic mode. Some obvious exceptions
      are reference counters which are initialized into the atomic
      mode (using PERCPU_REF_INIT_ATOMIC and PERCPU_REF_INIT_DEAD flags),
      and there are a few other exceptions.
      
      But in most cases there is no way back, and once the reference counter
      is switched to the atomic mode, there is no reason to wait for
      percpu_ref_exit() to release the percpu memory. Of course, the size
      of a single counter is not so big, but because it can pin the whole
      percpu block in memory, the memory footprint can be noticeable
      (e.g. on my 32-CPU machine a percpu block is 8 MB).
      
      To release the percpu memory as early as possible, let's
      introduce the PERCPU_REF_ALLOW_REINIT flag with the following semantics:
      it has to be set in order to switch a percpu reference counter to the
      percpu mode after the initialization. PERCPU_REF_INIT_ATOMIC and
      PERCPU_REF_INIT_DEAD flags will implicitly assume PERCPU_REF_ALLOW_REINIT.
      
      To avoid regressions, this patch doesn't introduce any functional
      change. That will be done later in the patchset, after adjusting
      all call sites that revive percpu counters.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      02f53f5e
    • T
      binder: fix use-after-free due to ksys_close() during fdget() · b0775049
      Todd Kjos committed
      to #26323588
      
      Cherry-pick from commit 80cd795630d6526ba729a089a435bf74a57af927 upstream.
      
      44d8047f1d8 ("binder: use standard functions to allocate fds")
      exposed a pre-existing issue in the binder driver.
      
      fdget() is used in ksys_ioctl() as a performance optimization.
      One of the rules associated with fdget() is that ksys_close() must
      not be called between the fdget() and the fdput(). There is a case
      where this requirement is not met in the binder driver which results
      in the reference count dropping to 0 when the device is still in
      use. This can result in use-after-free or other issues.
      
      If userspace has passed a file-descriptor for the binder driver using
      a BINDER_TYPE_FDA object, then ksys_close() is called on it when
      handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
      the assumptions for using fdget().
      
      The problem is fixed by deferring the close using task_work_add(). A
      new variant of __close_fd() was created that returns a struct file
      with a reference. The fput() is deferred instead of using ksys_close().
      
      Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds")
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Todd Kjos <tkjos@google.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b0775049
    • A
      open: introduce openat2(2) syscall · 5b9369e5
      Aleksa Sarai committed
      to #26323588
      
      commit fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179 upstream.
      
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
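
A userspace call could look roughly like the sketch below. It mirrors the three u64 fields in a local struct (`sketch_open_how`) so that it does not depend on the new uapi header being installed; the numeric value of `SKETCH_RESOLVE_BENEATH` and the helper name are assumptions made for illustration and are not taken from this commit message. On kernels or C libraries without openat2 the helper returns -ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the three u64 fields described above. */
struct sketch_open_how {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/* Assumed value of RESOLVE_BENEATH; check the uapi header before relying on it. */
#define SKETCH_RESOLVE_BENEATH 0x08

/*
 * Open `path` read-only relative to `dirfd`, refusing any resolution
 * that would escape `dirfd`. Returns an fd on success or -errno on
 * failure (for example -ENOSYS without openat2, -EXDEV on an escape).
 */
static int sketch_open_beneath(int dirfd, const char *path)
{
#ifdef SYS_openat2
    struct sketch_open_how how = {
        .flags = O_RDONLY | O_CLOEXEC,
        .mode = 0,                      /* must be zero without O_CREAT */
        .resolve = SKETCH_RESOLVE_BENEATH,
    };
    long fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));

    return fd < 0 ? -errno : (int)fd;
#else
    (void)dirfd;
    (void)path;
    return -ENOSYS;
#endif
}
```

Passing `sizeof(how)` as the last argument is what lets the kernel tell which version of the struct the caller was built against.
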
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVs
      
      Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5b9369e5
    • A
      lib: introduce copy_struct_from_user() helper · 915ddf4c
      Aleksa Sarai committed
      to #26323588
      
      commit f5a1a536fa14895ccff4e94e6a5af90901ce86aa upstream.
      
      A common pattern for syscall extensions is increasing the size of a
      struct passed from userspace, such that the zero-value of the new fields
      results in the old kernel behaviour (allowing for a mix of userspace and
      kernel vintages to operate on one another in most cases).
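
The extension-checking pattern can be sketched as a plain memory copy in userspace. This is an illustration in the spirit of copy_struct_from_user(), not its implementation: `sketch_copy_struct()` is a hypothetical name, it works on ordinary pointers instead of user memory, and the -E2BIG return for non-zero trailing bytes is an assumption about the intended semantics.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a caller-supplied struct of `usize` bytes into a destination of
 * `ksize` bytes:
 *   - usize < ksize: copy what was given and zero-fill the rest, so an
 *     older caller gets the default (zero) behaviour for new fields;
 *   - usize > ksize: accept only if the extra bytes are all zero,
 *     otherwise return -E2BIG because the caller is asking for an
 *     extension this side does not know about.
 */
static int sketch_copy_struct(void *dst, size_t ksize,
                              const void *src, size_t usize)
{
    size_t size = usize < ksize ? usize : ksize;
    const unsigned char *tail = (const unsigned char *)src + size;

    /* Loop body only runs when usize > ksize. */
    for (size_t i = 0; i < usize - size; i++)
        if (tail[i] != 0)
            return -E2BIG;

    memcpy(dst, src, size);
    memset((unsigned char *)dst + size, 0, ksize - size);
    return 0;
}
```

Both directions of version skew are covered: a smaller struct from an older caller is padded with zeros, and a larger struct from a newer caller is accepted as long as it only uses fields the receiver understands.
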
      
      While this interface exists for communication in both directions, only
      one direction has straightforward, reasonable semantics
      (userspace passing a struct to the kernel). For kernel returns to
      userspace, what the correct semantics are (whether there should be an
      error if userspace is unaware of a new extension) is very
      syscall-dependent and thus probably cannot be unified between syscalls
      (a good example of this problem is [1]).
      
      Previously there was no common lib/ function that implemented
      the necessary extension-checking semantics (and different syscalls
      implemented them slightly differently or incompletely[2]). Future
      patches replace common uses of this pattern to make use of
      copy_struct_from_user().
      
      Some in-kernel selftests are included to ensure that alignment and
      various byte patterns are all handled identically to memchr_inv() usage.
      
      [1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
           robustify sched_read_attr() ABI logic and code")
      
      [2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
           similar checks to copy_struct_from_user() while rt_sigprocmask(2)
           always rejects differently-sized struct arguments.
      Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/20191001011055.19283-2-cyphar@cyphar.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      915ddf4c
    • A
      namei: LOOKUP_IN_ROOT: chroot-like scoped resolution · 685c60ef
      Aleksa Sarai committed
      to #26323588
      
      commit 8db52c7e7ee1bd861b6096fcafc0fe7d0f24a994 upstream.
      
      /* Background. */
      Container runtimes or other administrative management processes will
      often interact with root filesystems while in the host mount namespace,
      because the cost of doing a chroot(2) on every operation is too
      prohibitive (especially in Go, which cannot safely use vfork). However,
      a malicious program can trick the management process into doing
      operations on files outside of the root filesystem through careful
      crafting of symlinks.
      
      Most programs that need this feature have attempted to make this process
      safe, by doing all of the path resolution in userspace (with symlinks
      being scoped to the root of the malicious root filesystem).
      Unfortunately, this method is prone to foot-guns and usually such
      implementations have subtle security bugs.
      
      Thus, what userspace needs is a way to resolve a path as though it were
      in a chroot(2) -- with all absolute symlinks being resolved relative to
      the dirfd root (and ".." components being stuck under the dirfd root).
      It is much simpler and more straight-forward to provide this
      functionality in-kernel (because it can be done far more cheaply and
      correctly).
      
      More classical applications that also have this problem (which have
      their own potentially buggy userspace path sanitisation code) include
      web servers, archive extraction tools, network file servers, and so on.
      
      /* Userspace API. */
      LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_IN_ROOT applies to all components of the path.
      
      With LOOKUP_IN_ROOT, any path component which attempts to cross the
      starting point of the pathname lookup (the dirfd passed to openat) will
      remain at the starting point. Thus, all absolute paths and symlinks will
      be scoped within the starting point.
      
      There is a slight change in behaviour regarding pathnames -- if the
      pathname is absolute then the dirfd is still used as the root of
      resolution if LOOKUP_IN_ROOT is specified (this is to avoid obvious
      foot-guns, at the cost of a minor API inconsistency).
      
      As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to
      LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction
      will be lifted in a future patch, but requires more work to ensure that
      permitting ".." is done safely.
      
      Magic-link jumps are also blocked, because they can beam the path lookup
      across the starting point. It would be possible to detect and block
      only the "bad" crossings with path_is_under() checks, but it's unclear
      whether it makes sense to permit magic-links at all. However, userspace
      is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
      magic-link crossing is entirely disabled.
      
      /* Testing. */
      LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.
      
      [1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      685c60ef
    • A
      namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution · 31084d4b
      Aleksa Sarai committed
      to #26323588
      
      commit adb21d2b526f7f196b2f3fdca97d80ba05dd14a0 upstream.
      
      /* Background. */
      There are many circumstances when userspace wants to resolve a path and
      ensure that it doesn't go outside of a particular root directory during
      resolution. Obvious examples include archive extraction tools, as well as
      other security-conscious userspace programs. FreeBSD spun out O_BENEATH
      from their Capsicum project[1,2], so it also seems reasonable to
      implement similar functionality for Linux.
      
      This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a
      variation on David Drysdale's O_BENEATH patchset[4], which in turn was
      based on the Capsicum project[5]).
      
      /* Userspace API. */
      LOOKUP_BENEATH will be exposed to userspace through openat2(2).
      
      /* Semantics. */
      Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
      LOOKUP_BENEATH applies to all components of the path.
      
      With LOOKUP_BENEATH, any path component which attempts to "escape" the
      starting point of the filesystem lookup (the dirfd passed to openat)
      will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed.
      
      Due to a security concern brought up by Jann[6], any ".." path
      components are also blocked. This restriction will be lifted in a future
      patch, but requires more work to ensure that permitting ".." is done
      safely.
      
      Magic-link jumps are also blocked, because they can beam the path lookup
      across the starting point. It would be possible to detect and block
      only the "bad" crossings with path_is_under() checks, but it's unclear
      whether it makes sense to permit magic-links at all. However, userspace
      is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that
      magic-link crossing is entirely disabled.
      
      /* Testing. */
      LOOKUP_BENEATH is tested as part of the openat2(2) selftests.
      
      [1]: https://reviews.freebsd.org/D2808
      [2]: https://reviews.freebsd.org/D17547
      [3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
      [4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
      [5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
      [6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
      
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Suggested-by: David Drysdale <drysdale@google.com>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      31084d4b
    • A
      fs/namei.c: keep track of nd->root refcount status · 0a73d163
      Al Viro committed
      to #26323588
      
      commit 84a2bd39405ffd5fa6d6d77e408c5b9210da98de upstream.
      
      The rules for nd->root are messy:
      	* if we have LOOKUP_ROOT, it doesn't contribute to refcounts
      	* if we have LOOKUP_RCU, it doesn't contribute to refcounts
      	* if nd->root.mnt is NULL, it doesn't contribute to refcounts
      	* otherwise it does contribute
      
      terminate_walk() needs to drop the references if they are contributing.
      So everything else should be careful not to confuse it, leading to
      rather convoluted code.
      
      It's easier to keep track of whether we'd grabbed the reference(s)
      explicitly.  Use a new flag for that.  Don't bother with zeroing
      nd->root.mnt on unlazy failures and in terminate_walk - it's not
      needed anymore (terminate_walk() won't care and the next path_init()
      will zero nd->root in !LOOKUP_ROOT case anyway).
      
      Resulting rules for nd->root refcounts are much simpler: they are
      contributing iff LOOKUP_ROOT_GRABBED is set in nd->flags.
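
The simplified rule can be sketched as an explicit ownership flag. The structs, the flag bit, and both helpers below are hypothetical reductions made for illustration; they are not the namei code, only the pattern of recording whether a reference was taken and dropping it only in that case.

```c
#define FAKE_LOOKUP_ROOT_GRABBED (1u << 0)   /* hypothetical bit position */

/* Hypothetical stand-in for the refcounted root. */
struct fake_root {
    int refcount;
};

/* Hypothetical, reduced version of the walk state. */
struct fake_nameidata {
    unsigned int flags;
    struct fake_root *root;
};

/* Take a reference on the root and record that this walk owns it. */
static void fake_grab_root(struct fake_nameidata *nd)
{
    nd->root->refcount++;
    nd->flags |= FAKE_LOOKUP_ROOT_GRABBED;
}

/* Drop the root reference only if this walk actually took one. */
static void fake_terminate_walk(struct fake_nameidata *nd)
{
    if (nd->flags & FAKE_LOOKUP_ROOT_GRABBED) {
        nd->root->refcount--;
        nd->flags &= ~FAKE_LOOKUP_ROOT_GRABBED;
    }
}
```

With the flag as the single source of truth, the teardown path no longer has to infer ownership from several unrelated conditions.
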
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0a73d163