- 28 May 2020, 40 commits
-
-
Submitted by Pavel Begunkov
to #26323588 commit 2550878f8421f7912fdd56b38c630b797f95c749 upstream. io_wq workers use io_issue_sqe() to forward sqes and never io_queue_sqe(). Remove extra check for io_wq_current_is_worker(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Pavel Begunkov
to #26323588 commit caf582c652feccd42c50923f0467c4f2dcef279e upstream. It should be pretty rare not to submit anything when there is something in the ring. No need to keep heuristics for this case. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Pavel Begunkov
to #26323588 commit ee7d46d9db19ded7b7222af95add63606318a480 upstream. A user may ask to submit more than there is in the ring, and then io_uring will submit as much as it can. However, in the last iteration it will allocate an io_kiocb and immediately free it. It could do better and adjust @to_submit to what is in the ring. And since the ring's head is already checked here, there is no need to do it in the loop, spamming with smp_load_acquire()'s barriers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Pavel Begunkov
to #26323588 commit 9ef4f124894b7b9241a3cf5f9b40db0812783d66 upstream. Make io_submit_sqes() clamp @to_submit itself. It removes duplicated code and prepares for following changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe
to #26323588 commit 8110c1a6212e430a84edd2b83fe9043def8b743e upstream. Some applications like to start small in terms of ring size, and then ramp up as needed. This is a bit tricky to do currently, since we don't advertise the max ring size. This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring size exceed what we support, then clamp them at the max values instead of returning -EINVAL. Since we return the chosen ring sizes after setup, no further changes are needed on the application side. io_uring already changes the ring sizes if the application doesn't ask for power-of-two sizes, for example. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
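A minimal userspace sketch of how an application could use this flag, assuming headers new enough to define IORING_SETUP_CLAMP and __NR_io_uring_setup (the raw syscall is shown; liburing works equally well):

    #include <linux/io_uring.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Ask for an oversized SQ; with IORING_SETUP_CLAMP the kernel clamps the
     * request to its maximum instead of failing with -EINVAL. The sizes that
     * were actually chosen come back in params.sq_entries/params.cq_entries. */
    int setup_clamped_ring(void)
    {
        struct io_uring_params params;

        memset(&params, 0, sizeof(params));
        params.flags = IORING_SETUP_CLAMP;

        return syscall(__NR_io_uring_setup, 1U << 20, &params);
    }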
-
Submitted by Jens Axboe
to #26323588 commit c6ca97b30c47c7ad36107d3764bb4dc37026d171 upstream. Currently we only batch free if fixed files are used, no links, no aux data, etc. This extends the batch freeing to only exclude the linked case and fallback case, and make io_free_req_many() handle the other cases just fine. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe
to #26323588 commit 8237e045983d82ba78eaab5f60b9300927fc6796 upstream. This cleans up the code a bit, and it allows us to build on top of the multi-req freeing. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Pavel Begunkov
to #26323588 commit 2b85edfc0c90efc68dea3d665bb4111bf0694e05 upstream. percpu_ref_tryget() has its own overhead. Instead of getting a reference for each request, grab a bunch once per io_submit_sqes(). ~5% throughput boost for a "submit and wait 128 nops" benchmark. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> __io_req_free_empty() -> __io_req_do_free() Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Pavel Begunkov
to #26323588 commit 4e5ef02317b12e2ed3d604281ffb6b75261f7612 upstream. Add percpu_ref_tryget_many(), which works the same way as percpu_ref_tryget(), but grabs a specified number of refs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
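A hedged kernel-side sketch of the batching pattern this enables; the helper names here are illustrative, not taken from the patch:

    #include <linux/percpu-refcount.h>

    /* Take nr refs in one go before processing a batch of requests. */
    static bool begin_batch(struct percpu_ref *refs, unsigned long nr)
    {
        return percpu_ref_tryget_many(refs, nr);
    }

    /* Return whatever part of the batch ended up unused. */
    static void end_batch(struct percpu_ref *refs, unsigned long unused)
    {
        if (unused)
            percpu_ref_put_many(refs, unused);
    }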
-
Submitted by Jens Axboe
to #26323588 commit c1ca757bd6f4632c510714631ddcc2d13030fe1e upstream. This adds support for doing madvise(2) through io_uring. We assume that any operation can block, and hence punt everything async. This could be improved, but hard to make bullet proof. The async punt ensures it's safe. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
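A rough userspace sketch, assuming liburing's io_uring_prep_madvise() helper and a kernel that supports IORING_OP_MADVISE:

    #include <errno.h>
    #include <liburing.h>
    #include <sys/mman.h>

    /* Queue an asynchronous MADV_DONTNEED for [addr, addr + len). */
    static int queue_madvise(struct io_uring *ring, void *addr, size_t len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -EBUSY;
        io_uring_prep_madvise(sqe, addr, len, MADV_DONTNEED);
        return io_uring_submit(ring);
    }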
-
Submitted by Jens Axboe
to #26323588 commit db08ca25253d56f1f76eb4b3fe32a7ac1fbab741 upstream. This is in preparation for enabling this functionality through io_uring. Add a helper that is just exporting what sys_madvise() does, and have the system call use it. No functional changes in this patch. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe
to #26323588 commit 4840e418c2fc533d55ff6caa5b9313eed1d26cfd upstream. This adds support for doing fadvise through io_uring. We assume that WILLNEED doesn't block, but that DONTNEED may block. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
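A small companion sketch for the fadvise side, again assuming liburing's prep helper:

    #include <fcntl.h>
    #include <liburing.h>

    /* Hint that len bytes at off will be needed soon. WILLNEED is expected to
     * complete inline; DONTNEED may be punted to async context. */
    static void prep_willneed(struct io_uring_sqe *sqe, int fd, __u64 off, off_t len)
    {
        io_uring_prep_fadvise(sqe, fd, off, len, POSIX_FADV_WILLNEED);
    }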
-
Submitted by Jens Axboe
to #26323588 commit ba04291eb66ed895f194ae5abd3748d72bf8aaea upstream. This behaves like preadv2/pwritev2 with offset == -1, it'll use (and update) the current file position. This obviously comes with the caveat that if the application has multiple read/writes in flight, then the end result will not be as expected. This is similar to threads sharing a file descriptor and doing IO using the current file position. Since this feature isn't easily detectable by doing a read or write, add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to detect presence of this feature. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
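A hedged sketch of detecting the feature and switching a request to the current file position; the field and flag names come from the uapi header, the helper names are illustrative:

    #include <liburing.h>
    #include <stdbool.h>

    /* True if the kernel advertises current-position reads/writes. */
    static bool have_cur_pos(const struct io_uring_params *p)
    {
        return p->features & IORING_FEAT_RW_CUR_POS;
    }

    /* Offset (__u64)-1 means "use and update the file position", like
     * preadv2/pwritev2 with offset == -1. */
    static void use_current_position(struct io_uring_sqe *sqe)
    {
        sqe->off = (__u64)-1;
    }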
-
Submitted by Jens Axboe
to #26323588 commit 3a6820f2bb8a079975109c25a5d1f29f46bce5d2 upstream. For use cases that don't already naturally have an iovec, it's easier (or more convenient) to just use a buffer address + length. This is particularly true if the use case is from languages that want to create a memory safe abstraction on top of io_uring, and where introducing the need for the iovec may impose an ownership issue. For those cases, they currently need an indirection buffer, which means allocating data just for this purpose. Add basic read/write that don't require the iovec. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
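A minimal sketch using liburing's non-vectored helper (io_uring_prep_read() wraps the new IORING_OP_READ):

    #include <errno.h>
    #include <liburing.h>

    /* Read len bytes from fd at offset into buf without building an iovec. */
    static int read_buf(struct io_uring *ring, int fd, void *buf,
                        unsigned len, __u64 offset)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int ret;

        if (!sqe)
            return -EBUSY;
        io_uring_prep_read(sqe, fd, buf, len, offset);
        io_uring_submit(ring);

        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        ret = cqe->res;                 /* bytes read, or -errno */
        io_uring_cqe_seen(ring, cqe);
        return ret;
    }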
-
Submitted by Jens Axboe
to #26323588 commit e94f141bd248ebdadcb7351f1e70b31cee5add53 upstream. For busy IORING_OP_POLL_ADD workloads, we can have enough contention on the completion lock that we fail the inline completion path quite often as we fail the trylock on that lock. Add a list for deferred completions that we can use in that case. This helps reduce the number of async offloads we have to do, as if we get multiple completions in a row, we'll piggyback on to the poll_llist instead of having to queue our own offload. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe
to #26323588 commit ad3eb2c89fb24d14ac81f43eff8e85fece2c934d upstream. We currently check ->cq_overflow_list from both SQ and CQ context, which causes some bouncing of that cache line. Add separate bits of state for this instead, so that the SQ side can check using its own state, and likewise for the CQ side. This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow with the CQ state. If we hit an overflow condition, both of these bits are set. Likewise for overflow flush clear, we clear both bits. For the fast path of just checking if there's an overflow condition on either the SQ or CQ side, we can use our own private bit for this. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe
to #26323588 commit d3656344fea0339fb0365c8df4d2beba4e0089cd upstream. We currently have various switch statements that check if an opcode needs a file, mm, etc. These are hard to keep in sync as opcodes are added. Add a struct io_op_def that holds all of this information, so we have just one spot to update when opcodes are added. This also enables us to NOT allocate req->io if a deferred command doesn't need it, and corrects some mistakes we had in terms of what commands need mm context. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
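An abbreviated, illustrative sketch of such a per-opcode table; the field set is trimmed here, and the real struct in the commit carries more bits:

    #include <linux/io_uring.h>

    struct io_op_def {
        unsigned async_ctx : 1;    /* needs req->io allocated for deferral */
        unsigned needs_mm : 1;     /* needs current->mm set up */
        unsigned needs_file : 1;   /* needs req->file assigned */
    };

    static const struct io_op_def io_op_defs[] = {
        [IORING_OP_NOP]   = { .needs_file = 0 },
        [IORING_OP_READV] = { .async_ctx = 1, .needs_mm = 1, .needs_file = 1 },
    };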
-
Submitted by Jens Axboe
to #26323588 commit add7b6b85a4dfa89283834d181e87ea2144b9028 upstream. __io_free_req() and io_double_put_req() aren't used before they are defined, so we can kill these two forward declarations. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Pavel Begunkov
to #26323588 commit 32fe525b6d10fec956cfe68f0db76839cd7f0ea5 upstream. Move io_queue_link_head() to links handling code in io_submit_sqe(), so it wouldn't need extra checks and would have better data locality. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Pavel Begunkov
to #26323588 commit 9d76377f7e13c19441fdd066033345289f89b5fe upstream. Calling "prev" a head of a link is a bit misleading. Rename it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe
to #26323588 commit ce35a47a3a0208a77b4d31b7f2e8ed57d624093d upstream. io_uring defaults to always doing inline submissions, if at all possible. But for larger copies, even if the data is fully cached, that can take a long time. Add an IOSQE_ASYNC flag that the application can set on the SQE - if set, it'll ensure that we always go async for those kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we get the concurrency we desire for this case. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
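A small sketch of opting one request into forced-async submission; setting the flag directly on the SQE is all that is needed (shown with liburing's non-vectored read helper):

    #include <liburing.h>

    /* Force this read through io-wq instead of attempting it inline. */
    static void prep_async_read(struct io_uring_sqe *sqe, int fd,
                                void *buf, unsigned len, __u64 off)
    {
        io_uring_prep_read(sqe, fd, buf, len, off);
        sqe->flags |= IOSQE_ASYNC;
    }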
-
Submitted by Jens Axboe
to #26323588 commit 895e2ca0f693c672902191747b548bdc56f0c7de upstream. io-wq assumes that work will complete fast (and not block), so it doesn't create a new worker when work is enqueued, if we already have at least one worker running. This is done on the assumption that if work is running, then it will complete fast. Add an option to force io-wq to fork a new worker for work queued. This is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that case, io-wq will create a new worker, even though workers are already running. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe
to #26323588 commit eddc7ef52a6b37b7ba3d1c8a8fbb63d5d9914f8a upstream. This provides support for async statx(2) through io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
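A rough userspace sketch, assuming liburing's io_uring_prep_statx() helper and a glibc new enough to expose struct statx via <sys/stat.h>:

    #include <errno.h>
    #include <fcntl.h>
    #include <liburing.h>
    #include <sys/stat.h>

    /* Queue an async statx(2); *stx is filled in by the time the CQE arrives. */
    static int queue_statx(struct io_uring *ring, const char *path,
                           struct statx *stx)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -EBUSY;
        io_uring_prep_statx(sqe, AT_FDCWD, path, 0, STATX_BASIC_STATS, stx);
        return io_uring_submit(ring);
    }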
-
Submitted by Jens Axboe
to #26323588 commit 3934e36f6099e6277db33f433fe135c6644e8ac2 upstream. To implement an async stat, we need to provide the flags mapping and the statx user copy. Make them available internally, through fs/internal.h. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe
to #26323588 We currently fully quiesce the ring before an unregister or update of the fixed fileset. This is very expensive, and we can be a bit smarter about this. Add a percpu refcount for the file tables as a whole. Grab a percpu ref when we use a registered file, and put it on completion. This is cheap to do. Upon removal of a file from a set, switch the ref count to atomic mode. When we hit zero ref on the completion side, then we know we can drop the previously registered files. When the old files have been dropped, switch the ref back to percpu mode for normal operation. Since there's a period between doing the update and the kernel being done with it, add an IORING_OP_FILES_UPDATE opcode that can perform the same action. The application knows the update has completed when it gets the CQE for it. Between doing the update and receiving this completion, the application must continue to use the unregistered fd if submitting IO on this particular file. This takes the runtime of test/file-register from liburing from 14s to about 0.7s. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
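A hedged userspace sketch of swapping one slot in a registered file table, assuming liburing's io_uring_register_files_update() wrapper (the same update can also be queued as an IORING_OP_FILES_UPDATE sqe, whose CQE signals completion):

    #include <liburing.h>

    /* Replace registered-file slot 'slot' with new_fd without quiescing the ring. */
    static int swap_registered_file(struct io_uring *ring, unsigned slot, int new_fd)
    {
        int fds[1] = { new_fd };

        return io_uring_register_files_update(ring, slot, fds, 1);
    }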
-
Submitted by Roman Gushchin
to #26323588 commit 214828962dead0c698f92b60ef97ce3c5fc2c8fe upstream. Percpu reference counters should now be initialized with the PERCPU_REF_ALLOW_REINIT flag in order to allow switching them to the percpu mode from the atomic mode. This is exactly what percpu_ref_reinit() called from __io_uring_register() is supposed to do. So let's initialize percpu refcounters with the PERCPU_REF_ALLOW_REINIT flag. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Roman Gushchin
to #26323588 commit 7d9ab9b6adffd9c474c1274acb5f6208f9a09cf3 upstream. Release percpu memory after finishing the switch to the atomic mode, but only if PERCPU_REF_ALLOW_REINIT isn't set. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Roman Gushchin
to #26323588 commit 09ed79d6d75f06cc963a78f25463251b0a758dc7 upstream. In most cases percpu reference counters are not switched to the percpu mode after they reach the atomic mode. Some obvious exceptions are reference counters which are initialized into the atomic mode (using PERCPU_REF_INIT_ATOMIC and PERCPU_REF_INIT_DEAD flags), and there are a few other exceptions. But in most cases there is no way back, and once the reference counter is switched to the atomic mode, there is no reason to wait for percpu_ref_exit() to release the percpu memory. Of course, the size of a single counter is not so big, but because it can pin the whole percpu block in memory, the memory footprint can be noticeable (e.g. on my 32-CPU machine a percpu block is 8 MB large). To make releasing of the percpu memory as early as possible, let's introduce the PERCPU_REF_ALLOW_REINIT flag with the following semantics: it has to be set in order to switch a percpu reference counter to the percpu mode after the initialization. PERCPU_REF_INIT_ATOMIC and PERCPU_REF_INIT_DEAD flags will implicitly assume PERCPU_REF_ALLOW_REINIT. This patch doesn't introduce any functional change to avoid any regressions. It will be done later in the patchset after adjusting all call sites, which are reviving percpu counters. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
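A minimal kernel-side sketch of initializing a ref that will later be revived with percpu_ref_reinit(); the release callback is illustrative:

    #include <linux/percpu-refcount.h>

    static void my_ref_release(struct percpu_ref *ref)
    {
        /* last reference dropped; free the owning object here */
    }

    static int init_reinitable_ref(struct percpu_ref *ref)
    {
        /* Without PERCPU_REF_ALLOW_REINIT the percpu counters may be freed as
         * soon as the ref switches to atomic mode, and a later
         * percpu_ref_reinit() would not be possible. */
        return percpu_ref_init(ref, my_ref_release, PERCPU_REF_ALLOW_REINIT,
                               GFP_KERNEL);
    }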
-
Submitted by Jens Axboe
to #26323588 commit b5dba59e0cf7e2cc4d3b3b1ac5fe81ddf21959eb upstream. This works just like close(2), unsurprisingly. We remove the file descriptor and post the completion inline, then offload the actual (potential) last file put to async context. Mark the async part of this work as uncancellable, as we really must guarantee that the latter part of the close is run. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
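A small userspace sketch, assuming liburing's io_uring_prep_close() helper:

    #include <errno.h>
    #include <liburing.h>

    /* Queue an async close(2); the fd slot is released right away and any
     * final fput may be finished from async context by the kernel. */
    static int queue_close(struct io_uring *ring, int fd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -EBUSY;
        io_uring_prep_close(sqe, fd);
        return io_uring_submit(ring);
    }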
-
Submitted by Todd Kjos
to #26323588 Cherry-pick from commit 80cd795630d6526ba729a089a435bf74a57af927 upstream. 44d8047f1d8 ("binder: use standard functions to allocate fds") exposed a pre-existing issue in the binder driver. fdget() is used in ksys_ioctl() as a performance optimization. One of the rules associated with fdget() is that ksys_close() must not be called between the fdget() and the fdput(). There is a case where this requirement is not met in the binder driver which results in the reference count dropping to 0 when the device is still in use. This can result in use-after-free or other issues. If userspace has passed a file descriptor for the binder driver using a BINDER_TYPE_FDA object, then ksys_close() is called on it when handling a binder_ioctl(BC_FREE_BUFFER) command. This violates the assumptions for using fdget(). The problem is fixed by deferring the close using task_work_add(). A new variant of __close_fd() was created that returns a struct file with a reference. The fput() is deferred instead of using ksys_close(). Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds") Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Todd Kjos <tkjos@google.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe
to #26323588 commit 0c9d5ccd26a004f59333c06fbbb98f9cb1eed93d upstream. Not all work can be cancelled; for some of it we may need to guarantee that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL on work that must not be cancelled. Note that the caller work function must also check for IO_WQ_WORK_NO_CANCEL on work that is marked IO_WQ_WORK_CANCEL. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe
to #26323588 commit 15b71abe7b52df214785dde0de9f581cc0216d17 upstream. This works just like openat(2), except it can be performed async. For the normal case of a non-blocking path lookup this will complete inline. If we have to do IO to perform the open, it'll be done from async context. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
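A rough userspace sketch, assuming liburing's io_uring_prep_openat() helper; cqe->res carries the new fd or a negative errno:

    #include <errno.h>
    #include <fcntl.h>
    #include <liburing.h>

    /* Queue an async openat(2) relative to the current working directory. */
    static int queue_openat(struct io_uring *ring, const char *path)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
            return -EBUSY;
        io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
        return io_uring_submit(ring);
    }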
-
Submitted by Jens Axboe
to #26323588 commit 35cb6d54c1d5daf1d1ed585ef5ce4557e7ab284c upstream. This is a prep patch for supporting non-blocking open from io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Aleksa Sarai
to #26323588 commit b28a10aedcd4d175470171a32f4f20b0a60a612b upstream. Test all of the various openat2(2) flags. A small stress-test of a symlink-rename attack is included to show that the protections against ".."-based attacks are sufficient. The main things these self-tests are enforcing are:
* The struct+usize ABI for openat2(2) and copy_struct_from_user() to ensure that upgrades will be handled gracefully (in addition, ensuring that misaligned structures are also handled correctly).
* The -EINVAL checks for openat2(2) are all correctly handled to avoid userspace passing unknown or conflicting flag sets (most importantly, ensuring that invalid flag combinations are checked).
* All of the RESOLVE_* semantics (including errno values) are correctly handled with various combinations of paths and flags.
* RESOLVE_IN_ROOT correctly protects against the symlink rename(2) attack that has been responsible for several CVEs (and likely will be responsible for several more).
Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Aleksa Sarai
to #26323588 commit fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179 upstream.
/* Background. */ For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof. In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_* flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument. Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2).
/* Syscall Prototype. */
/*
 * open_how is an extensible structure (similar in interface to
 * clone3(2) or sched_setattr(2)). The size parameter must be set to
 * sizeof(struct open_how), to allow for future extensions. All future
 * extensions will be appended to open_how, with their zero value
 * acting as a no-op default.
 */
struct open_how { /* ... */ };
int openat2(int dfd, const char *pathname, struct open_how *how, size_t size);
/* Description. */ The initial version of 'struct open_how' contains the following fields:
flags: Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR) will result in -EINVAL. In addition, this field is 64-bits wide to allow for more O_ flags than currently permitted with openat(2).
mode: The file mode for O_CREAT or O_TMPFILE. Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
resolve: Restrict path resolution (in contrast to O_* flags they affect all path components). The current set of flags are as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag):
RESOLVE_NO_XDEV => LOOKUP_NO_XDEV
RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS
RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
RESOLVE_BENEATH => LOOKUP_BENEATH
RESOLVE_IN_ROOT => LOOKUP_IN_ROOT
open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it). It also only contains u64s (even though ->mode arguably should be a u16) to avoid having padding fields which are never used in the future. Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE is no longer permitted for openat(2). As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2). After input from Florian Weimer, the new open_how and flag definitions are inside a separate header from uapi/linux/fcntl.h, to avoid problems that glibc has with importing that header.
/* Testing. */ In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios. In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace).
/* Future Work. */ Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution). Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened. Another possible avenue of future work would be some kind of CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace which openat2(2) flags and fields are supported by the current kernel (to avoid userspace having to go through several guesses to figure it out).
[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014b ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
[6]: https://youtu.be/ggD-eb3yPVs
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
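A minimal userspace sketch of calling the new syscall; it assumes headers that provide <linux/openat2.h> and __NR_openat2, and uses the raw syscall since glibc has no wrapper:

    #include <fcntl.h>
    #include <linux/openat2.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Open path strictly inside the tree rooted at root_fd: absolute paths,
     * "..", and symlinks are all resolved as if root_fd were the root. */
    static int open_in_root(int root_fd, const char *path)
    {
        struct open_how how;

        memset(&how, 0, sizeof(how));
        how.flags = O_RDONLY;
        how.resolve = RESOLVE_IN_ROOT;

        return syscall(__NR_openat2, root_fd, path, &how, sizeof(how));
    }

Unknown bits in how.flags or how.resolve, or trailing non-zero bytes in an oversized struct, fail with -EINVAL or -E2BIG rather than being silently ignored.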
-
Submitted by Aleksa Sarai
to #26323588 commit f5a1a536fa14895ccff4e94e6a5af90901ce86aa upstream. A common pattern for syscall extensions is increasing the size of a struct passed from userspace, such that the zero-value of the new fields result in the old kernel behaviour (allowing for a mix of userspace and kernel vintages to operate on one another in most cases). While this interface exists for communication in both directions, only one interface is straightforward to have reasonable semantics for (userspace passing a struct to the kernel). For kernel returns to userspace, what the correct semantics are (whether there should be an error if userspace is unaware of a new extension) is very syscall-dependent and thus probably cannot be unified between syscalls (a good example of this problem is [1]). Previously there was no common lib/ function that implemented the necessary extension-checking semantics (and different syscalls implemented them slightly differently or incompletely[2]). Future patches replace common uses of this pattern to make use of copy_struct_from_user(). In-kernel selftests ensure that the handling of alignment and various byte patterns is identical to memchr_inv() usage. [1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code") [2]: For instance {sched_setattr,perf_event_open,clone3}(2) all do similar checks to copy_struct_from_user() while rt_sigprocmask(2) always rejects differently-sized struct arguments. Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20191001011055.19283-2-cyphar@cyphar.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
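A kernel-side sketch of the intended call pattern; struct my_args and parse_args() are illustrative placeholders, while copy_struct_from_user() is the real helper:

    #include <linux/types.h>
    #include <linux/uaccess.h>

    struct my_args {
        __u64 flags;
        __u64 value;
        /* new fields are appended; a zero value must mean "old behaviour" */
    };

    static int parse_args(struct my_args *kargs,
                          const void __user *uargs, size_t usize)
    {
        /* Copies min(sizeof(*kargs), usize) bytes, zero-fills the rest, and
         * returns -E2BIG if the userspace struct has trailing non-zero bytes
         * the kernel does not know about. */
        return copy_struct_from_user(kargs, sizeof(*kargs), uargs, usize);
    }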
-
Submitted by Aleksa Sarai
to #26323588 commit ab87f9a56c8ee9fa6856cb13d8f2905db913baae upstream. Allow LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit ".." resolution (in the case of LOOKUP_BENEATH the resolution will still fail if ".." resolution would resolve a path outside of the root -- while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps are still disallowed entirely[*]. As Jann explains[1,2], the need for this patch (and the original no-".." restriction) is explained by observing there is a fairly easy-to-exploit race condition with chroot(2) (and thus by extension LOOKUP_IN_ROOT and LOOKUP_BENEATH if ".." is allowed) where a rename(2) of a path can be used to "skip over" nd->root and thus escape to the filesystem above nd->root.
thread1 [attacker]:
for (;;) renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
thread2 [victim]:
for (;;) openat2(dirb, "b/c/../../etc/shadow", { .flags = O_PATH, .resolve = RESOLVE_IN_ROOT } );
With fairly significant regularity, thread2 will resolve to "/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar (though somewhat more privileged) attack using MS_MOVE. With this patch, such cases will be detected *during* ".." resolution and will return -EAGAIN for userspace to decide to either retry or abort the lookup. It should be noted that ".." is the weak point of chroot(2) -- walking *into* a subdirectory tautologically cannot result in you walking *outside* nd->root (except through a bind-mount or magic-link). There is also no other way for a directory's parent to change (which is the primary worry with ".." resolution here) other than a rename or MS_MOVE. The primary reason for deferring to userspace with -EAGAIN is that an in-kernel retry loop (or doing a path_is_under() check after re-taking the relevant seqlocks) can become unreasonably expensive on machines with lots of VFS activity (nfsd can cause lots of rename_lock updates). Thus it should be up to userspace how many times they wish to retry the lookup -- the selftests for this attack indicate that there is a ~35% chance of the lookup succeeding on the first try even with an attacker thrashing rename_lock. A variant of the above attack is included in the selftests for openat2(2) later in this patch series. I've run this test on several machines for several days and no instances of a breakout were detected. While this is not concrete proof that this is safe, when combined with the above argument it should lend some trustworthiness to this construction. [*] It may be acceptable in the future to do a path_is_under() check for magic-links after they are resolved. However this seems unlikely to be a feature that people *really* need -- it can be added later if it turns out a lot of people want it.
[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/
Cc: Christian Brauner <christian.brauner@ubuntu.com> Suggested-by: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
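A hedged userspace sketch of the bounded retry the commit suggests, reusing the open_in_root() wrapper from the openat2 sketch above; here a -1 return with errno == EAGAIN signals a rename/mount race detected during ".." resolution, so the caller decides how often to retry:

    #include <errno.h>

    static int open_in_root_retry(int root_fd, const char *path, int max_tries)
    {
        int fd = -1;

        for (int i = 0; i < max_tries; i++) {
            fd = open_in_root(root_fd, path);
            if (fd >= 0 || errno != EAGAIN)
                break;
        }
        return fd;
    }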
-
Submitted by Aleksa Sarai
to #26323588 commit 8db52c7e7ee1bd861b6096fcafc0fe7d0f24a994 upstream.
/* Background. */ Container runtimes or other administrative management processes will often interact with root filesystems while in the host mount namespace, because the cost of doing a chroot(2) on every operation is too prohibitive (especially in Go, which cannot safely use vfork). However, a malicious program can trick the management process into doing operations on files outside of the root filesystem through careful crafting of symlinks. Most programs that need this feature have attempted to make this process safe, by doing all of the path resolution in userspace (with symlinks being scoped to the root of the malicious root filesystem). Unfortunately, this method is prone to foot-guns and usually such implementations have subtle security bugs. Thus, what userspace needs is a way to resolve a path as though it were in a chroot(2) -- with all absolute symlinks being resolved relative to the dirfd root (and ".." components being stuck under the dirfd root). It is much simpler and more straight-forward to provide this functionality in-kernel (because it can be done far more cheaply and correctly). More classical applications that also have this problem (which have their own potentially buggy userspace path sanitisation code) include web servers, archive extraction tools, network file servers, and so on.
/* Userspace API. */ LOOKUP_IN_ROOT will be exposed to userspace through openat2(2).
/* Semantics. */ Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW), LOOKUP_IN_ROOT applies to all components of the path. With LOOKUP_IN_ROOT, any path component which attempts to cross the starting point of the pathname lookup (the dirfd passed to openat) will remain at the starting point. Thus, all absolute paths and symlinks will be scoped within the starting point. There is a slight change in behaviour regarding pathnames -- if the pathname is absolute then the dirfd is still used as the root of resolution if LOOKUP_IN_ROOT is specified (this is to avoid obvious foot-guns, at the cost of a minor API inconsistency). As with LOOKUP_BENEATH, Jann's security concern about ".."[1] applies to LOOKUP_IN_ROOT -- therefore ".." resolution is blocked. This restriction will be lifted in a future patch, but requires more work to ensure that permitting ".." is done safely. Magic-link jumps are also blocked, because they can beam the path lookup across the starting point. It would be possible to detect and block only the "bad" crossings with path_is_under() checks, but it's unclear whether it makes sense to permit magic-links at all. However, userspace is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that magic-link crossing is entirely disabled.
/* Testing. */ LOOKUP_IN_ROOT is tested as part of the openat2(2) selftests.
[1]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Aleksa Sarai
to #26323588 commit adb21d2b526f7f196b2f3fdca97d80ba05dd14a0 upstream.
/* Background. */ There are many circumstances when userspace wants to resolve a path and ensure that it doesn't go outside of a particular root directory during resolution. Obvious examples include archive extraction tools, as well as other security-conscious userspace programs. FreeBSD spun out O_BENEATH from their Capsicum project[1,2], so it also seems reasonable to implement similar functionality for Linux. This is part of a refresh of Al's AT_NO_JUMPS patchset[3] (which was a variation on David Drysdale's O_BENEATH patchset[4], which in turn was based on the Capsicum project[5]).
/* Userspace API. */ LOOKUP_BENEATH will be exposed to userspace through openat2(2).
/* Semantics. */ Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW), LOOKUP_BENEATH applies to all components of the path. With LOOKUP_BENEATH, any path component which attempts to "escape" the starting point of the filesystem lookup (the dirfd passed to openat) will yield -EXDEV. Thus, all absolute paths and symlinks are disallowed. Due to a security concern brought up by Jann[6], any ".." path components are also blocked. This restriction will be lifted in a future patch, but requires more work to ensure that permitting ".." is done safely. Magic-link jumps are also blocked, because they can beam the path lookup across the starting point. It would be possible to detect and block only the "bad" crossings with path_is_under() checks, but it's unclear whether it makes sense to permit magic-links at all. However, userspace is recommended to pass LOOKUP_NO_MAGICLINKS if they want to ensure that magic-link crossing is entirely disabled.
/* Testing. */ LOOKUP_BENEATH is tested as part of the openat2(2) selftests.
[1]: https://reviews.freebsd.org/D2808
[2]: https://reviews.freebsd.org/D17547
[3]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[4]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[5]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/
[6]: https://lore.kernel.org/lkml/CAG48ez1jzNvxB+bfOBnERFGp=oMM0vHWuLD6EULmne3R6xa53w@mail.gmail.com/
Cc: Christian Brauner <christian.brauner@ubuntu.com> Suggested-by: David Drysdale <drysdale@google.com> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Andy Lutomirski <luto@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Al Viro
to #26323588 commit 84a2bd39405ffd5fa6d6d77e408c5b9210da98de upstream. The rules for nd->root are messy:
* if we have LOOKUP_ROOT, it doesn't contribute to refcounts
* if we have LOOKUP_RCU, it doesn't contribute to refcounts
* if nd->root.mnt is NULL, it doesn't contribute to refcounts
* otherwise it does contribute
terminate_walk() needs to drop the references if they are contributing. So everything else should be careful not to confuse it, leading to rather convoluted code. It's easier to keep track of whether we'd grabbed the reference(s) explicitly. Use a new flag for that. Don't bother with zeroing nd->root.mnt on unlazy failures and in terminate_walk - it's not needed anymore (terminate_walk() won't care and the next path_init() will zero nd->root in !LOOKUP_ROOT case anyway). Resulting rules for nd->root refcounts are much simpler: they are contributing iff LOOKUP_ROOT_GRABBED is set in nd->flags. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-