Commits · 0a73d1631a54b9695bbcfdb03ebc5bdd0f10836e · openanolis / cloud-kernel

May 28, 2020 · 10 commits

fs/namei.c: keep track of nd->root refcount status · 0a73d163

Authored by Al Viro on Jul 16, 2019

to #26323588

commit 84a2bd39405ffd5fa6d6d77e408c5b9210da98de upstream.

The rules for nd->root are messy:
	* if we have LOOKUP_ROOT, it doesn't contribute to refcounts
	* if we have LOOKUP_RCU, it doesn't contribute to refcounts
	* if nd->root.mnt is NULL, it doesn't contribute to refcounts
	* otherwise it does contribute

terminate_walk() needs to drop the references if they are contributing.
So everything else should be careful not to confuse it, leading to
rather convoluted code.

It's easier to keep track of whether we'd grabbed the reference(s)
explicitly.  Use a new flag for that.  Don't bother with zeroing
nd->root.mnt on unlazy failures and in terminate_walk - it's not
needed anymore (terminate_walk() won't care and the next path_init()
will zero nd->root in !LOOKUP_ROOT case anyway).

Resulting rules for nd->root refcounts are much simpler: they are
contributing iff LOOKUP_ROOT_GRABBED is set in nd->flags.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

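The rule described in this commit can be pictured with a small userspace sketch. This is not the kernel code: the types toy_root and toy_nameidata, the helper names, and the flag value are assumptions made up for illustration. The point it shows is that the walk drops the root reference only when the flag records that one was taken.

```c
#include <assert.h>

/* Hypothetical flag value for this sketch; the kernel's value may differ. */
#define LOOKUP_ROOT_GRABBED 0x0008u

struct toy_root {
    int refcount;
};

struct toy_nameidata {
    unsigned int flags;
    struct toy_root *root;
};

/* Take a reference on the root and record that this walk holds it. */
static void toy_grab_root(struct toy_nameidata *nd)
{
    assert(!(nd->flags & LOOKUP_ROOT_GRABBED));
    nd->root->refcount++;
    nd->flags |= LOOKUP_ROOT_GRABBED;
}

/* Drop the root reference if and only if the flag says we hold one. */
static void toy_terminate_walk(struct toy_nameidata *nd)
{
    if (nd->flags & LOOKUP_ROOT_GRABBED) {
        nd->root->refcount--;
        nd->flags &= ~LOOKUP_ROOT_GRABBED;
    }
}
```

With the flag as the single source of truth, the cleanup path no longer has to infer ownership from LOOKUP_ROOT, LOOKUP_RCU, or a NULL nd->root.mnt.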

fs/namei.c: new helper - legitimize_root() · 40107f78

Authored by Al Viro on Jul 16, 2019

to #26323588

commit ee594bfff389aa9105f713135211c0da736e5698 upstream.

The logic in unlazy_walk() and unlazy_child() is identical; move it into a helper.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

namei: LOOKUP_NO_XDEV: block mountpoint crossing · 38549ac1

Authored by Aleksa Sarai on Dec 07, 2019

to #26323588

commit 72ba29297e1439efaa54d9125b866ae9d15df339 upstream.

/* Background. */
The need to contain path operations within a mountpoint has been a
long-standing usecase that userspace has historically implemented
manually with liberal usage of stat(). find, rsync, tar and
many other programs implement these semantics -- but it'd be much
simpler to have a fool-proof way of refusing to open a path if it
crosses a mountpoint.

This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
variation on David Drysdale's O_BENEATH patchset[2], which in turn was
based on the Capsicum project[3]).

/* Userspace API. */
LOOKUP_NO_XDEV will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_XDEV applies to all components of the path.

With LOOKUP_NO_XDEV, any path component which crosses a mount-point
during path resolution (including "..") will yield an -EXDEV. Absolute
paths, absolute symlinks, and magic-links will only yield an -EXDEV if
the jump involved changing mount-points.

/* Testing. */
LOOKUP_NO_XDEV is tested as part of the openat2(2) selftests.

[1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

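The LOOKUP_NO_XDEV semantics can be illustrated with a toy model. This is only a sketch under simplifying assumptions: each resolved path component is reduced to the ID of the mount it lives on, and the function name toy_walk_no_xdev is hypothetical. It does not model how the kernel actually walks paths.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Toy model: component_mnt[i] is the mount ID reached after resolving
 * component i (a plain name, "..", or a jump all look the same here).
 * Any step that lands on a different mount than the starting one fails.
 */
static int toy_walk_no_xdev(int start_mnt, const int *component_mnt, size_t n)
{
    assert(component_mnt != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (component_mnt[i] != start_mnt)
            return -EXDEV;  /* this step crossed a mount point */
    }
    return 0;
}
```

A jump that stays on the starting mount is allowed in this model, which mirrors the statement that absolute paths and magic-links only fail when the jump changes mount points.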

namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution · 773b3516

Authored by Aleksa Sarai on Dec 07, 2019

to #26323588

commit 4b99d4996979d582859c5a49072e92de124bf691 upstream.

/* Background. */
There has always been a special class of symlink-like objects in procfs
(and a few other pseudo-filesystems) which allow for non-lexical
resolution of paths using nd_jump_link(). These "magic-links" do not
follow traditional mount namespace boundaries, and have been used
consistently in container escape attacks because they can be used to
trick unsuspecting privileged processes into resolving unexpected paths.

It is also non-trivial for userspace to unambiguously avoid resolving
magic-links, because they do not have a reliable indication that they
are a magic-link (in order to verify them you'd have to manually open
the path given by readlink(2) and then verify that the two file
descriptors reference the same underlying file, which is plagued with
possible race conditions or supplementary attack scenarios).

It would therefore be very helpful for userspace to be able to avoid
these symlinks easily, thus hopefully removing a tool from attackers'
toolboxes.

This is part of a refresh of Al's AT_NO_JUMPS patchset[1] (which was a
variation on David Drysdale's O_BENEATH patchset[2], which in turn was
based on the Capsicum project[3]).

/* Userspace API. */
LOOKUP_NO_MAGICLINKS will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_MAGICLINKS applies to all components of the path.

With LOOKUP_NO_MAGICLINKS, any magic-link path component encountered
during path resolution will yield -ELOOP. The handling of ~LOOKUP_FOLLOW
for a trailing magic-link is identical to LOOKUP_NO_SYMLINKS.

LOOKUP_NO_SYMLINKS implies LOOKUP_NO_MAGICLINKS.

/* Testing. */
LOOKUP_NO_MAGICLINKS is tested as part of the openat2(2) selftests.

[1]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[2]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[3]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: David Drysdale <drysdale@google.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

namei: LOOKUP_NO_SYMLINKS: block symlink resolution · 35e8a4ee

Authored by Aleksa Sarai on Dec 07, 2019

to #26323588

commit 278121417a72d87fb29dd8c48801f80821e8f75a upstream.

/* Background. */
Userspace cannot easily resolve a path without resolving symlinks, and
would have to manually resolve each path component with O_PATH and
O_NOFOLLOW. This is clearly inefficient, and can be fairly easy to screw
up (resulting in possible security bugs). Linus has mentioned that Git
has a particular need for this kind of flag[1]. It also resolves a
fairly long-standing perceived deficiency in O_NOFOLLOW -- that it only
blocks the opening of trailing symlinks.

This is part of a refresh of Al's AT_NO_JUMPS patchset[2] (which was a
variation on David Drysdale's O_BENEATH patchset[3], which in turn was
based on the Capsicum project[4]).

/* Userspace API. */
LOOKUP_NO_SYMLINKS will be exposed to userspace through openat2(2).

/* Semantics. */
Unlike most other LOOKUP flags (most notably LOOKUP_FOLLOW),
LOOKUP_NO_SYMLINKS applies to all components of the path.

With LOOKUP_NO_SYMLINKS, any symlink path component encountered during
path resolution will yield -ELOOP. If the trailing component is a
symlink (and no other components were symlinks), then O_PATH|O_NOFOLLOW
will not error out and will instead provide a handle to the trailing
symlink -- without resolving it.

/* Testing. */
LOOKUP_NO_SYMLINKS is tested as part of the openat2(2) selftests.

[1]: https://lore.kernel.org/lkml/CA+55aFyOKM7DW7+0sdDFKdZFXgptb5r1id9=Wvhd8AgSP7qjwQ@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/20170429220414.GT29622@ZenIV.linux.org.uk/
[3]: https://lore.kernel.org/lkml/1415094884-18349-1-git-send-email-drysdale@google.com/
[4]: https://lore.kernel.org/lkml/1404124096-21445-1-git-send-email-drysdale@google.com/

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

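The LOOKUP_NO_SYMLINKS rule, including the exception for a trailing symlink opened with O_PATH|O_NOFOLLOW, can be sketched as a toy decision function. The name toy_walk_no_symlinks and the boolean-array representation of a path are assumptions for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: is_symlink[i] says whether component i is a symlink.
 * path_nofollow stands for the caller having asked for O_PATH|O_NOFOLLOW.
 */
static int toy_walk_no_symlinks(const bool *is_symlink, size_t n,
                                bool path_nofollow)
{
    assert(is_symlink != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        bool trailing = (i + 1 == n);

        if (!is_symlink[i])
            continue;
        if (trailing && path_nofollow)
            return 0;       /* hand back the symlink itself, unresolved */
        return -ELOOP;      /* any other symlink is refused */
    }
    return 0;
}
```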

namei: allow set_root() to produce errors · 53a5271b

Authored by Aleksa Sarai on Dec 07, 2019

to #26323588

commit 740a16782750a5b6c7d1609a9c09641ce6753ea6 upstream.

For LOOKUP_BENEATH and LOOKUP_IN_ROOT it is necessary to ensure that
set_root() is never called, and thus (for hardening purposes) it should
return an error rather than permit a breakout from the root. In
addition, move all of the repetitive set_root() calls to nd_jump_root().
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

namei: allow nd_jump_link() to produce errors · b63f56c4

Authored by Aleksa Sarai on Dec 07, 2019

to #26323588

commit 1bc82070fa2763bdca626fa8bde72b35f11e8960 upstream.

In preparation for LOOKUP_NO_MAGICLINKS, it's necessary to add the
ability for nd_jump_link() to return an error which the corresponding
get_link() caller must propagate back up to the VFS.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

nsfs: clean-up ns_get_path() signature to return int · 39f702ab

Authored by Aleksa Sarai on Dec 07, 2019

to #26323588

commit ce623f89872df4253719be71531116751eeab85f upstream.

ns_get_path() and ns_get_path_cb() only ever return either NULL or an
ERR_PTR. It is far more idiomatic to simply return an integer, and it
makes all of the callers of ns_get_path() more straightforward to read.

Fixes: e149ed2b ("take the targets of /proc/*/ns/* symlinks to separate fs")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

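The motivation for this signature change can be sketched in userspace. The toy_* helpers below are stand-ins that imitate the shape of the kernel's ERR_PTR idiom; they are assumptions for illustration, not the kernel's definitions. When a function can only ever return NULL or an error pointer, an int carries the same information with less ceremony.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (userspace imitation). */
static void *toy_err_ptr(long err)
{
    assert(err < 0 && err >= -TOY_MAX_ERRNO);
    return (void *)(intptr_t)err;
}

static long toy_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

static int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Old style: the result is either NULL (success) or an error pointer. */
static void *toy_get_path_old(int fail)
{
    return fail ? toy_err_ptr(-EAGAIN) : NULL;
}

/* New style: the same two outcomes as a plain int. */
static int toy_get_path_new(int fail)
{
    return fail ? -EAGAIN : 0;
}
```

A caller of the old style has to write something like `if (toy_is_err(p)) return toy_ptr_err(p);`, while the new style reduces to `if (ret) return ret;`.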

namei: only return -ECHILD from follow_dotdot_rcu() · 8eef64a2

Authored by Aleksa Sarai on Dec 07, 2019

to #26323588

commit 2b98149c2377bff12be5dd3ce02ae0506e2dd613 upstream.

It's over-zealous to return hard errors under RCU-walk here, given that
a REF-walk will be triggered for all other cases handling ".." under
RCU.

The original purpose of this check was to ensure that if a rename occurs
such that a directory is moved outside of the bind-mount which the
resolution started in, it would be detected and blocked to avoid being
able to mess with paths outside of the bind-mount. However, triggering a
new REF-walk is just as effective a solution.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: add support for fallocate() · 41076b20

Authored by Jens Axboe on Dec 10, 2019

to #26323588

commit d63d1b5edb7b832210bfde587ba9e7549fa064eb upstream.

This exposes fallocate(2) through io_uring.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

May 27, 2020 · 30 commits

io_uring: don't cancel all work on process exit · 072f3d91

Authored by Jens Axboe on Jan 26, 2020

to #26323578

commit ebe10026210f9ea740b9a050ee84a166690fddde upstream.

If we're sharing the ring across forks, then one process exiting means
that we cancel ALL work and prevent future work. This is overly
restrictive. As long as we cancel the work associated with the files
from the current task, it's safe to let others persist. Normal fd close
on exit will still wait (and cancel) pending work.

Fixes: fcb323cc53e2 ("io_uring: io_uring: add support for async work inheriting files")
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: fix compat for IORING_REGISTER_FILES_UPDATE · ff2cdc98

Authored by Eugene Syromiatnikov on Jan 15, 2020

to #26323578

commit 1292e972fff2b2d81e139e0c2fe5f50249e78c58 upstream.

fds field of struct io_uring_files_update is problematic with regards
to compat user space, as pointer size is different in 32-bit, 32-on-64-bit,
and 64-bit user space.  In order to avoid custom handling of compat in
the syscall implementation, make fds __u64 and use u64_to_user_ptr in
order to retrieve it.  Also, align the field naturally and check that
no garbage is passed there.

Fixes: c3a31e605620c279 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

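The compat problem and its fix can be sketched with a fixed-width layout. The struct name toy_files_update and the helpers are hypothetical; the field layout is an assumption made in the spirit of the commit message, not a copy of the kernel header. Carrying the pointer as a 64-bit integer gives the structure the same size and offsets for 32-bit and 64-bit user space.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: pointer carried as u64, preceded by a reserved field. */
struct toy_files_update {
    uint32_t offset;
    uint32_t resv;   /* must be zero, so garbage is rejected */
    uint64_t fds;    /* user pointer stored as a fixed-width integer */
};

/* The layout does not depend on the native pointer size. */
_Static_assert(sizeof(struct toy_files_update) == 16, "fixed size");
_Static_assert(offsetof(struct toy_files_update, fds) == 8, "aligned field");

/* Userspace analogue of u64_to_user_ptr(). */
static void *toy_u64_to_ptr(uint64_t v)
{
    return (void *)(uintptr_t)v;
}

/* Reject requests that pass garbage in the reserved field. */
static int toy_validate(const struct toy_files_update *up)
{
    assert(up != NULL);
    return up->resv ? -EINVAL : 0;
}
```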

io_uring: ensure workqueue offload grabs ring mutex for poll list · 8265261d

Authored by Jens Axboe on Jan 15, 2020

to #26323578

commit 11ba820bf163e224bf5dd44e545a66a44a5b1d7a upstream.

A previous commit moved the locking for the async sqthread, but didn't
take into account that the io-wq workers still need it. We can't use
req->in_async for this anymore as both the sqthread and io-wq workers
set it, so gate the need for locking on io_wq_current_is_worker() instead.

Fixes: 8a4955ff1cca ("io_uring: sqthread should grab ctx->uring_lock for submissions")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: clear req->result always before issuing a read/write request · 813322dd

Authored by Bijan Mottahedeh on Jan 15, 2020

to #26323578

commit 797f3f535d59f05ad12c629338beef6cb801d19e upstream.

req->result is cleared when io_issue_sqe() calls io_read/write_pre()
routines.  Those routines however are not called when the sqe
argument is NULL, which is the case when io_issue_sqe() is called from
io_wq_submit_work().  io_issue_sqe() may then examine a stale result if
a polled request had previously failed with -EAGAIN:

        if (ctx->flags & IORING_SETUP_IOPOLL) {
                if (req->result == -EAGAIN)
                        return -EAGAIN;

                io_iopoll_req_issued(req);
        }

and in turn cause a subsequently completed request to be re-issued in
io_wq_submit_work().
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: be consistent in assigning next work from handler · 7e63ce0d

Authored by Jens Axboe on Jan 14, 2020

to #26323578

commit 78912934f4f7dd7a424159c69bf9bdd46e823781 upstream.

If we pass back dependent work in case of links, we need to always
ensure that we call the link setup and work prep handler. If not, we
might be missing some setup for the next work item.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io-wq: cancel work if we fail getting a mm reference · 6b68e78f

Authored by Jens Axboe on Jan 14, 2020

to #26323578

commit e0bbb3461ae000baec13e8ec5b5063202df228df upstream.

If we require mm and user context, mark the request for cancellation
if we fail to acquire the desired mm.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: don't setup async context for read/write fixed · fcd3cdab

Authored by Jens Axboe on Jan 13, 2020

to #26323578

commit 74566df3a71c1b92da608868cca787557d8be7b2 upstream.

We don't need it, and if we have it, then the retry handler will attempt
to copy the non-existent iovec with the inline iovec, with a segment
count that doesn't make sense.

Fixes: f67676d160c6 ("io_uring: ensure async punted read/write requests copy iovec")
Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: remove punt of short reads to async context · fdb55458

Authored by Jens Axboe on Jan 07, 2020

to #26323578

commit eacc6dfaea963ef61540abb31ad7829be5eff284 upstream.

We currently punt any short read on a regular file to async context,
but this fails if the short read is due to running into EOF. This is
especially problematic since we only do the single prep for commands
now, as we don't reset kiocb->ki_pos. This can result in a 4k read on
a 1k file returning zero, as we detect the short read and then retry
from async context. At the time of retry, the position is now 1k, and
we end up reading nothing, and hence return 0.

Instead of trying to patch around the fact that short reads can be
legitimate and won't succeed in case of retry, remove the logic to punt
a short read to async context. Simply return it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

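The 4k-read-on-a-1k-file failure can be reproduced in a toy model. The in-memory toy_file type and both functions are hypothetical; the sketch only assumes that the file position is advanced by the first attempt and is not reset before the retry, as the commit message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file {
    const char *data;
    size_t size;
    size_t pos;      /* plays the role of kiocb->ki_pos */
};

/* Read up to len bytes from the current position and advance it. */
static size_t toy_read(struct toy_file *f, char *buf, size_t len)
{
    assert(f->pos <= f->size);
    size_t left = f->size - f->pos;
    size_t n = len < left ? len : left;

    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return n;
}

/* Buggy pattern: every short read is retried without resetting pos. */
static size_t toy_read_with_retry(struct toy_file *f, char *buf, size_t len)
{
    size_t n = toy_read(f, buf, len);

    if (n < len)
        n = toy_read(f, buf, len);  /* pos is already at EOF */
    return n;
}
```

In this model the retry discards the 1024 bytes that were read and reports 0, whereas returning the short read directly reports 1024.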

io-wq: add cond_resched() to worker thread · 6c0002d3

Authored by Hillf Danton on Dec 24, 2019

to #26323578

commit fd1c4bc6e9b34a5e4fe7a3130a49380ef9d7037c upstream.

Reschedule the current IO worker to reduce the risk of it becoming
a CPU hog.
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io-wq: remove unused busy list from io_sqe · 35521a06

Authored by Hillf Danton on Dec 22, 2019

to #26323578

commit 1f424e8bd18754d27b15f49359004b0cea344fb5 upstream.

Commit e61df66c69b1 ("io-wq: ensure free/busy list browsing see all
items") added a list for io workers in addition to the free and busy
lists, not only making worker walk cleaner, but leaving the busy list
unused. Let's remove it.
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: pass in 'sqe' to the prep handlers · be421039

Authored by Jens Axboe on Dec 19, 2019

to #26323578

commit 3529d8c2b353e6e446277ae96a36e7471cb070fc upstream.

This moves the prep handlers outside of the opcode handlers, and allows
us to pass in the sqe directly. If the sqe is non-NULL, it means that
the request should be prepared for the first time.

With the opcode handlers not having access to the sqe at all, we are
guaranteed that the prep handler has set up the request fully by the
time we get there. As before, for opcodes that need to copy in more
data than the io_kiocb allows for, the io_async_ctx holds that info. If
a prep handler is invoked with req->io set, it must use that to retain
information for later.

Finally, we can remove io_kiocb->sqe as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: standardize the prep methods · c3c861b4

Authored by Jens Axboe on Dec 19, 2019

to #26323578

commit 06b76d44ba25e52711dc7cc4fc75b50907bc6b8e upstream.

We currently have a mix of use cases. Most of the newer ones are pretty
uniform, but we have some older ones that use different calling
conventions. This is confusing.

For the opcodes that currently rely on the req->io->sqe copy saving
them from reuse, add a request type struct in the io_kiocb command
union to store the data they need.

Prepare for all opcodes having a standard prep method, so we can call
it in a uniform fashion and outside of the opcode handler. This is in
preparation for passing in the 'sqe' pointer, rather than storing it
in the io_kiocb. Once we have uniform prep handlers, we can leave all
the prep work to that part, and not even pass in the sqe to the opcode
handler. This ensures that we don't reuse sqe data inadvertently.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler · 3faa7027

Authored by Jens Axboe on Dec 20, 2019

to #26323578

commit 26a61679f10c6f041726411964b172565021c2eb upstream.

Add the count field to struct io_timeout, and ensure the prep handler
has read it. Timeout also always needs an async context; set it up
in the prep handler if we don't have one.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler · d1b66a4f

Authored by Jens Axboe on Dec 20, 2019

to #26323578

commit e47293fdf98998292a89d516c8f7b8b9eb5c5213 upstream.

Add struct io_sr_msg in our io_kiocb per-command union, and ensure that
the send/recvmsg prep handlers have grabbed what they need from the SQE
by the time prep is done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: move all prep state for IORING_OP_CONNECT to prep handler · 6ea60de6

Authored by Jens Axboe on Dec 20, 2019

to #26323578

commit 3fbb51c18f5c15a23db74c4da79d3d035176c480 upstream.

Add struct io_connect in our io_kiocb per-command union, and ensure
that io_connect_prep() has grabbed what it needs from the SQE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: add and use struct io_rw for read/writes · c5d6bd8a

Authored by Jens Axboe on Dec 20, 2019

to #26323578

commit 9adbd45d6d32ffc1a03f3c51d72cfc69ebfc2ddb upstream.

Put the kiocb in struct io_rw, and add the addr/len for the request as
well. Use the kiocb->private field for the buffer index for fixed reads
and writes.

Any use of kiocb->ki_filp is flipped to req->file. It's the same thing,
and less confusing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: use u64_to_user_ptr() consistently · 08ba131e

Authored by Jens Axboe on Dec 11, 2019

to #26323578

commit d55e5f5b70dd6214ef81fb2313121b72a7dd2200 upstream.

We use it in some spots, but not consistently. Convert the rest over,
which makes the code easier to read as well.

No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: io_wq_submit_work() should not touch req->rw · 531bd675

Authored by Jens Axboe on Dec 18, 2019

to #26323578

commit fd6c2e4c063d64511657ad0031a1677b6a914859 upstream.

I've been chasing a weird and obscure crash that was userspace stack
corruption, and finally narrowed it down to a bit flip that made a
stack address invalid. io_wq_submit_work() unconditionally flips
the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work
handler, this isn't valid. Normal read/write operations own that
part of the request, on other types it could be something else.

Move the IOCB_NOWAIT clear to the read/write handlers where it belongs.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

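The class of bug described here, a generic handler writing through one member of a per-command union while another member is live, can be shown with a toy request type. Everything below (toy_req, the is_rw field, TOY_NOWAIT and its value) is an assumption for illustration, not the io_uring layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NOWAIT 0x1u   /* hypothetical flag bit */

struct toy_req {
    int is_rw;                                 /* which union member is live */
    union {
        struct { unsigned int ki_flags; } rw;  /* read/write requests */
        struct { uintptr_t addr; } other;      /* some other opcode's data */
    };
};

/* Buggy generic handler: clears the bit whatever the request type is. */
static void toy_generic_clear(struct toy_req *req)
{
    assert(req != NULL);
    req->rw.ki_flags &= ~TOY_NOWAIT;
}

/* Fixed: only the read/write path touches rw.ki_flags. */
static void toy_rw_clear(struct toy_req *req)
{
    assert(req != NULL);
    if (req->is_rw)
        req->rw.ki_flags &= ~TOY_NOWAIT;
}
```

For a non-read/write request, toy_generic_clear() silently changes a bit of other.addr, which is the kind of single-bit corruption the commit message traces back to a bad stack address.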

io_uring: don't wait when under-submitting · 13bd4808

Authored by Pavel Begunkov on Dec 18, 2019

to #26323578

commit 7c504e65206a4379ff38fe41d21b32b6c2c3e53e upstream.

There is no reliable way to submit and wait in a single syscall, as
io_submit_sqes() may under-consume sqes (in case of an early error).
Then it will wait for not-yet-submitted requests, deadlocking the user
in most cases.

Don't wait/poll if we can't submit all sqes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

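The decision this commit describes can be written as a one-line predicate. The function name toy_may_wait and its parameters are hypothetical; the sketch only encodes the stated rule that waiting is skipped whenever fewer sqes were consumed than requested.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether the submit-and-wait path may go on to wait.
 * submitted:    how many sqes were actually consumed
 * to_submit:    how many the caller asked to submit
 * min_complete: how many completions the caller wants to wait for
 */
static bool toy_may_wait(unsigned int submitted, unsigned int to_submit,
                         unsigned int min_complete)
{
    assert(submitted <= to_submit);
    if (submitted != to_submit)
        return false;   /* under-submitted: waiting could block forever */
    return min_complete > 0;
}
```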

io_uring: warn about unhandled opcode · 861c7c78

Authored by Jens Axboe on Dec 17, 2019

to #26323578

commit e781573e2fb1b75acdba61dcb9bcbfc16f288442 upstream.

Now that we have all the opcodes handled in terms of command prep and
SQE reuse, add a printk_once() to warn about any potentially new and
unhandled ones.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: read opcode and user_data from SQE exactly once · 96baaf97

Authored by Jens Axboe on Dec 17, 2019

to #26323578

commit d625c6ee4975000140c57da7e1ff244efefde274 upstream.

If we defer a request, we can't be reading the opcode again. Ensure that
the user_data and opcode fields are stable. For the user_data we already
have a place for it, for the opcode we can fill a one byte hole and store
that as well. For both of them, assign them when we originally read the
SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data
is switched to req->opcode and req->user_data.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

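The "read exactly once" idea can be sketched as a snapshot taken when the entry is fetched. The toy_sqe and toy_req types and the function toy_get_sqe are hypothetical stand-ins. The sketch assumes only what the message states: the submission entry is shared with the application and may be rewritten, so later code must use the private copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shared with the application, which may reuse it at any time. */
struct toy_sqe {
    uint8_t opcode;
    uint64_t user_data;
};

/* Private request state; stable for the lifetime of the request. */
struct toy_req {
    uint8_t opcode;
    uint64_t user_data;
};

/* Copy the fields once, at fetch time; never go back to the sqe later. */
static void toy_get_sqe(struct toy_req *req, const struct toy_sqe *sqe)
{
    assert(req != NULL && sqe != NULL);
    req->opcode = sqe->opcode;
    req->user_data = sqe->user_data;
}
```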

io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable · baf2a6a7

Authored by Jens Axboe on Dec 17, 2019

to #26323578

commit b29472ee7b53784f44011069fad15e539fd25bcf upstream.

If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the timeout remove op into
the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: make IORING_OP_CANCEL_ASYNC deferrable · bb92d9dd

Authored by Jens Axboe on Dec 17, 2019

to #26323578

commit fbf23849b1724d3ea362e346d0877a8d87978fe6 upstream.

If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the async cancel op into
the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable · 5027d877

Authored by Jens Axboe on Dec 17, 2019

to #26323578

commit 0969e783e3a8913f79df27286501a6c21e961524 upstream.

If we defer these commands as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the poll add/remove into
the prep handling to make it safe for SQE reuse.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: make HARDLINK imply LINK · 384d1eb0

Authored by Pavel Begunkov on Dec 17, 2019

to #26323578

commit ffbb8d6b76910d4f3a2bafeaf68c419011e98d05 upstream.

The rules are as follows: if IOSQE_IO_HARDLINK is specified, then it's a
link and there is no need to set IOSQE_IO_LINK separately, though it
could be there. Add a proper check and ensure that IOSQE_IO_HARDLINK
implies IOSQE_IO_LINK.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

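The implication rule can be sketched as two small helpers. The TOY_ flag names and their bit positions are assumptions for this sketch rather than values taken from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration. */
#define TOY_IOSQE_IO_LINK     (1u << 2)
#define TOY_IOSQE_IO_HARDLINK (1u << 3)

/* A request is part of a link chain if either flag is set. */
static bool toy_is_link(unsigned int flags)
{
    return (flags & (TOY_IOSQE_IO_LINK | TOY_IOSQE_IO_HARDLINK)) != 0;
}

/* Make the implication explicit: HARDLINK always carries LINK with it. */
static unsigned int toy_normalize_link_flags(unsigned int flags)
{
    if (flags & TOY_IOSQE_IO_HARDLINK)
        flags |= TOY_IOSQE_IO_LINK;
    assert(!(flags & TOY_IOSQE_IO_HARDLINK) || (flags & TOY_IOSQE_IO_LINK));
    return flags;
}
```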

io_uring: any deferred command must have stable sqe data · 1d0e0743

Authored by Jens Axboe on Dec 16, 2019

to #26323578

commit 8ed8d3c3bc32bf5b442c9f54013b4a47d5cae740 upstream.

We're currently not retaining sqe data for accept, fsync, and
sync_file_range. None of these commands need data outside of what
is directly provided, hence it can't go stale when the request is
deferred. However, it can get reused, if an application reuses
SQE entries.

Ensure that we retain the information we need and only read the sqe
contents once, off the submission path. Most of this is just moving
code into a prep and finish function.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: remove 'sqe' parameter to the OP helpers that take it · 16b249b5

Authored by Jens Axboe on Dec 10, 2019

to #26323578

commit fc4df999e24fc3006441acd4ce6250e6a76ac851 upstream.

We pass in req->sqe for all of them, no need to pass it in as the
request is always passed in. This is a necessary prep patch to be
able to cleanup/fix the request prep path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: fix pre-prepped issue with force_nonblock == true · c16f35ff

Authored by Jens Axboe on Dec 15, 2019

to #26323578

commit b7bb4f7da0a1a92f142697f1c9ce335e7a44f4b1 upstream.

Some of these code paths assume that any force_nonblock == true issue
is not prepped, but that's not true if we did prep as part of link setup
earlier. Check if we already have an async context allocated before
setting up a new one.

Clean up the async context setup in general; we have a lot of duplicated
code there.

Fixes: 03b1230ca12a ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Fixes: f67676d160c6 ("io_uring: ensure async punted read/write requests copy iovec")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG · c5eb3fd3

Authored by Jens Axboe on Dec 15, 2019

to #26323578

commit 0b416c3e1345fd696db4c422643468d844410877 upstream.

If we have to punt the recvmsg to async context, we copy all the
context.  But since the iovec used can be either on-stack (if small) or
dynamically allocated, if it's on-stack, then we need to ensure we reset
the iov pointer. If we don't, then we're reusing old stack data, and
that can lead to -EFAULTs if things get overwritten.

Ensure we retain the right pointers for the iov, and free it as well if
we end up having to go beyond UIO_FASTIOV number of vectors.

Fixes: 03b1230ca12a ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: fix stale comment and a few typos · 29e01b6a

Authored by Brian Gianforcaro on Dec 13, 2019

to #26323578

commit d195a66e367b3d24fdd3c3565f37ab7c6882b9d2 upstream.

- Fix a few typos found while reading the code.

- Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter
  was renamed to 'req', but the comment still holds.
Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

openanolis / cloud-kernel · mirror synced successfully more than 1 year ago