Commits · c5eb3fd38528f950c4fbf0ebbe154d60fd483095 · openanolis / cloud-kernel

May 27, 2020 · 40 commits

io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG · c5eb3fd3

Committed by Jens Axboe on Dec 15, 2019

to #26323578

commit 0b416c3e1345fd696db4c422643468d844410877 upstream.

If we have to punt the recvmsg to async context, we copy all the
context.  But since the iovec used can be either on-stack (if small) or
dynamically allocated, if it's on-stack, then we need to ensure we reset
the iov pointer. If we don't, then we're reusing old stack data, and
that can lead to -EFAULTs if things get overwritten.

Ensure we retain the right pointers for the iov, and free it as well if
we end up having to go beyond UIO_FASTIOV number of vectors.

Fixes: 03b1230ca12a ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
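
The pointer problem described in this commit can be illustrated in userspace. The following is a minimal sketch, not the kernel code: `struct async_msg`, `FAST_IOV`, and both helper functions are hypothetical names chosen for illustration. It shows why a plain struct copy is not enough when the vector lives in inline storage.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

#define FAST_IOV 8 /* stand-in for the kernel's UIO_FASTIOV */

/* Hypothetical per-request message state; not the kernel's structure. */
struct async_msg {
    struct iovec fast_iov[FAST_IOV]; /* inline storage for small vectors */
    struct iovec *iov;               /* points at fast_iov or a heap array */
};

/* Returns nonzero when the vector lives in the inline array. */
static int uses_inline_iov(const struct async_msg *m)
{
    return m->iov == m->fast_iov;
}

/*
 * Copy the message state for deferred execution. A plain struct copy
 * leaves dst->iov pointing into src (typically a stack frame that is
 * about to go away), so re-point it at dst's own inline array.
 */
static void copy_for_async(struct async_msg *dst, const struct async_msg *src)
{
    memcpy(dst, src, sizeof(*dst));
    if (uses_inline_iov(src))
        dst->iov = dst->fast_iov;
    /* Otherwise iov is heap-allocated and dst now owns it and must free it. */
}
```

Without the re-pointing step, `dst->iov` would keep referring to the source's inline array, which is the stale stack data the commit message describes.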

io_uring: fix stale comment and a few typos · 29e01b6a

Committed by Brian Gianforcaro on Dec 13, 2019

to #26323578

commit d195a66e367b3d24fdd3c3565f37ab7c6882b9d2 upstream.

- Fix a few typos found while reading the code.

- Fix the stale io_get_sqring comment referencing s->sqe: the 's' parameter
  was renamed to 'req', but the comment still holds.
Signed-off-by: Brian Gianforcaro <b.gianfo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: ensure we return -EINVAL on unknown opcode · 569f4461

Committed by Jens Axboe on Dec 11, 2019

to #26323578

commit 9e3aa61ae3e01ce1ce6361a41ef725e1f4d1d2bf upstream.

If we submit an unknown opcode and have fd == -1, io_op_needs_file()
will return true as we default to needing a file. Then when we go and
assign the file, we find the 'fd' invalid and return -EBADF. We really
should be returning -EINVAL for that case, as we normally do for
unsupported opcodes.

Change io_op_needs_file() to have the following return values:

0   - does not need a file
1   - does need a file
< 0 - error value

and use this to pass back the right value for this invalid case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
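
The three-way return convention above can be sketched as follows. This is an illustrative userspace model, not the kernel function: the `demo_` names and opcode values are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical opcodes for illustration; not the real IORING_OP_* values. */
enum demo_op { DEMO_OP_NOP, DEMO_OP_READV, DEMO_OP_TIMEOUT, DEMO_OP_LAST };

/* 0: no file needed, 1: file needed, < 0: error for unknown opcodes. */
static int demo_op_needs_file(int op)
{
    switch (op) {
    case DEMO_OP_NOP:
    case DEMO_OP_TIMEOUT:
        return 0;
    case DEMO_OP_READV:
        return 1;
    default:
        return -EINVAL;
    }
}

/* Resolve the file for a request; the opcode check runs before the fd check. */
static int demo_req_set_file(int op, int fd)
{
    int ret = demo_op_needs_file(op);

    if (ret <= 0)
        return ret;      /* no file needed, or unknown opcode */
    if (fd < 0)
        return -EBADF;   /* valid opcode, invalid descriptor */
    return 0;
}
```

With this ordering, an unknown opcode submitted with fd == -1 yields -EINVAL rather than -EBADF.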

io_uring: add sockets to list of files that support non-blocking issue · d935af1c

Committed by Jens Axboe on Dec 09, 2019

to #26323578

commit 10d59345578a116042c1a5d737a18234aaf3e0e6 upstream.

In chasing a performance issue between using IORING_OP_RECVMSG and
IORING_OP_READV on sockets, tracing showed that we always punt the
socket reads to async offload. This is due to io_file_supports_async()
not checking for S_ISSOCK on the inode. Since sockets support the
O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list
of file types that we can do a non-blocking issue to.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
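
A file-type check of this kind might look like the sketch below. The function name is hypothetical, and the set of other file types listed (block, character, regular) is an assumption; the commit text only confirms that sockets were added to the list.

```c
#define _XOPEN_SOURCE 700 /* for S_ISSOCK and S_IFSOCK on strict compilers */
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Decide from the inode mode whether a non-blocking issue attempt is
 * worthwhile. Sockets are included because they honour O_NONBLOCK and
 * MSG_DONTWAIT. The other types listed here are assumed for illustration.
 */
static bool demo_mode_supports_nonblocking(mode_t mode)
{
    return S_ISBLK(mode) || S_ISCHR(mode) || S_ISREG(mode) || S_ISSOCK(mode);
}
```

File types that fail the check (a FIFO in this sketch) would still be punted straight to async offload.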

io_uring: only hash regular files for async work execution · 1efbc3bd

Committed by Jens Axboe on Dec 09, 2019

to #26323578

commit 53108d476a105ab2597d7a4e6040b127829391b5 upstream.

We hash regular files to avoid having multiple threads hammer on the
inode mutex, but it should not be needed on other types of files
(like sockets).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: run next sqe inline if possible · 4319a66f

Committed by Jens Axboe on Dec 09, 2019

to #26323578

commit 4a0a7a187453e65bdd24b9ede045b4c36b958868 upstream.

One major use case of linked commands is the ability to run the next
link inline, if at all possible. This is done correctly for async
offload, but somewhere along the line we lost the ability to do so when
we were able to complete a request without having to punt it. Ensure
that we do so correctly.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: don't dynamically allocate poll data · c1f967d5

Committed by Jens Axboe on Dec 09, 2019

to #26323578

commit 392edb45b24337eaa0bc1ecd4e3cf897e662ec61 upstream.

This essentially reverts commit e944475e6984. For high poll ops
workloads, like TAO, the dynamic allocation of the wait_queue
entry for IORING_OP_POLL_ADD adds considerable extra overhead.
Go back to embedding the wait_queue_entry, but keep the usage of
wait->private for the pointer stashing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: deferred send/recvmsg should assign iov · 284c1391

Committed by Jens Axboe on Dec 09, 2019

to #26323578

commit d96885658d9971fc2c752b8699f17a42ef745db6 upstream.

Don't just assign it from the main call path; that can miss the case
when we're called from issue deferral.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: sqthread should grab ctx->uring_lock for submissions · e5b3fb54

Committed by Jens Axboe on Dec 09, 2019

to #26323578

commit 8a4955ff1cca7d4da480774034a16e7c28bafec8 upstream.

We use the mutex to guard against registered file updates, for instance.
Ensure we're safe in accessing that state against concurrent updates.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io-wq: briefly spin for new work after finishing work · 52bb67cc

Committed by Jens Axboe on Dec 07, 2019

to #26323578

commit e995d5123ed433e37a8d63ac528737c912592e3d upstream.

To avoid going to sleep only to get woken shortly thereafter, spin
briefly for new work upon completion of work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io-wq: remove worker->wait waitqueue · 992c567e

Committed by Jens Axboe on Dec 07, 2019

to #26323578

commit 506d95ff5d6aa0a099a116c49d3884e29801d843 upstream.

We only have one case of using the waitqueue to wake the worker; the
rest use wake_up_process(). Since we can save some cycles by not
fiddling with the waitqueue in io_wqe_worker(), switch the work activation
to task wakeup and get rid of the now unused wait_queue_head_t in
struct io_worker.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: allow unbreakable links · e10c16e1

Committed by Jens Axboe on Dec 07, 2019

to #26323578

commit 4e88d6e7793f2f445f43bd608828541d7f43b608 upstream.

Some commands will invariably end in a failure in the sense that the
completion result will be less than zero. One such example is timeouts
that don't have a completion count set; they will always complete with
-ETIME unless cancelled.

For linked commands, we sever links and fail the rest of the chain if
the result is less than zero. Since we have commands where we know that
will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
regardless of the completion result. Note that the link will still sever
if we fail submitting the parent request, hard links are only resilient
in the presence of completion results for requests that did submit
correctly.

Cc: stable@vger.kernel.org # v5.4
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
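
The difference between a regular link and a hard link can be modelled with a small simulation. This is a sketch of the completion-side semantics only; the `demo_` types are hypothetical and it does not use the io_uring API. It also leaves out the submission-failure case, which severs even a hard link.

```c
#include <assert.h>
#include <stddef.h>

enum demo_link { DEMO_LINK_NONE, DEMO_LINK_SOFT, DEMO_LINK_HARD };

struct demo_req {
    int res;              /* completion result; negative means failure */
    enum demo_link link;  /* how this request links to the next one */
};

/*
 * Walk a chain and return how many requests get to run. A failed request
 * severs a soft link, while a hard link keeps the chain going.
 */
static size_t demo_run_chain(const struct demo_req *reqs, size_t n)
{
    size_t ran = 0;

    for (size_t i = 0; i < n; i++) {
        ran++;
        if (reqs[i].link == DEMO_LINK_NONE)
            break;  /* end of chain */
        if (reqs[i].res < 0 && reqs[i].link == DEMO_LINK_SOFT)
            break;  /* soft link severed by the failure */
    }
    return ran;
}
```

For a timeout that completes with -ETIME followed by one more request, the soft-linked chain stops after the timeout, while the hard-linked chain goes on to run the second request.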

io_uring: fix a typo in a comment · c81a55f4

Committed by LimingWu on Dec 05, 2019

to #26323578

commit 0b4295b5e2b9b42f3f3096496fe4775b656c9ba6 upstream.

thatn -> than.
Signed-off-by: Liming Wu <19092205@suning.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: hook all linked requests via link_list · 9e6624b6

Committed by Pavel Begunkov on Dec 05, 2019

to #26323578

commit 4493233edcfc0ad0a7f76f1c83f95b1bcf280547 upstream.

Links are created by chaining requests through req->list with an
exception that head uses req->link_list. (e.g. link_list->list->list)
Because of that, io_req_link_next() needs complex splicing to advance.

Link them all through link_list. Also, it seems to be simpler and more
consistent IMHO.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: fix error handling in io_queue_link_head · 4e09c502

Committed by Pavel Begunkov on Dec 05, 2019

to #26323578

commit 2e6e1fde32d7d41cf076c21060c329d3fdbce25c upstream.

In case of an error, io_submit_sqe() drops a request and continues
without it, even if the request was part of a link. Not only does it
fail to cancel links, it may also execute the wrong sequence of actions.

Stop consuming sqes, and let the user handle errors.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: use hash table for poll command lookups · fc77bd62

Committed by Jens Axboe on Dec 04, 2019

to #26323578

commit 78076bb64aa8ba5b7207c38b2660a9e10ffa8cc7 upstream.

We recently changed this from a single list to an rbtree, but for some
real life workloads, the rbtree slows down the submission/insertion
case enough so that it's the top cycle consumer on the io_uring side.
In testing, using a hash table is a more well rounded compromise. It
is fast for insertion, and as long as it's sized appropriately, it
works well for the cancellation case as well. Running TAO with a lot
of network sockets, this removes io_poll_req_insert() from spending
2% of the CPU cycles.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
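
The trade-off described above (cheap insertion, acceptable lookup for cancellation) can be sketched with a small chained hash table keyed on user_data. All names and the table size here are assumptions for illustration; the kernel uses its own hlist and hashing helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HASH_BITS 4
#define DEMO_HASH_SIZE (1u << DEMO_HASH_BITS)

struct demo_poll_req {
    uint64_t user_data;
    struct demo_poll_req *next;   /* bucket chain */
};

struct demo_poll_table {
    struct demo_poll_req *buckets[DEMO_HASH_SIZE];
};

static unsigned demo_hash(uint64_t key)
{
    /* Fibonacci hashing: multiply by 2^64 / phi and keep the top bits. */
    return (unsigned)((key * 0x9E3779B97F4A7C15ull) >> (64 - DEMO_HASH_BITS));
}

/* Constant-time insertion at the head of the bucket. */
static void demo_poll_insert(struct demo_poll_table *t, struct demo_poll_req *req)
{
    unsigned b = demo_hash(req->user_data);

    req->next = t->buckets[b];
    t->buckets[b] = req;
}

/* Find and unlink a request by user_data; returns NULL if absent. */
static struct demo_poll_req *demo_poll_cancel(struct demo_poll_table *t,
                                              uint64_t user_data)
{
    struct demo_poll_req **pp = &t->buckets[demo_hash(user_data)];

    for (; *pp; pp = &(*pp)->next) {
        if ((*pp)->user_data == user_data) {
            struct demo_poll_req *found = *pp;

            *pp = found->next;
            found->next = NULL;
            return found;
        }
    }
    return NULL;
}
```

Insertion never walks a chain, and cancellation only walks one bucket, which stays short as long as the table is sized for the expected number of outstanding poll requests.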

io_uring: ensure deferred timeouts copy necessary data · bcad288f

Committed by Jens Axboe on Dec 04, 2019

to #26323578

commit 2d28390aff879238f00e209e38c2a0b78717360e upstream.

If we defer a timeout, we should ensure that we copy the timespec
when we have consumed the sqe. This is similar to commit f67676d160c6
for read/write requests. We already did this correctly for timeouts
deferred as links, but do it generally and use the infrastructure added
by commit 1a6b74fc8702 instead of having the timeout deferral use its
own.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT · f629be21

Committed by Jens Axboe on Dec 04, 2019

to #26323578

commit 901e59bba9ddad4bc6994ecb8598ea60a993da4c upstream.

There's really no reason why we forbid things like link/drain etc on
regular timeout commands. Enable the usual SQE flags on timeouts.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: handle connect -EINPROGRESS like -EAGAIN · 11360117

Committed by Jens Axboe on Dec 03, 2019

to #26323578

commit 87f80d623c6c93c721b2aaead8a45e848bc8ffbf upstream.

Right now we return it to userspace, which means the application has
to poll for the socket to be writeable. Let's just treat it like
-EAGAIN and have io_uring handle it internally; this makes it much
easier to use.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: remove parameter ctx of io_submit_state_start · 708b8cda

Committed by Jackie Liu on Dec 02, 2019

to #26323578

commit 22efde5998657f6d1f31592c659aa3a9c7ad65f1 upstream.

The ctx parameter has never been used; clean it up.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: mark us with IORING_FEAT_SUBMIT_STABLE · b3b5cb38

Committed by Jens Axboe on Dec 02, 2019

to #26323578

commit da8c96906990f1108cb626ee7865e69267a3263b upstream.

If this flag is set, applications can be certain that any data for
async offload has been consumed when the kernel has consumed the
SQE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: ensure async punted connect requests copy data · 62753e93

Committed by Jens Axboe on Dec 02, 2019

to #26323578

commit f499a021ea8c9f70321fce3d674d8eca5bbeee2c upstream.

Just like commit f67676d160c6 for read/write requests, this one ensures
that the sockaddr data has been copied for IORING_OP_CONNECT if we need
to punt the request to async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: ensure async punted sendmsg/recvmsg requests copy data · 43b411d0

Committed by Jens Axboe on Dec 02, 2019

to #26323578

commit 03b1230ca12a12e045d83b0357792075bf94a1e0 upstream.

Just like commit f67676d160c6 for read/write requests, this one ensures
that the msghdr data is fully copied if we need to punt a recvmsg or
sendmsg system call to async context.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

net: disallow ancillary data for __sys_{send,recv}msg_file() · dbc2b5b9

Committed by Jens Axboe on Nov 25, 2019

to #26323578

commit d69e07793f891524c6bbf1e75b9ae69db4450953 upstream.

Only io_uring uses (and added) these, and we want to disallow the
use of sendmsg/recvmsg for anything but regular data transfers.
Use the newly added prep helper to split the msghdr copy out from
the core function, to check for msg_control and msg_controllen
settings. If either is set, we return -EINVAL.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
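
The validation described above amounts to a check on two msghdr fields. The sketch below uses the userspace `struct msghdr` from `<sys/socket.h>`; the function name is hypothetical and the kernel operates on its own copy of the header.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Reject any msghdr that carries ancillary (control) data. */
static int demo_check_no_cmsg(const struct msghdr *msg)
{
    if (msg->msg_control || msg->msg_controllen)
        return -EINVAL;
    return 0;
}
```

A header with either field set is refused before any data transfer is attempted.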

net: separate out the msghdr copy from ___sys_{send,recv}msg() · 15ec0cd5

Committed by Jens Axboe on Nov 25, 2019

to #26323578

commit 4257c8ca13b084550574b8c9a667d9c90ff746eb upstream.

This is in preparation for enabling the io_uring helpers for sendmsg
and recvmsg to first copy the header for validation before continuing
with the operation.

There should be no functional changes in this patch.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: ensure async punted read/write requests copy iovec · 99130197

Committed by Jens Axboe on Dec 02, 2019

to #26323578

commit f67676d160c6ee2ed82917fadfed6d29cab8237c upstream.

Currently we don't copy the iovecs when we punt to async context. This
can be problematic for applications that store the iovec on the stack,
as they often assume that it's safe to let the iovec go out of scope
as soon as IO submission has been called. This isn't always safe, as we
will re-copy the iovec once we're in async context.

Make this 100% safe by copying the iovec just once. With this change,
applications may safely store the iovec on the stack for all cases.
Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
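
The idea of copying the iovec once at submission time can be illustrated in userspace. This is a minimal sketch with a hypothetical helper name; it is not how the kernel imports iovecs, only a demonstration of why the copy makes a stack-allocated array safe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Duplicate the caller's iovec array so a deferred request no longer
 * depends on the caller's (possibly stack-allocated) copy. Only the
 * array of base/length pairs is copied; the data buffers themselves
 * must stay valid until the request completes.
 * Returns NULL on allocation failure or when nr is 0.
 */
static struct iovec *demo_copy_iovec(const struct iovec *iov, size_t nr)
{
    struct iovec *copy;

    if (nr == 0)
        return NULL;
    copy = calloc(nr, sizeof(*copy));
    if (!copy)
        return NULL;
    memcpy(copy, iov, nr * sizeof(*copy));
    return copy;
}
```

After the copy, the caller may let its own array go out of scope; the deferred work reads only the duplicate, which it frees when done.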

io_uring: add general async offload context · 0876c718

Committed by Jens Axboe on Dec 02, 2019

to #26323578

commit 1a6b74fc87024db59d41cd7346bd437f20fb3e2d upstream.

Right now we just copy the sqe for async offload, but we want to store
more context across an async punt. In preparation for doing so, put the
sqe copy inside a structure that we can expand. With this pointer added,
we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether
req->io is NULL or not.

No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: transform send/recvmsg() -ERESTARTSYS to -EINTR · 30d4a88e

Committed by Jens Axboe on Dec 02, 2019

to #26323578

commit 441cdbd5449b4923cd413d3ba748124f91388be9 upstream.

We should never return -ERESTARTSYS to userspace, transform it into
-EINTR.

Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
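
The transformation itself is a one-line mapping. In the sketch below, `DEMO_ERESTARTSYS` mirrors the kernel-internal ERESTARTSYS value (512), which userspace headers do not define; the function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal restart code; defined locally because userspace lacks it. */
#define DEMO_ERESTARTSYS 512

/* Map the internal restart code to -EINTR; pass everything else through. */
static int demo_fixup_msg_result(int ret)
{
    return ret == -DEMO_ERESTARTSYS ? -EINTR : ret;
}
```

Successful byte counts and other error codes are left unchanged.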

io_uring: fix missing kmap() declaration on powerpc · fafbb74f

Committed by Jens Axboe on Nov 29, 2019

to #26323578

commit aa4c3967756c6c576a38a23ac511be211462a6b7 upstream.

Christophe reports that current master fails building on powerpc with
this error:

   CC      fs/io_uring.o
fs/io_uring.c: In function ‘loop_rw_iter’:
fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
[-Werror=implicit-function-declaration]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                      ^
fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
without a cast [-Wint-conversion]
     iovec.iov_base = kmap(iter->bvec->bv_page)
                    ^
fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
[-Werror=implicit-function-declaration]
     kunmap(iter->bvec->bv_page);
     ^

which is caused by a missing highmem.h include. Fix it by including
it.

Fixes: 311ae9e159d8 ("io_uring: fix dead-hung for non-iter fixed rw")
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: add mapping support for NOMMU archs · b1d8f472

Committed by Roman Penyaev on Nov 28, 2019

to #26323578

commit 6c5c240e412682f97aecd233c1e706822704aa28 upstream.

This is a somewhat unusual scenario, but I find it interesting to run fio
loads using LKL Linux, where the MMU is disabled. Other real archs that
run uClinux can probably also benefit from this patch.
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: make poll->wait dynamically allocated · bf7088f1

Committed by Jens Axboe on Nov 26, 2019

to #26323578

commit e944475e69849273ca8f1fe04a3ce81b5901d165 upstream.

In the quest to bring io_kiocb down to 3 cachelines, this one does
the trick. Make the wait_queue_entry for the poll command come out
of kmalloc instead of embedding it in struct io_poll_iocb, as the
latter is the largest member of io_kiocb. Once we trim this down a
bit, we're back at a healthy 192 bytes for struct io_kiocb.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io-wq: shrink io_wq_work a bit · e7680975

Committed by Jens Axboe on Nov 26, 2019

to #26323578

commit 6206f0e180d4eddc0a178f57120ab1b913701f6e upstream.

Currently we're using 40 bytes for the io_wq_work structure, and 16 of
those are the doubly linked list node. We don't need doubly linked lists;
we always add to the tail to keep things ordered, and any other use case
is list traversal with deletion. For the deletion case, we can easily
support any node deletion by keeping track of the previous entry.

This shrinks io_wq_work to 32 bytes, and subsequently io_kiocb in
io_uring from 216 to 208 bytes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
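
The list shape described above, a singly linked list with a tail pointer where deletion is given the previous entry, can be sketched like this. The `demo_` names are hypothetical and do not match the kernel's identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* One pointer per node instead of two: 8 bytes saved on 64-bit. */
struct demo_node {
    struct demo_node *next;
};

struct demo_list {
    struct demo_node *first;
    struct demo_node *last;
};

/* Append at the tail to preserve submission order. */
static void demo_list_add_tail(struct demo_list *l, struct demo_node *n)
{
    n->next = NULL;
    if (l->last)
        l->last->next = n;
    else
        l->first = n;
    l->last = n;
}

/* Unlink n; prev is the node before it, or NULL when n is the head. */
static void demo_list_del(struct demo_list *l, struct demo_node *n,
                          struct demo_node *prev)
{
    if (prev)
        prev->next = n->next;
    else
        l->first = n->next;
    if (l->last == n)
        l->last = prev;
    n->next = NULL;
}
```

A traversal that may delete entries simply carries the previous node along as it walks, so no back pointer is needed.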

io-wq: fix handling of NUMA node IDs · 4e2bc1f5

Committed by Jann Horn on Nov 26, 2019

to #26323578

commit 3fc50ab559f5ae400aa33bd0836b3602da7fa51b upstream.

There are several things that can go wrong in the current code on NUMA
systems, especially if not all nodes are online all the time:

 - If the identifiers of the online nodes do not form a single contiguous
   block starting at zero, wq->wqes will be too small, and OOB memory
   accesses will occur e.g. in the loop in io_wq_create().
 - If a node comes online between the call to num_online_nodes() and the
   for_each_node() loop in io_wq_create(), an OOB write will occur.
 - If a node comes online between io_wq_create() and io_wq_enqueue(), a
   lookup is performed for an element that doesn't exist, and an OOB read
   will probably occur.

Fix it by:

 - using nr_node_ids instead of num_online_nodes() for the allocation size;
   nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the
   highest node ID that could possibly come online at some point, even if
   those nodes' identifiers are not a contiguous block
 - creating workers for all possible CPUs, not just all online ones

This is basically what the normal workqueue code also does, as far as I can
tell.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
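
The first failure mode listed above comes from sizing an ID-indexed table by the number of entries instead of by the largest ID. The sketch below models that with plain integers; both function names are hypothetical, and the second function only approximates what nr_node_ids provides in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Buggy sizing: assumes the IDs are exactly 0..n-1. */
static size_t demo_table_size_by_count(const int *ids, size_t n)
{
    (void)ids;
    return n;
}

/* Safe sizing: one slot for every ID up to the largest one. */
static size_t demo_table_size_by_max_id(const int *ids, size_t n)
{
    int max_id = -1;

    for (size_t i = 0; i < n; i++)
        if (ids[i] > max_id)
            max_id = ids[i];
    return (size_t)(max_id + 1);
}
```

With node IDs 0 and 2, counting gives a table of two slots, so indexing by ID 2 would run past the end; sizing by the largest ID gives three slots.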

io_uring: use kzalloc instead of kcalloc for single-element allocations · b6fcf21d

Committed by Jann Horn on Nov 26, 2019

to #26323578

commit ad6e005ca68de7af76f9ed3e4c9b6f0aa2f842e3 upstream.

These allocations are single-element allocations, so don't use the array
allocation wrapper for them.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: cleanup io_import_fixed() · e7436342

Committed by Pavel Begunkov on Nov 25, 2019

to #26323578

commit 7d009165550adc64e3561c65ecce564125052e00 upstream.

Clean up the io_import_fixed() call site and make it return the proper type.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: inline struct sqe_submit · 505210f7

Committed by Pavel Begunkov on Nov 25, 2019

to #26323578

commit cf6fd4bd559ee61a4454b161863c8de6f30f8dca upstream.

There is no point left in keeping struct sqe_submit. Inline it
into struct io_kiocb, so any req->submit.field is now just req->field.

- moves initialisation of ring_file into io_get_req()
- removes duplicated req->sequence.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: store timeout's sqe->off in proper place · d18da430

Committed by Pavel Begunkov on Nov 25, 2019

to #26323578

commit cc42e0ac17d3664a70e020dfe7897f14e7aa7453 upstream.

Timeouts' sequence offset (i.e. sqe->off) is stored in
req->submit.sequence under a false name. Keep it in timeout.data
instead. The unused space for sequence will be reclaimed in the
following patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: remove superfluous check for sqe->off in io_accept() · e8da7bb3

Committed by Hrvoje Zeba on Nov 25, 2019

to #26323578

commit 8042d6ce8c40df0abb0d91662a754d074a3d3f16 upstream.

This field contains a pointer to addrlen and checking to see if it's set
returns -EINVAL if the caller sets addr & addrlen pointers.

Fixes: 17f2fe35d080 ("io_uring: add support for IORING_OP_ACCEPT")
Signed-off-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: fix dead-hung for non-iter fixed rw · aa572ac1

Committed by Pavel Begunkov on Nov 24, 2019

to #26323578

commit 311ae9e159d81a1ec1cf645daf40b39ae5a0bd84 upstream.

Read/write requests to devices without implemented read/write_iter
using fixed buffers can cause a general protection fault, which
completely hangs the machine.

io_import_fixed() initialises the iov_iter with a bvec, but loop_rw_iter()
accesses it as an iovec, dereferencing a random address.

kmap() page by page in this case.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: add support for IORING_OP_CONNECT · 78e9fdaa

Committed by Jens Axboe on Nov 23, 2019

to #26323578

commit f8e85cf255ad57d65eeb9a9d0e59e3dec55bdd9e upstream.

This allows an application to call connect() in an async fashion. Like
other opcodes, we first try a non-blocking connect, then punt to async
context if we have to.

Note that we can still return -EINPROGRESS, and in that case the caller
should use IORING_OP_POLL_ADD to do an async wait for completion of the
connect request (just like for regular connect(2), except we can do it
async here too).
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
