提交 · 258b6eed3d0c672eb75c28a3837d807ee50caf9b · openanolis / cloud-kernel

28 5月, 2020 40 次提交

io_uring: fix use-after-free by io_cleanup_req() · 258b6eed

由 Pavel Begunkov 提交于 2月 19, 2020

to #26323588

commit 929a3af90f0f4bd7132d83552c1a98c83f60ef7e upstream.

io_cleanup_req() should be called before req->io is freed, and so
shouldn't be after __io_free_req() -> __io_req_aux_free(). Also,
it will be ignored for in io_free_req_many(), which use
__io_req_aux_free().

Place cleanup_req() into __io_req_aux_free().

Fixes: 99bc4c38537d774 ("io_uring: fix iovec leaks")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

258b6eed

io_uring: remove unnecessary NULL checks · b35af4a7

由 Dan Carpenter 提交于 2月 17, 2020

to #26323588

commit 297a31e3e8318f533cff4fe33ffaefb74f72c6e2 upstream.

The "kmsg" pointer can't be NULL and we have already dereferenced it so
a check here would be useless.
Reviewed-by: NStefano Garzarella <sgarzare@redhat.com>
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

b35af4a7

io_uring: add missing io_req_cancelled() · 26f5d3c5

由 Pavel Begunkov 提交于 2月 16, 2020

to #26323588

commit 7fbeb95d0f68e21e6ca61284f1ac681630976947 upstream.

fallocate_finish() is missing cancellation check. Add it.
It's safe to do that, as only flags setup and sqe fields copy are done
before it gets into __io_fallocate().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

26f5d3c5

io_uring: prune request from overflow list on flush · 17b038f1

由 Jens Axboe 提交于 2月 13, 2020

to #26323588

commit 2ca10259b4189a433c309054496dd6af1415f992 upstream.

Carter reported an issue where he could produce a stall on ring exit,
when we're cleaning up requests that match the given file table. For
this particular test case, a combination of a few things caused the
issue:

- The cq ring was overflown
- The request being canceled was in the overflow list

The combination of the above means that the cq overflow list holds a
reference to the request. The request is canceled correctly, but since
the overflow list holds a reference to it, the final put won't happen.
Since the final put doesn't happen, the request remains in the inflight.
Hence we never finish the cancelation flush.

Fix this by removing requests from the overflow list if we're canceling
them.

Cc: stable@vger.kernel.org # 5.5
Reported-by: NCarter Li 李通洲 <carter.li@eoitek.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

17b038f1

io-wq: don't call kXalloc_node() with non-online node · 5e855d46

由 Jens Axboe 提交于 2月 11, 2020

to #26323588

commit 7563439adfae153b20331f1567c8b5d0e5cbd8a7 upstream.

Glauber reports a crash on init on a box he has:

 RIP: 0010:__alloc_pages_nodemask+0x132/0x340
 Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
 RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
 RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002
 R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000
 R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0
 FS:  00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  alloc_slab_page+0x46/0x320
  new_slab+0x9d/0x4e0
  ___slab_alloc+0x507/0x6a0
  ? io_wq_create+0xb4/0x2a0
  __slab_alloc+0x1c/0x30
  kmem_cache_alloc_node_trace+0xa6/0x260
  io_wq_create+0xb4/0x2a0
  io_uring_setup+0x97f/0xaa0
  ? io_remove_personalities+0x30/0x30
  ? io_poll_trigger_evfd+0x30/0x30
  do_syscall_64+0x5b/0x1c0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f4d116cb1ed

which is due to the 'wqe' and 'worker' allocation being node affine.
But it isn't valid to call the node affine allocation if the node isn't
online.

Setup structures for even offline nodes, as usual, but skip them in
terms of thread setup to not waste resources. If the node isn't online,
just alloc memory with NUMA_NO_NODE.
Reported-by: NGlauber Costa <glauber@scylladb.com>
Tested-by: NGlauber Costa <glauber@scylladb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

5e855d46

io_uring: retain sockaddr_storage across send/recvmsg async punt · c03336ef

由 Jens Axboe 提交于 2月 09, 2020

to #26323588

commit b537916ca5107c3a8714b8ab3099c0ec205aec12 upstream.

Jonas reports that he sometimes sees -97/-22 error returns from
sendmsg, if it gets punted async. This is due to not retaining the
sockaddr_storage between calls. Include that in the state we copy when
going async.

Cc: stable@vger.kernel.org # 5.3+
Reported-by: NJonas Bonn <jonas@norrbonn.se>
Tested-by: NJonas Bonn <jonas@norrbonn.se>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

c03336ef

io_uring: cancel pending async work if task exits · e9fd2309

由 Jens Axboe 提交于 2月 08, 2020

to #26323588

commit 6ab231448fdc5e37c15a94a4700fca11e80007f7 upstream.

Normally we cancel all work we track, but for untracked work we could
leave the async worker behind until that work completes. This is totally
fine, but does leave resources pending after the task is gone until that
work completes.

Cancel work that this task queued up when it goes away.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

e9fd2309

io-wq: add io_wq_cancel_pid() to cancel based on a specific pid · 40270319

由 Jens Axboe 提交于 2月 08, 2020

to #26323588

commit 36282881a795cbf717aca79392ae9cdf0fef59c9 upstream.

Add a helper that allows the caller to cancel work based on what mm
it belongs to. This allows io_uring to cancel work from a given
task or thread when it exits.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

40270319

io-wq: make io_wqe_cancel_work() take a match handler · 89768418

由 Jens Axboe 提交于 2月 08, 2020

to #26323588

commit 00bcda13dcbf6bf7fa6f2a5886dd555362de8cfa upstream.

We want to use the cancel functionality for canceling based on not
just the work itself. Instead of matching on the work address
manually, allow a match handler to tell us if we found the right work
item or not.

No functional changes in this patch.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

89768418

io_uring: fix openat/statx's filename leak · 97f6378d

由 Pavel Begunkov 提交于 2月 08, 2020

to #26323588

commit 0bdbdd08a8f991bdaee54465a168c0795ea5d28b upstream.

As in the previous patch, make openat*_prep() and statx_prep() handle
double preparation to avoid resource leakage.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

97f6378d

io_uring: fix double prep iovec leak · 8b3d1de0

由 Pavel Begunkov 提交于 2月 08, 2020

to #26323588

commit 5f798beaf35d79355cbf18019c1993a84475a2c3 upstream.

Requests may be prepared multiple times with ->io allocated (i.e. async
prepared). Preparation functions don't handle it and forget about
previously allocated resources. This may happen in case of:
- spurious defer_check
- non-head (i.e. async prepared) request executed in sync (via nxt).

Make the handlers check, whether they already allocated resources, which
is true IFF REQ_F_NEED_CLEANUP is set.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

8b3d1de0

io_uring: fix async close() with f_op->flush() · 4a6b66dd

由 Pavel Begunkov 提交于 2月 08, 2020

to #26323588

commit a93b33312f63ef6d5997f45d6fdf4de84c5396cc upstream.

First, io_close() misses filp_close() and io_cqring_add_event(), when
f_op->flush is defined. That's because in this case it will
io_queue_async_work() itself not grabbing files, so the corresponding
chunk in io_close_finish() won't be executed.

Second, when submitted through io_wq_submit_work(), it will do
filp_close() and *_add_event() twice: first inline in io_close(),
and the second one in call to io_close_finish() from io_close().
The second one will also fire, because it was submitted async through
generic path, and so have grabbed files.

And the last nice thing is to remove this weird pilgrimage with checking
work/old_work and casting it to nxt. Just use a helper instead.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

4a6b66dd

io_uring: allow AT_FDCWD for non-file openat/openat2/statx · 083d6337

由 Jens Axboe 提交于 2月 06, 2020

to #26323588

commit 0b5faf6ba7fb78bb1fe7336d23ea1978386a6c3a upstream.

Don't just check for dirfd == -1, we should allow AT_FDCWD as well for
relative lookups.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

083d6337

io_uring: grab ->fs as part of async preparation · ddf405a0

由 Jens Axboe 提交于 2月 07, 2020

to #26323588

commit ff002b30181d30cdfbca316dadd099c3ca0d739c upstream.

This passes it in to io-wq, so it assumes the right fs_struct when
executing async work that may need to do lookups.

Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

ddf405a0

io-wq: add support for inheriting ->fs · 357f8607

由 Jens Axboe 提交于 2月 06, 2020

to #26323588

commit 9392a27d88b9707145d713654eb26f0c29789e50 upstream.

Some work items need this for relative path lookup, make it available
like the other inherited credentials/mm/etc.

Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

357f8607

io_uring: retry raw bdev writes if we hit -EOPNOTSUPP · 50df07cb

由 Jens Axboe 提交于 2月 07, 2020

to #26323588

commit faac996ccd5da95bc56b91aa80f2643c2d0a1c56 upstream.

For non-blocking issue, we set IOCB_NOWAIT in the kiocb. However, on a
raw block device, this yields an -EOPNOTSUPP return, as non-blocking
writes aren't supported. Turn this -EOPNOTSUPP into -EAGAIN, so we retry
from blocking context with IOCB_NOWAIT cleared.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

50df07cb

io_uring: add cleanup for openat()/statx() · 8c3e80dd

由 Pavel Begunkov 提交于 2月 07, 2020

to #26323588

commit 8fef80bf56a49c60b457dedb99fd6c5279a5dbe1 upstream.

openat() and statx() may have allocated ->open.filename, which should be
be put. Add cleanup handlers for them.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

8c3e80dd

io_uring: fix iovec leaks · 36d6c6ed

由 Pavel Begunkov 提交于 2月 07, 2020

to #26323588

commit 99bc4c38537d774e667d043c520914082da19abf upstream.

Allocated iovec is freed only in io_{read,write,send,recv)(), and just
leaves it if an error occured. There are plenty of such cases:
- cancellation of non-head requests
- fail grabbing files in __io_queue_sqe()
- set REQ_F_NOWAIT and returning in __io_queue_sqe()

Add REQ_F_NEED_CLEANUP, which will force such requests with custom
allocated resourses go through cleanup handlers on put.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

36d6c6ed

io_uring: remove unused struct io_async_open · 5d043bb1

由 Pavel Begunkov 提交于 2月 07, 2020

to #26323588

commit e96e977992d0ea40b6e70cb63dede85c9078e744 upstream.

struct io_async_open is unused, remove it.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

5d043bb1

io_uring: flush overflowed CQ events in the io_uring_poll() · 644813a9

由 Stefano Garzarella 提交于 2月 07, 2020

to #26323588

commit 63e5d81f72af1bf370bf8a6745b0a8d71a7bb37d upstream.

In io_uring_poll() we must flush overflowed CQ events before to
check if there are CQ events available, to avoid missing events.

We call the io_cqring_events() that checks and flushes any overflow
and returns the number of CQ events available.
Signed-off-by: NStefano Garzarella <sgarzare@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

644813a9

io_uring: statx/openat/openat2 don't support fixed files · c51ce185

由 Jens Axboe 提交于 2月 06, 2020

to #26323588

commit cf3040ca55f2085b0a372a620ee2cb93ae19b686 upstream.

All of these opcodes take a directory file descriptor. We can't easily
support fixed files for these operations, and the use case for that
probably isn't all that clear (or sensible) anyway.

Disable IOSQE_FIXED_FILE for these operations.
Reported-by: NStefan Metzmacher <metze@samba.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

c51ce185

io_uring: fix deferred req iovec leak · 343ae230

由 Pavel Begunkov 提交于 2月 06, 2020

to #26323588

commit 1e95081cb5b4cf77065d37866f57cf3c90a3df78 upstream.

After defer, a request will be prepared, that includes allocating iovec
if needed, and then submitted through io_wq_submit_work() but not custom
handler (e.g. io_rw_async()/io_sendrecv_async()). However, it'll leak
iovec, as it's in io-wq and the code goes as follows:

io_read() {
	if (!io_wq_current_is_worker())
		kfree(iovec);
}

Put all deallocation logic in io_{read,write,send,recv}(), which will
leave the memory, if going async with -EAGAIN.

It also fixes a leak after failed io_alloc_async_ctx() in
io_{recv,send}_msg().

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

343ae230

io_uring: fix 1-bit bitfields to be unsigned · ced6d31b

由 Randy Dunlap 提交于 2月 05, 2020

to #26323588

commit e1d85334d62386e9503e4a0d5d022e2d8e0011a0 upstream.

Make bitfields of size 1 bit be unsigned (since there is no room
for the sign bit).
This clears up the sparse warnings:

  CHECK   ../fs/io_uring.c
../fs/io_uring.c:207:50: error: dubious one-bit signed bitfield
../fs/io_uring.c:208:55: error: dubious one-bit signed bitfield
../fs/io_uring.c:209:63: error: dubious one-bit signed bitfield
../fs/io_uring.c:210:54: error: dubious one-bit signed bitfield
../fs/io_uring.c:211:57: error: dubious one-bit signed bitfield

Found by sight and then verified with sparse.

Fixes: 69b3e546139a ("io_uring: change io_ring_ctx bool fields into bit fields")
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

ced6d31b

io_uring: get rid of delayed mm check · e181f2f8

由 Pavel Begunkov 提交于 2月 06, 2020

to #26323588

commit 1cb1edb2f5ba8a3e8d47ded391007c6fe3ac0ad7 upstream.

Fail fast if can't grab mm, so past that requests always have an mm
when required. This allows us to remove req->user altogether.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

e181f2f8

io_uring: cleanup fixed file data table references · 93ba78fd

由 Jens Axboe 提交于 2月 04, 2020

to #26323588

commit 2faf852d1be8a4960d328492298da6448cca0279 upstream.

syzbot reports a use-after-free in io_ring_file_ref_switch() when it
tries to switch back to percpu mode. When we put the final reference to
the table by calling percpu_ref_kill_and_confirm(), we don't want the
zero reference to queue async work for flushing the potentially queued
up items. We currently do a few flush_work(), but they merely paper
around the issue, since the work item may not have been queued yet
depending on the when the percpu-ref callback gets run.

Coming into the file unregister, we know we have the ring quiesced.
io_ring_file_ref_switch() can check for whether or not the ref is dying
or not, and not queue anything async at that point. Once the ref has
been confirmed killed, flush any potential items manually.

Reported-by: syzbot+7caeaea49c2c8a591e3d@syzkaller.appspotmail.com
Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

93ba78fd

io_uring: spin for sq thread to idle on shutdown · 74b46d93

由 Jens Axboe 提交于 2月 04, 2020

to #26323588

commit df069d80c8e38c19531c392322e9a16617475c44 upstream.

As part of io_uring shutdown, we cancel work that is pending and won't
necessarily complete on its own. That includes requests like poll
commands and timeouts.

If we're using SQPOLL for kernel side submission and we shutdown the
ring immediately after queueing such work, we can race with the sqthread
doing the submission. This means we may miss cancelling some work, which
results in the io_uring shutdown hanging forever.

Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

74b46d93

io_uring: put the flag changing code in the same spot · a7c70de9

由 Pavel Begunkov 提交于 2月 01, 2020

to #26323588

commit 3e577dcd73a1fdc641bf45e5ea4a37869de221b5 upstream.

Both iocb_flags() and kiocb_set_rw_flags() are inline and modify
kiocb->ki_flags. Place them close, so they can be potentially better
optimised.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

a7c70de9

io_uring: iterate req cache backwards · f80ed44f

由 Pavel Begunkov 提交于 2月 01, 2020

to #26323588

commit 6c8a31346925cbb373f84a18428ab3df59d3950e upstream.

Grab requests from cache-array from the end, so can get by only
free_reqs.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

f80ed44f

io_uring: punt even fadvise() WILLNEED to async context · a8af11c5

由 Jens Axboe 提交于 2月 01, 2020

to #26323588

commit 3e69426da2599677ebbe76e2d97a606c4797bd74 upstream.

Andres correctly points out that read-ahead can block, if it needs to
read in meta data (or even just through the page cache page allocations).
Play it safe for now and just ensure WILLNEED is also punted to async
context.

While in there, allow the file settings hints from non-blocking
context. They don't need to start/do IO, and we can safely do them
inline.

Fixes: 4840e418c2fc ("io_uring: add IORING_OP_FADVISE")
Reported-by: NAndres Freund <andres@anarazel.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

a8af11c5

io_uring: fix sporadic double CQE entry for close · f809fef1

由 Jens Axboe 提交于 1月 31, 2020

to #26323588

commit 1a417f4e618e05fba29ba222f1e8555c302376ce upstream.

We punt close to async for the final fput(), but we log the completion
even before that even in that case. We rely on the request not having
a files table assigned to detect what the final async close should do.
However, if we punt the async queue to __io_queue_sqe(), we'll get
->files assigned and this makes io_close_finish() think it should both
close the filp again (which does no harm) AND log a new CQE event for
this request. This causes duplicate CQEs.

Queue the request up for async manually so we don't grab files
needlessly and trigger this condition.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

f809fef1

io_uring: remove extra ->file check · e0280e19

由 Pavel Begunkov 提交于 2月 01, 2020

to #26323588

commit 9250f9ee194dc3dcee28a42a1533fa2cc0edd215 upstream.

It won't ever get into io_prep_rw() when req->file haven't been set in
io_req_set_file(), hence remove the check.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

e0280e19

io_uring: don't map read/write iovec potentially twice · 4bde6717

由 Jens Axboe 提交于 1月 31, 2020

to #26323588

commit 5d204bcfa09330972ad3428a8f81c23f371d3e6d upstream.

If we have a read/write that is deferred, we already setup the async IO
context for that request, and mapped it. When we later try and execute
the request and we get -EAGAIN, we don't want to attempt to re-map it.
If we do, we end up with garbage in the iovec, which typically leads
to an -EFAULT or -EINVAL completion.

Cc: stable@vger.kernel.org # 5.5
Reported-by: NDan Melnic <dmm@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

4bde6717

io_uring: use the proper helpers for io_send/recv · a44f9077

由 Jens Axboe 提交于 1月 31, 2020

to #26323588

commit 0b7b21e42ba2d6ac9595a4358a9354249605a3af upstream.

Don't use the recvmsg/sendmsg helpers, use the same helpers that the
recv(2) and send(2) system calls use.
Reported-by: N李通洲 <carter.li@eoitek.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

a44f9077

io_uring: prevent potential eventfd recursion on poll · 0c6f7422

由 Jens Axboe 提交于 2月 01, 2020

to #26323588

commit f0b493e6b9a8959356983f57112229e69c2f7b8c upstream.

If we have nested or circular eventfd wakeups, then we can deadlock if
we run them inline from our poll waitqueue wakeup handler. It's also
possible to have very long chains of notifications, to the extent where
we could risk blowing the stack.

Check the eventfd recursion count before calling eventfd_signal(). If
it's non-zero, then punt the signaling to async context. This is always
safe, as it takes us out-of-line in terms of stack and locking context.

Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

0c6f7422

eventfd: track eventfd_signal() recursion depth · 0c570e16

由 Jens Axboe 提交于 2月 02, 2020

to #26323588

commit b5e683d5cab8cd433b06ae178621f083cabd4f63 upstream.

eventfd use cases from aio and io_uring can deadlock due to circular
or resursive calling, when eventfd_signal() tries to grab the waitqueue
lock. On top of that, it's also possible to construct notification
chains that are deep enough that we could blow the stack.

Add a percpu counter that tracks the percpu recursion depth, warn if we
exceed it. The counter is also exposed so that users of eventfd_signal()
can do the right thing if it's non-zero in the context where it is
called.

Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

0c570e16

io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe · 955781d9

由 Stefan Metzmacher 提交于 1月 29, 2020

to #26323588

commit d7f62e825fd19202a0749d10fb439714c51f67d2 upstream.

With nesting of anonymous unions and structs it's hard to
review layout changes. It's better to ask the compiler
for these things.
Signed-off-by: NStefan Metzmacher <metze@samba.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

955781d9

io_uring: add ->show_fdinfo() for the io_uring file descriptor · 80f00261

由 Jens Axboe 提交于 1月 30, 2020

to #26323588

commit 87ce955b24c9940cb2ca7e5173fcf175578d9fe9 upstream.

It can be hard to know exactly what is registered with the ring.
Especially for credentials, it'd be handy to be able to see which
ones are registered, what personalities they have, and what the ID
of each of them is.

This adds support for showing information registered in the ring from
the fdinfo of the io_uring fd. Here's an example from a test case that
registers 4 files (two of them sparse), 4 buffers, and 2 personalities:

pos:	0
flags:	02000002
mnt_id:	14
UserFiles:	4
    0: file-no-1
    1: file-no-2
    2: <none>
    3: <none>
UserBufs:	4
    0: 0x563817c46000/128
    1: 0x563817c47000/256
    2: 0x563817c48000/512
    3: 0x563817c49000/1024
Personalities:
    1
	Uid:	0		0		0		0
	Gid:	0		0		0		0
	Groups:	0
	CapEff:	0000003fffffffff
    2
	Uid:	0		0		0		0
	Gid:	0		0		0		0
	Groups:	0
	CapEff:	0000003fffffffff
Suggested-by: NJann Horn <jannh@google.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

80f00261

io_uring: add support for epoll_ctl(2) · c9166992

由 Jens Axboe 提交于 1月 08, 2020

to #26323588

commit 3e4827b05d2ac2d377ed136a52829ec46787bf4b upstream.

This adds IORING_OP_EPOLL_CTL, which can perform the same work as the
epoll_ctl(2) system call.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

c9166992

eventpoll: support non-blocking do_epoll_ctl() calls · 73681652

由 Jens Axboe 提交于 1月 08, 2020

to #26323588

commit 39220e8d4a2aaab045ea03cc16d737e85d0817bf upstream.

Also make it available outside of epoll, along with the helper that
decides if we need to copy the passed in epoll_event.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

73681652

eventpoll: abstract out epoll_ctl() handler · 10702bb9

由 Jens Axboe 提交于 1月 08, 2020

to #26323588

commit 58e41a44c488f3e9601fd8150f58377ef8f44889 upstream.

No functional changes in this patch.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

10702bb9

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功