提交 · ef4ff581102a917a69877feca2e5347e2f3e458c · openeuler / Kernel

12 4月, 2020 5 次提交

io_uring: move all request init code in one place · ef4ff581

由 Pavel Begunkov 提交于 4月 12, 2020

Requests initialisation is scattered across several functions, namely
io_init_req(), io_submit_sqes(), io_submit_sqe(). Put it
in io_init_req() for better data locality and code clarity.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ef4ff581

io_uring: keep all sqe->flags in req->flags · dea3b49c

由 Pavel Begunkov 提交于 4月 12, 2020

It's a good idea to not read sqe->flags twice, as it's prone to security
bugs. Instead of passing it around, embeed them in req->flags. It's
already so except for IOSQE_IO_LINK.
1. rename former REQ_F_LINK -> REQ_F_LINK_HEAD
2. introduce and copy REQ_F_LINK, which mimics IO_IOSQE_LINK

And leave req_set_fail_links() using new REQ_F_LINK, because it's more
sensible.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

dea3b49c

io_uring: early submission req fail code · 1d4240cc

由 Pavel Begunkov 提交于 4月 12, 2020

Having only one place for cleaning up a request after a link assembly/
submission failure will play handy in the future. At least it allows
to remove duplicated cleanup sequence.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

1d4240cc

io_uring: track mm through current->mm · bf9c2f1c

由 Pavel Begunkov 提交于 4月 12, 2020

As a preparation for extracting request init bits, remove self-coded mm
tracking from io_submit_sqes(), but rely on current->mm. It's more
convenient, than passing this piece of state in other functions.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bf9c2f1c

io_uring: remove obsolete @mm_fault · dccc587f

由 Pavel Begunkov 提交于 4月 12, 2020

If io_submit_sqes() can't grab an mm, it fails and exits right away.
There is no need to track the fact of the failure. Remove @mm_fault.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

dccc587f

10 4月, 2020 1 次提交

io_uring: punt final io_ring_ctx wait-and-free to workqueue · 85faa7b8

由 Jens Axboe 提交于 4月 09, 2020

We can't reliably wait in io_ring_ctx_wait_and_kill(), since the
task_works list isn't ordered (in fact it's LIFO ordered). We could
either fix this with a separate task_works list for io_uring work, or
just punt the wait-and-free to async context. This ensures that
task_work that comes in while we're shutting down is processed
correctly. If we don't go async, we could have work past the fput()
work for the ring that depends on work that won't be executed until
after we're done with the wait-and-free. But as this operation is
blocking, it'll never get a chance to run.

This was reproduced with hundreds of thousands of sockets running
memcached, haven't been able to reproduce this synthetically.
Reported-by: NDan Melnic <dmm@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

85faa7b8

09 4月, 2020 1 次提交

io_uring: fix fs cleanup on cqe overflow · c398ecb3

由 Pavel Begunkov 提交于 4月 09, 2020

If completion queue overflow occurs, __io_cqring_fill_event() will
update req->cflags, which is in a union with req->work and happens to
be aliased to req->work.fs. Following io_free_req() ->
io_req_work_drop_env() may get a bunch of different problems (miscount
fs->users, segfault, etc) on cleaning @fs.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c398ecb3

08 4月, 2020 6 次提交

io_uring: don't read user-shared sqe flags twice · 9c280f90

由 Pavel Begunkov 提交于 4月 08, 2020

Don't re-read userspace-shared sqe->flags, it can be exploited.
sqe->flags are copied into req->flags in io_submit_sqe(), check them
there instead.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9c280f90

io_uring: remove req init from io_get_req() · 0553b8bd

由 Pavel Begunkov 提交于 4月 08, 2020

io_get_req() do two different things: io_kiocb allocation and
initialisation. Move init part out of it and rename into
io_alloc_req(). It's simpler this way and also have better data
locality.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0553b8bd

io_uring: alloc req only after getting sqe · b1e50e54

由 Pavel Begunkov 提交于 4月 08, 2020

As io_get_sqe() split into 2 stage get/consume, get an sqe before
allocating io_kiocb, so no free_req*() for a failure case is needed,
and inline back __io_req_do_free(), which has only 1 user.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b1e50e54

io_uring: simplify io_get_sqring · 709b302f

由 Pavel Begunkov 提交于 4月 08, 2020

Make io_get_sqring() care only about sqes themselves, not initialising
the io_kiocb. Also, split it into get + consume, that will be helpful in
the future.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

709b302f

io_uring: do not always copy iovec in io_req_map_rw() · 45097dae

由 Xiaoguang Wang 提交于 4月 08, 2020

In io_read_prep() or io_write_prep(), io_req_map_rw() takes
struct io_async_rw's fast_iov as argument to call io_import_iovec(),
and if io_import_iovec() uses struct io_async_rw's fast_iov as
valid iovec array, later indeed io_req_map_rw() does not need
to do the memcpy operation, because they are same pointers.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

45097dae

io_uring: ensure openat sets O_LARGEFILE if needed · 08a1d26e

由 Jens Axboe 提交于 4月 08, 2020

OPENAT2 correctly sets O_LARGEFILE if it has to, but that escaped the
OPENAT opcode. Dmitry reports that his test case that compares openat()
and IORING_OP_OPENAT sees failures on large files:

*** sync openat
openat succeeded
sync write at offset 0
write succeeded
sync write at offset 4294967296
write succeeded

*** sync openat
openat succeeded
io_uring write at offset 0
write succeeded
io_uring write at offset 4294967296
write succeeded

*** io_uring openat
openat succeeded
sync write at offset 0
write succeeded
sync write at offset 4294967296
write failed: File too large

*** io_uring openat
openat succeeded
io_uring write at offset 0
write succeeded
io_uring write at offset 4294967296
write failed: File too large

Ensure we set O_LARGEFILE, if force_o_largefile() is true.

Cc: stable@vger.kernel.org # v5.6
Fixes: 15b71abe ("io_uring: add support for IORING_OP_OPENAT")
Reported-by: NDmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

08a1d26e

07 4月, 2020 2 次提交

io_uring: initialize fixed_file_data lock · f7fe9346

由 Xiaoguang Wang 提交于 4月 07, 2020

syzbot reports below warning:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 7099 Comm: syz-executor897 Not tainted 5.6.0-next-20200406-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 assign_lock_key kernel/locking/lockdep.c:913 [inline]
 register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225
 __lock_acquire+0x104/0x4e00 kernel/locking/lockdep.c:4223
 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4923
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159
 io_sqe_files_register fs/io_uring.c:6599 [inline]
 __io_uring_register+0x1fe8/0x2f00 fs/io_uring.c:8001
 __do_sys_io_uring_register fs/io_uring.c:8081 [inline]
 __se_sys_io_uring_register fs/io_uring.c:8063 [inline]
 __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:8063
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x440289
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffff1bbf558 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440289
RDX: 0000000020000280 RSI: 0000000000000002 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000401b10
R13: 0000000000401ba0 R14: 0000000000000000 R15: 0000000000000000

Initialize struct fixed_file_data's lock to fix this issue.

Reported-by: syzbot+e6eeca4a035da76b3065@syzkaller.appspotmail.com
Fixes: 05589553 ("io_uring: refactor file register/unregister/update handling")
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f7fe9346

io_uring: remove redundant variable pointer nxt and io_wq_assign_next call · 211fea18

由 Colin Ian King 提交于 4月 06, 2020

An earlier commit "io_uring: remove @nxt from handlers" removed the
setting of pointer nxt and now it is always null, hence the non-null
check and call to io_wq_assign_next is redundant and can be removed.

Addresses-Coverity: ("'Constant' variable guard")
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

211fea18

06 4月, 2020 1 次提交

io_uring: fix ctx refcounting in io_submit_sqes() · 48bdd849

由 Pavel Begunkov 提交于 4月 06, 2020

If io_get_req() fails, it drops a ref. Then, awhile keeping @Submitted
unmodified, io_submit_sqes() breaks the loop and puts @nr - @Submitted
refs. For each submitted req a ref is dropped in io_put_req() and
friends. So, for @nr taken refs there will be
(@nr - @Submitted + @Submitted + 1) dropped.

Remove ctx refcounting from io_get_req(), that at the same time makes
it clearer.

Fixes: 2b85edfc ("io_uring: batch getting pcpu references")
Cc: stable@vger.kernel.org # v5.6
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

48bdd849

04 4月, 2020 5 次提交

io_uring: process requests completed with -EAGAIN on poll list · 581f9810

由 Bijan Mottahedeh 提交于 4月 03, 2020

A request that completes with an -EAGAIN result after it has been added
to the poll list, will not be removed from that list in io_do_iopoll()
because the f_op->iopoll() will not succeed for that request.

Maintain a retryable local list similar to the done list, and explicity
reissue requests with an -EAGAIN result.
Signed-off-by: NBijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

581f9810

io_uring: remove bogus RLIMIT_NOFILE check in file registration · c336e992

由 Jens Axboe 提交于 4月 03, 2020

We already checked this limit when the file was opened, and we keep it
open in the file table. Hence when we added unit_inflight to the count
we want to register, we're doubly accounting these files. This results
in -EMFILE for file registration, if we're at half the limit.

Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c336e992

io_uring: use io-wq manager as backup task if task is exiting · aa96bf8a

由 Jens Axboe 提交于 4月 03, 2020

If the original task is (or has) exited, then the task work will not get
queued properly. Allow for using the io-wq manager task to queue this
work for execution, and ensure that the io-wq manager notices and runs
this work if woken up (or exiting).
Reported-by: NDan Melnic <dmm@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

aa96bf8a

io_uring: grab task reference for poll requests · 3537b6a7

由 Jens Axboe 提交于 4月 03, 2020

We can have a task exit if it's not the owner of the ring. Be safe and
grab an actual reference to it, to avoid a potential use-after-free.
Reported-by: NDan Melnic <dmm@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3537b6a7

io_uring: retry poll if we got woken with non-matching mask · a6ba632d

由 Jens Axboe 提交于 4月 03, 2020

If we get woken and the poll doesn't match our mask, re-add the task
to the poll waitqueue and try again instead of completing the request
with a mask of 0.
Reported-by: NDan Melnic <dmm@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a6ba632d

01 4月, 2020 1 次提交

io_uring: add missing finish_wait() in io_sq_thread() · 10bea96d

由 Hillf Danton 提交于 4月 01, 2020

Add it to pair with prepare_to_wait() in an attempt to avoid
anything weird in the field.

Fixes: b41e9852 ("io_uring: add per-task callback handler")
Reported-by: syzbot+0c3370f235b74b3cfd97@syzkaller.appspotmail.com
Signed-off-by: NHillf Danton <hdanton@sina.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

10bea96d

31 3月, 2020 1 次提交

io_uring: refactor file register/unregister/update handling · 05589553

由 Xiaoguang Wang 提交于 3月 31, 2020

While diving into io_uring fileset register/unregister/update codes, we
found one bug in the fileset update handling. io_uring fileset update
use a percpu_ref variable to check whether we can put the previously
registered file, only when the refcnt of the perfcpu_ref variable
reaches zero, can we safely put these files. But this doesn't work so
well. If applications always issue requests continually, this
perfcpu_ref will never have an chance to reach zero, and it'll always be
in atomic mode, also will defeat the gains introduced by fileset
register/unresiger/update feature, which are used to reduce the atomic
operation overhead of fput/fget.

To fix this issue, while applications do IORING_REGISTER_FILES or
IORING_REGISTER_FILES_UPDATE operations, we allocate a new percpu_ref
and kill the old percpu_ref, new requests will use the new percpu_ref.
Once all previous old requests complete, old percpu_refs will be dropped
and registered files will be put safely.

Link: https://lore.kernel.org/io-uring/5a8dac33-4ca2-4847-b091-f7dcd3ad0ff3@linux.alibaba.com/T/#tSigned-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

05589553

27 3月, 2020 1 次提交

io_uring: cleanup io_alloc_async_ctx() · 3d9932a8

由 Xiaoguang Wang 提交于 3月 27, 2020

Cleanup io_alloc_async_ctx() a bit, add a new __io_alloc_async_ctx(),
so io_setup_async_rw() won't need to check whether async_ctx is true
or false again.
Reviewed-by: NStefano Garzarella <sgarzare@redhat.com>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3d9932a8

25 3月, 2020 1 次提交

io_uring: fix missing 'return' in comment · bff6035d

由 Chucheng Luo 提交于 3月 25, 2020

The missing 'return' work may make it hard for other developers to
understand it.
Signed-off-by: NChucheng Luo <luochucheng@vivo.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bff6035d

23 3月, 2020 3 次提交

io-uring: drop 'free_pfile' in struct io_file_put · a5318d3c

由 Hillf Danton 提交于 3月 23, 2020

Sync removal of file is only used in case of a GFP_KERNEL kmalloc
failure at the cost of io_file_put::done and work flush, while a
glich like it can be handled at the call site without too much pain.

That said, what is proposed is to drop sync removing of file, and
the kink in neck as well.
Signed-off-by: NHillf Danton <hdanton@sina.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a5318d3c

io-uring: drop completion when removing file · 4afdb733

由 Hillf Danton 提交于 3月 23, 2020

A case of task hung was reported by syzbot,

INFO: task syz-executor975:9880 blocked for more than 143 seconds.
      Not tainted 5.6.0-rc6-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor975 D27576  9880   9878 0x80004000
Call Trace:
 schedule+0xd0/0x2a0 kernel/sched/core.c:4154
 schedule_timeout+0x6db/0xba0 kernel/time/timer.c:1871
 do_wait_for_common kernel/sched/completion.c:83 [inline]
 __wait_for_common kernel/sched/completion.c:104 [inline]
 wait_for_common kernel/sched/completion.c:115 [inline]
 wait_for_completion+0x26a/0x3c0 kernel/sched/completion.c:136
 io_queue_file_removal+0x1af/0x1e0 fs/io_uring.c:5826
 __io_sqe_files_update.isra.0+0x3a1/0xb00 fs/io_uring.c:5867
 io_sqe_files_update fs/io_uring.c:5918 [inline]
 __io_uring_register+0x377/0x2c00 fs/io_uring.c:7131
 __do_sys_io_uring_register fs/io_uring.c:7202 [inline]
 __se_sys_io_uring_register fs/io_uring.c:7184 [inline]
 __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:7184
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

and bisect pointed to 05f3fb3c ("io_uring: avoid ring quiesce for
fixed file set unregister and update").

It is down to the order that we wait for work done before flushing it
while nobody is likely going to wake us up.

We can drop that completion on stack as flushing work itself is a sync
operation we need and no more is left behind it.

To that end, io_file_put::done is re-used for indicating if it can be
freed in the workqueue worker context.
Reported-and-Inspired-by: Nsyzbot <syzbot+538d1957ce178382a394@syzkaller.appspotmail.com>
Signed-off-by: NHillf Danton <hdanton@sina.com>

Rename ->done to ->free_pfile
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4afdb733

io_uring: Fix ->data corruption on re-enqueue · 18a542ff

由 Pavel Begunkov 提交于 3月 23, 2020

work->data and work->list are shared in union. io_wq_assign_next() sets
->data if a req having a linked_timeout, but then io-wq may want to use
work->list, e.g. to do re-enqueue of a request, so corrupting ->data.

->data is not necessary, just remove it and extract linked_timeout
through @Link_list.

Fixes: 60cf46ae ("io-wq: hash dependent work")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

18a542ff

21 3月, 2020 1 次提交

io_uring: honor original task RLIMIT_FSIZE · 4ed734b0

由 Jens Axboe 提交于 3月 20, 2020

With the previous fixes for number of files open checking, I added some
debug code to see if we had other spots where we're checking rlimit()
against the async io-wq workers. The only one I found was file size
checking, which we should also honor.

During write and fallocate prep, store the max file size and override
that for the current ask if we're in io-wq worker context.

Cc: stable@vger.kernel.org # 5.1+
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4ed734b0

20 3月, 2020 2 次提交

io_uring: make sure accept honor rlimit nofile · 09952e3e

由 Jens Axboe 提交于 3月 19, 2020

Just like commit 4022e7af, this fixes the fact that
IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks
current->signal->rlim[] for limits.

Add an extra argument to __sys_accept4_file() that allows us to pass
in the proper nofile limit, and grab it at request prep time.
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

09952e3e

io_uring: make sure openat/openat2 honor rlimit nofile · 4022e7af

由 Jens Axboe 提交于 3月 19, 2020

Dmitry reports that a test case shows that io_uring isn't honoring a
modified rlimit nofile setting. get_unused_fd_flags() checks the task
signal->rlimi[] for the limits. As this isn't easily inheritable,
provide a __get_unused_fd_flags() that takes the value instead. Then we
can grab it when the request is prepared (from the original task), and
pass that in when we do the async part part of the open.
Reported-by: NDmitry Kadashev <dkadashev@gmail.com>
Tested-by: NDmitry Kadashev <dkadashev@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4022e7af

15 3月, 2020 2 次提交

io-wq: split hashing and enqueueing · 8766dd51

由 Pavel Begunkov 提交于 3月 14, 2020

It's a preparation patch removing io_wq_enqueue_hashed(), which
now should be done by io_wq_hash_work() + io_wq_enqueue().

Also, set hash value for dependant works, and do it as late as possible,
because req->file can be unavailable before. This hash will be ignored
by io-wq.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8766dd51

io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN} · f1d96a8f

由 Pavel Begunkov 提交于 3月 13, 2020

Processing links, io_submit_sqe() prepares requests, drops sqes, and
passes them with sqe=NULL to io_queue_sqe(). There IOSQE_DRAIN and/or
IOSQE_ASYNC requests will go through the same prep, which doesn't expect
sqe=NULL and fail with NULL pointer deference.

Always do full prepare including io_alloc_async_ctx() for linked
requests, and then it can skip the second preparation.

Cc: stable@vger.kernel.org # 5.5
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f1d96a8f

12 3月, 2020 1 次提交

io_uring: fix truncated async read/readv and write/writev retry · 3f9d6441

由 Jens Axboe 提交于 3月 11, 2020

Ensure we keep the truncated value, if we did truncate it. If not, we
might read/write more than the registered buffer size.

Also for retry, ensure that we return the truncated mapped value for
the vectorized versions of the read/write commands.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3f9d6441

11 3月, 2020 1 次提交

io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled · 32b2244a

由 Xiaoguang Wang 提交于 3月 11, 2020

When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need
to do io completion events polling again, they can rely on io_sq_thread to do
polling work, which can reduce cpu usage and uring_lock contention.

I modify fio io_uring engine codes a bit to evaluate the performance:
static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
                        continue;
                }

-               if (!o->sqpoll_thread) {
+               if (o->sqpoll_thread && o->hipri) {
                        r = io_uring_enter(ld, 0, actual_min,
                                                IORING_ENTER_GETEVENTS);
                        if (r < 0) {

and use "fio  -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread
-rw=read -ioengine=io_uring  -hipri=1 -sqthread_poll=1  -direct=1 -bs=4k
-size=10G -numjobs=1  -time_based -runtime=120"

original codes
--------------------------------------------------------------------
iodepth       |        4 |        8 |       16 |       32 |       64
bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
fio cpu usage |     100% |     100% |     100% |     100% |     100%
--------------------------------------------------------------------

with patch
--------------------------------------------------------------------
iodepth       |        4 |        8 |       16 |       32 |       64
bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
fio cpu usage |    63.8% |   74.4%% |    81.1% |    83.7% |    82.4%
--------------------------------------------------------------------
bw improve    |     5.5% |    13.2% |    12.3% |     9.8% |    11.5%
--------------------------------------------------------------------

From above test results, we can see that bw has above 5.5%~13%
improvement, and fio process's cpu usage also drops much. Note this
won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL
are both enabled, in this case, io_sq_thread always has 100% cpu usage.
I think this patch will be friendly to applications which will often use
io_uring_wait_cqe() or similar from liburing.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

32b2244a

10 3月, 2020 5 次提交

io_uring: Fix unused function warnings · 469956e8

由 YueHaibing 提交于 3月 04, 2020

If CONFIG_NET is not set, gcc warns:

fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
 static int io_setup_async_msg(struct io_kiocb *req,
            ^~~~~~~~~~~~~~~~~~

There are many funcions wraped by CONFIG_NET, move them
together to simplify code, also fix this warning.
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NYueHaibing <yuehaibing@huawei.com>

Minor tweaks.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

469956e8

io_uring: add end-of-bits marker and build time verify it · 84557871

由 Jens Axboe 提交于 3月 03, 2020

Not easy to tell if we're going over the size of bits we can shove
in req->flags, so add an end-of-bits marker and a BUILD_BUG_ON()
check for it.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

84557871

io_uring: provide means of removing buffers · 067524e9

由 Jens Axboe 提交于 3月 02, 2020

We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers
is to trigger IO on them. The usual case of shrinking a buffer pool
would be to just not replenish the buffers when IO completes, and
instead just free it. But it may be nice to have a way to manually
remove a number of buffers from a given group, and
IORING_OP_REMOVE_BUFFERS provides that functionality.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

067524e9

io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG · 52de1fe1

由 Jens Axboe 提交于 2月 27, 2020

Like IORING_OP_READV, this is limited to supporting just a single
segment in the iovec passed in.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

52de1fe1

io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV · 4d954c25

由 Jens Axboe 提交于 2月 27, 2020

This adds support for the vectored read. This is limited to supporting
just 1 segment in the iov, and is provided just for convenience for
applications that use IORING_OP_READV already.

The iov helpers will be used for IORING_OP_RECVMSG as well.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4d954c25

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功