- 01 10月, 2020 38 次提交
-
-
由 Jens Axboe 提交于
We always use &req->task_work anyway, no point in passing it in. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
All request preparations are done only during submission, reflect it in the code by moving io_req_prep() much earlier into io_queue_sqe(). That's much cleaner, because it doen't expose bits to async code which it won't ever use. Also it makes the interface harder to misuse, and there are potential places for bugs. For instance, __io_queue() doesn't clear @sqe before proceeding to a next linked request, that could have been disastrous, but hopefully there are linked requests IFF sqe==NULL, so not actually a bug. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
io_issue_sqe() does two things at once, trying to prepare request and issuing them. Split it in two and deduplicate with io_defer_prep(). Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
All io_*_prep() functions including io_{read,write}_prep() are called only during submission where @force_nonblock is always true. Don't keep propagating it and instead remove the @force_nonblock argument from prep() altogether. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so it's set/cleared in a single place. Also remove @force_nonblock parameter from io_prep_rw(). Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
REQ_F_NEED_CLEANUP is set only by io_*_prep() and they're guaranteed to be called only once, so there is no one who may have set the flag before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Put brackets around bitwise ops in a complex expression Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Extract common code from if/else branches. That is cleaner and optimised even better. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Joseph Qi 提交于
In most cases we'll specify IORING_SETUP_SQPOLL and run multiple io_uring instances in a host. Since all sqthreads are named "io_uring-sq", it's hard to distinguish the relations between application process and its io_uring sqthread. With this patch, application can get its corresponding sqthread pid and cpu through show_fdinfo. Steps: 1. Get io_uring fd first. $ ls -l /proc/<pid>/fd | grep -w io_uring 2. Then get io_uring instance related info, including corresponding sqthread pid and cpu. $ cat /proc/<pid>/fdinfo/<io_uring_fd> pos: 0 flags: 02000002 mnt_id: 13 SqThread: 6929 SqThreadCpu: 2 UserFiles: 1 0: testfile UserBufs: 0 PollList: Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NStefano Garzarella <sgarzare@redhat.com> [axboe: fixed for new shared SQPOLL] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We do this for CQ ring wait, in case task_work completions come in. We should do the same in io_uring_register(), to avoid spurious -EINTR if the ring quiescing ends up having to process task_work to complete the operation Reported-by: NDan Melnic <dmm@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dennis Zhou 提交于
There are a few operations that are offloaded to the worker threads. In this case, we lose process context and end up in kthread context. This results in ios to be not accounted to the issuing cgroup and consequently end up as issued by root. Just like others, adopt the personality of the blkcg too when issuing via the workqueues. For the SQPOLL thread, it will live and attach in the inited cgroup's context. Signed-off-by: NDennis Zhou <dennis@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
io_uring does account any registered buffer as pinned/locked memory, and checks limit and fails if the given user doesn't have a big enough limit to register the ranges specified. However, if huge pages are used, we are potentially under-accounting the memory in terms of what gets pinned on the vm side. This patch rectifies that, by ensuring that we account the full size of a compound page, regardless of how much of it is being registered. Huge pages are not accounted mulitple times - if multiple sections of a huge page is registered, then the page is only accounted once. Reported-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Zheng Bin 提交于
Fixes coccicheck warning: fs/io_uring.c:4242:13-14: Unneeded semicolon Signed-off-by: NZheng Bin <zhengbin13@huawei.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
In the spirit of fairness, cap the max number of SQ entries we'll submit for SQPOLL if we have multiple rings. If we don't do that, we could be submitting tons of entries for one ring, while others are waiting to get service. The value of 8 is somewhat arbitrarily chosen as something that allows a fair bit of batching, without using an excessive time per ring. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
There's really no point in having this union, it just means that we're always allocating enough room to cater to any command. But that's pointless, as the ->io field is request type private anyway. This gets rid of the io_async_ctx structure, and fills in the required size in the io_op_defs[] instead. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary, because in that case ctx->nr_user_bufs would be zero, and the following check would fail. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
When io_req_map_rw() is called from io_rw_prep_async(), it memcpy() iorw->iter into itself. Even though it doesn't lead to an error, such a memcpy()'s aliasing rules violation is considered to be a bad practise. Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any remapping there, so it's much simpler than the generic implementation. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Set rw->free_iovec to @iovec, that gives an identical result and stresses that @iovec param rw->free_iovec play the same role. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Don't touch iter->iov and iov in between __io_import_iovec() and io_req_map_rw(), the former function aleady sets it correctly, because it creates one more case with NULL'ed iov to consider in io_req_map_rw(). Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
When using SQPOLL, applications can run into the issue of running out of SQ ring entries because the thread hasn't consumed them yet. The only option for dealing with that is checking later, or busy checking for the condition. Provide IORING_ENTER_SQ_WAIT if applications want to wait on this condition. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
These structures are never written, move them appropriately. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We support using IORING_SETUP_ATTACH_WQ to share async backends between rings created by the same process, this now also allows the same to happen with SQPOLL. The setup procedure remains the same, the caller sets io_uring_params->wq_fd to the 'parent' context, and then the newly created ring will attach to that async backend. This means that multiple rings can share the same SQPOLL thread, saving resources. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Remove the SQPOLL thread from the ctx, and use the io_sq_data as the data structure we pass in. io_sq_data has a list of ctx's that we can then iterate over and handle. As of now we're ready to handle multiple ctx's, though we're still just handling a single one after this patch. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Move all the necessary state out of io_ring_ctx, and into a new structure, io_sq_data. The latter now deals with any state or variables associated with the SQPOLL thread itself. In preparation for supporting more than one io_ring_ctx per SQPOLL thread. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
This is done in preparation for handling more than one ctx, but it also cleans up the code a bit since io_sq_thread() was a bit too unwieldy to get a get overview on. __io_sq_thread() is now the main handler, and it returns an enum sq_ret that tells io_sq_thread() what it ended up doing. The parent then makes a decision on idle, spinning, or work handling based on that. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We need to decouple the clearing on wakeup from the the inline schedule, as that is going to be required for handling multiple rings in one thread. Wrap our wakeup handler so we can clear it when we get the wakeup, by definition that is when we no longer need the flag set. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
This is in preparation to sharing the poller thread between rings. For that we need per-ring wait_queue_entry storage, and we can't easily put that on the stack if one thread is managing multiple rings. We'll also be sharing the wait_queue_head across rings for the purposes of wakeups, provide the usual private ring wait_queue_head for now but make it a pointer so we can easily override it when sharing. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We're not handling signals by default in kernel threads, and we never use TWA_SIGNAL for the SQPOLL thread internally. Hence we can never have a signal pending, and we don't need to check for it (nor flush it). Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Stefano Garzarella 提交于
This patch adds a new IORING_SETUP_R_DISABLED flag to start the rings disabled, allowing the user to register restrictions, buffers, files, before to start processing SQEs. When IORING_SETUP_R_DISABLED is set, SQE are not processed and SQPOLL kthread is not started. The restrictions registration are allowed only when the rings are disable to prevent concurrency issue while processing SQEs. The rings can be enabled using IORING_REGISTER_ENABLE_RINGS opcode with io_uring_register(2). Suggested-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Reviewed-by: NKees Cook <keescook@chromium.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Stefano Garzarella 提交于
The new io_uring_register(2) IOURING_REGISTER_RESTRICTIONS opcode permanently installs a feature allowlist on an io_ring_ctx. The io_ring_ctx can then be passed to untrusted code with the knowledge that only operations present in the allowlist can be executed. The allowlist approach ensures that new features added to io_uring do not accidentally become available when an existing application is launched on a newer kernel version. Currently is it possible to restrict sqe opcodes, sqe flags, and register opcodes. IOURING_REGISTER_RESTRICTIONS can only be made once. Afterwards it is not possible to change restrictions anymore. This prevents untrusted code from removing restrictions. Suggested-by: NStefan Hajnoczi <stefanha@redhat.com> Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Reviewed-by: NKees Cook <keescook@chromium.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If we don't get and assign the namespace for the async work, then certain paths just don't work properly (like /dev/stdin, /proc/mounts, etc). Anything that references the current namespace of the given task should be assigned for async work on behalf of that task. Cc: stable@vger.kernel.org # v5.5+ Reported-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Grab actual references to the files_struct. To avoid circular references issues due to this, we add a per-task note that keeps track of what io_uring contexts a task has used. When the tasks execs or exits its assigned files, we cancel requests based on this tracking. With that, we can grab proper references to the files table, and no longer need to rely on stashing away ring_fd and ring_file to check if the ring_fd may have been closed. Cc: stable@vger.kernel.org # v5.5+ Reviewed-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
This allows us to selectively flush out pending overflows, depending on the task and/or files_struct being passed in. No intended functional changes in this patch. Reviewed-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Return whether we found and canceled requests or not. This is in preparation for using this information, no functional changes in this patch. Reviewed-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Sometimes we assign a weak reference to it, sometimes we grab a reference to it. Clean this up and make it unconditional, and drop the flag related to tracking this state. Reviewed-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We can grab a reference to the task instead of stashing away the task files_struct. This is doable without creating a circular reference between the ring fd and the task itself. Reviewed-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
No functional changes in this patch, prep patch for grabbing references to the files_struct. Reviewed-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We currently cancel these when the ring exits, and we cancel all of them. This is in preparation for killing only the ones associated with a given task. Reviewed-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 29 9月, 2020 1 次提交
-
-
由 Hao Xu 提交于
The async buffered reads feature is not working when readahead is turned off. There are two things to concern: - when doing retry in io_read, not only the IOCB_WAITQ flag but also the IOCB_NOWAIT flag is still set, which makes it goes to would_block phase in generic_file_buffered_read() and then return -EAGAIN. After that, the io-wq thread work is queued, and later doing the async reads in the old way. - even if we remove IOCB_NOWAIT when doing retry, the feature is still not running properly, since in generic_file_buffered_read() it goes to lock_page_killable() after calling mapping->a_ops->readpage() to do IO, and thus causing process to sleep. Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()") Fixes: 3b2a4439 ("io_uring: get rid of kiocb_wait_page_queue_init()") Signed-off-by: NHao Xu <haoxu@linux.alibaba.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 28 9月, 2020 1 次提交
-
-
由 Jens Axboe 提交于
syzbot reports a potential lock deadlock between the normal IO path and ->show_fdinfo(): ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc6-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.2/19710 is trying to acquire lock: ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296 but task is already holding lock: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&ctx->uring_lock){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 __io_uring_show_fdinfo fs/io_uring.c:8417 [inline] io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460 seq_show+0x4a8/0x700 fs/proc/fd.c:65 seq_read+0x432/0x1070 fs/seq_file.c:208 do_loop_readv_writev fs/read_write.c:734 [inline] do_loop_readv_writev fs/read_write.c:721 [inline] do_iter_read+0x48e/0x6e0 fs/read_write.c:955 vfs_readv+0xe5/0x150 fs/read_write.c:1073 kernel_readv fs/splice.c:355 [inline] default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412 do_splice_to+0x137/0x170 fs/splice.c:871 splice_direct_to_actor+0x307/0x980 fs/splice.c:950 do_splice_direct+0x1b3/0x280 fs/splice.c:1059 do_sendfile+0x55f/0xd40 fs/read_write.c:1540 __do_sys_sendfile64 fs/read_write.c:1601 [inline] __se_sys_sendfile64 fs/read_write.c:1587 [inline] __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&p->lock){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 seq_read+0x61/0x1070 fs/seq_file.c:155 pde_read fs/proc/inode.c:306 [inline] proc_reg_read+0x221/0x300 fs/proc/inode.c:318 do_loop_readv_writev fs/read_write.c:734 [inline] do_loop_readv_writev fs/read_write.c:721 [inline] do_iter_read+0x48e/0x6e0 fs/read_write.c:955 vfs_readv+0xe5/0x150 fs/read_write.c:1073 kernel_readv fs/splice.c:355 [inline] default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412 do_splice_to+0x137/0x170 fs/splice.c:871 splice_direct_to_actor+0x307/0x980 fs/splice.c:950 do_splice_direct+0x1b3/0x280 fs/splice.c:1059 do_sendfile+0x55f/0xd40 fs/read_write.c:1540 __do_sys_sendfile64 fs/read_write.c:1601 [inline] __se_sys_sendfile64 fs/read_write.c:1587 [inline] __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (sb_writers#4){.+.+}-{0:0}: check_prev_add kernel/locking/lockdep.c:2496 [inline] check_prevs_add kernel/locking/lockdep.c:2601 [inline] validate_chain kernel/locking/lockdep.c:3218 [inline] __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441 lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write+0x228/0x450 fs/super.c:1672 io_write+0x6b5/0xb30 fs/io_uring.c:3296 io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719 __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175 io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254 io_submit_sqe fs/io_uring.c:6324 [inline] io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521 __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: sb_writers#4 --> &p->lock --> &ctx->uring_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ctx->uring_lock); lock(&p->lock); lock(&ctx->uring_lock); lock(sb_writers#4); *** DEADLOCK *** 1 lock held by syz-executor.2/19710: #0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348 stack backtrace: CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827 check_prev_add kernel/locking/lockdep.c:2496 [inline] check_prevs_add kernel/locking/lockdep.c:2601 [inline] validate_chain kernel/locking/lockdep.c:3218 [inline] __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441 lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write+0x228/0x450 fs/super.c:1672 io_write+0x6b5/0xb30 fs/io_uring.c:3296 io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719 __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175 io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254 io_submit_sqe fs/io_uring.c:6324 [inline] io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521 __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45e179 Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004 RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c Fix this by just not diving into details if we fail to trylock the io_uring mutex. We know the ctx isn't going away during this operation, but we cannot safely iterate buffers/files/personalities if we don't hold the io_uring mutex. Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com Signed-off-by: NJens Axboe <axboe@kernel.dk>
-