- 20 12月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
We use it in some spots, but not consistently. Convert the rest over, makes it easier to read as well. No functional changes in this patch. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 19 12月, 2019 2 次提交
-
-
由 Jens Axboe 提交于
I've been chasing a weird and obscure crash that was userspace stack corruption, and finally narrowed it down to a bit flip that made a stack address invalid. io_wq_submit_work() unconditionally flips the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work handler, this isn't valid. Normal read/write operations own that part of the request, on other types it could be something else. Move the IOCB_NOWAIT clear to the read/write handlers where it belongs. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
There is no reliable way to submit and wait in a single syscall, as io_submit_sqes() may under-consume sqes (in case of an early error). Then it will wait for not-yet-submitted requests, deadlocking the user in most cases. Don't wait/poll if can't submit all sqes Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 18 12月, 2019 9 次提交
-
-
由 Jens Axboe 提交于
Now that we have all the opcodes handled in terms of command prep and SQE reuse, add a printk_once() to warn about any potentially new and unhandled ones. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If we defer a request, we can't be reading the opcode again. Ensure that the user_data and opcode fields are stable. For the user_data we already have a place for it, for the opcode we can fill a one byte hold and store that as well. For both of them, assign them when we originally read the SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data is switched to req->opcode and req->user_data. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If we defer this command as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the timeout remove op into the prep handling to make it safe for SQE reuse. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If we defer this command as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the async cancel op into the prep handling to make it safe for SQE reuse. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If we defer these commands as part of a link, we have to make sure that the SQE data has been read upfront. Integrate the poll add/remove into the prep handling to make it safe for SQE reuse. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
The rules are as follows, if IOSQE_IO_HARDLINK is specified, then it's a link and there is no need to set IOSQE_IO_LINK separately, though it could be there. Add proper check and ensure that IOSQE_IO_HARDLINK implies IOSQE_IO_LINK. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We're currently not retaining sqe data for accept, fsync, and sync_file_range. None of these commands need data outside of what is directly provided, hence it can't go stale when the request is deferred. However, it can get reused, if an application reuses SQE entries. Ensure that we retain the information we need and only read the sqe contents once, off the submission path. Most of this is just moving code into a prep and finish function. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We pass in req->sqe for all of them, no need to pass it in as the request is always passed in. This is a necessary prep patch to be able to cleanup/fix the request prep path. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Some of these code paths assume that any force_nonblock == true issue is not prepped, but that's not true if we did prep as part of link setup earlier. Check if we already have an async context allocate before setting up a new one. Cleanup the async context setup in general, we have a lot of duplicated code there. Fixes: 03b1230c ("io_uring: ensure async punted sendmsg/recvmsg requests copy data") Fixes: f67676d1 ("io_uring: ensure async punted read/write requests copy iovec") Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 16 12月, 2019 2 次提交
-
-
由 Jens Axboe 提交于
If we have to punt the recvmsg to async context, we copy all the context. But since the iovec used can be either on-stack (if small) or dynamically allocated, if it's on-stack, then we need to ensure we reset the iov pointer. If we don't, then we're reusing old stack data, and that can lead to -EFAULTs if things get overwritten. Ensure we retain the right pointers for the iov, and free it as well if we end up having to go beyond UIO_FASTIOV number of vectors. Fixes: 03b1230c ("io_uring: ensure async punted sendmsg/recvmsg requests copy data") Reported-by: N李通洲 <carter.li@eoitek.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Brian Gianforcaro 提交于
- Fix a few typos found while reading the code. - Fix stale io_get_sqring comment referencing s->sqe, the 's' parameter was renamed to 'req', but the comment still holds. Signed-off-by: NBrian Gianforcaro <b.gianfo@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 12 12月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
If we submit an unknown opcode and have fd == -1, io_op_needs_file() will return true as we default to needing a file. Then when we go and assign the file, we find the 'fd' invalid and return -EBADF. We really should be returning -EINVAL for that case, as we normally do for unsupported opcodes. Change io_op_needs_file() to have the following return values: 0 - does not need a file 1 - does need a file < 0 - error value and use this to pass back the right value for this invalid case. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 11 12月, 2019 7 次提交
-
-
由 Jens Axboe 提交于
In chasing a performance issue between using IORING_OP_RECVMSG and IORING_OP_READV on sockets, tracing showed that we always punt the socket reads to async offload. This is due to io_file_supports_async() not checking for S_ISSOCK on the inode. Since sockets supports the O_NONBLOCK (or MSG_DONTWAIT) flag just fine, add sockets to the list of file types that we can do a non-blocking issue to. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We hash regular files to avoid having multiple threads hammer on the inode mutex, but it should not be needed on other types of files (like sockets). Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
One major use case of linked commands is the ability to run the next link inline, if at all possible. This is done correctly for async offload, but somewhere along the line we lost the ability to do so when we were able to complete a request without having to punt it. Ensure that we do so correctly. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
This essentially reverts commit e944475e. For high poll ops workloads, like TAO, the dynamic allocation of the wait_queue entry for IORING_OP_POLL_ADD adds considerable extra overhead. Go back to embedding the wait_queue_entry, but keep the usage of wait->private for the pointer stashing. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Don't just assign it from the main call path, that can miss the case when we're called from issue deferral. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We use the mutex to guard against registered file updates, for instance. Ensure we're safe in accessing that state against concurrent updates. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Some commands will invariably end in a failure in the sense that the completion result will be less than zero. One such example is timeouts that don't have a completion count set, they will always complete with -ETIME unless cancelled. For linked commands, we sever links and fail the rest of the chain if the result is less than zero. Since we have commands where we know that will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever regardless of the completion result. Note that the link will still sever if we fail submitting the parent request, hard links are only resilient in the presence of completion results for requests that did submit correctly. Cc: stable@vger.kernel.org # v5.4 Reviewed-by: NPavel Begunkov <asml.silence@gmail.com> Reported-by: N李通洲 <carter.li@eoitek.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 05 12月, 2019 6 次提交
-
-
由 LimingWu 提交于
thatn -> than. Signed-off-by: NLiming Wu <19092205@suning.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Links are created by chaining requests through req->list with an exception that head uses req->link_list. (e.g. link_list->list->list) Because of that, io_req_link_next() needs complex splicing to advance. Link them all through list_list. Also, it seems to be simpler and more consistent IMHO. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
In case of an error io_submit_sqe() drops a request and continues without it, even if the request was a part of a link. Not only it doesn't cancel links, but also may execute wrong sequence of actions. Stop consuming sqes, and let the user handle errors. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We recently changed this from a single list to an rbtree, but for some real life workloads, the rbtree slows down the submission/insertion case enough so that it's the top cycle consumer on the io_uring side. In testing, using a hash table is a more well rounded compromise. It is fast for insertion, and as long as it's sized appropriately, it works well for the cancellation case as well. Running TAO with a lot of network sockets, this removes io_poll_req_insert() from spending 2% of the CPU cycles. Reported-by: NDan Melnic <dmm@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If we defer a timeout, we should ensure that we copy the timespec when we have consumed the sqe. This is similar to commit f67676d1 for read/write requests. We already did this correctly for timeouts deferred as links, but do it generally and use the infrastructure added by commit 1a6b74fc instead of having the timeout deferral use its own. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
There's really no reason why we forbid things like link/drain etc on regular timeout commands. Enable the usual SQE flags on timeouts. Reported-by: N李通洲 <carter.li@eoitek.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 04 12月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
Right now we return it to userspace, which means the application has to poll for the socket to be writeable. Let's just treat it like -EAGAIN and have io_uring handle it internally, this makes it much easier to use. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 03 12月, 2019 7 次提交
-
-
由 Jackie Liu 提交于
Parameter ctx we have never used, clean it up. Signed-off-by: NJackie Liu <liuyun01@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If this flag is set, applications can be certain that any data for async offload has been consumed when the kernel has consumed the SQE. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Just like commit f67676d1 for read/write requests, this one ensures that the sockaddr data has been copied for IORING_OP_CONNECT if we need to punt the request to async context. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Just like commit f67676d1 for read/write requests, this one ensures that the msghdr data is fully copied if we need to punt a recvmsg or sendmsg system call to async context. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Currently we don't copy the iovecs when we punt to async context. This can be problematic for applications that store the iovec on the stack, as they often assume that it's safe to let the iovec go out of scope as soon as IO submission has been called. This isn't always safe, as we will re-copy the iovec once we're in async context. Make this 100% safe by copying the iovec just once. With this change, applications may safely store the iovec on the stack for all cases. Reported-by: N李通洲 <carter.li@eoitek.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Right now we just copy the sqe for async offload, but we want to store more context across an async punt. In preparation for doing so, put the sqe copy inside a structure that we can expand. With this pointer added, we can get rid of REQ_F_FREE_SQE, as that is now indicated by whether req->io is NULL or not. No functional changes in this patch. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
We should never return -ERESTARTSYS to userspace, transform it into -EINTR. Cc: stable@vger.kernel.org # v5.3+ Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 12月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
syzbot reports: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline] RIP: 0010:__validate_creds include/linux/cred.h:187 [inline] RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550 Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318 RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010 RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849 R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000 R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274 kthread+0x361/0x430 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Modules linked in: ---[ end trace f2e1a4307fbe2245 ]--- RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline] RIP: 0010:__validate_creds include/linux/cred.h:187 [inline] RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550 Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318 RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010 RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849 R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000 R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 which is caused by slab fault injection triggering a failure in prepare_creds(). We don't actually need to create a copy of the creds as we're not modifying it, we just need a reference on the current task creds. This avoids the failure case as well, and propagates the const throughout the stack. Fixes: 181e448d ("io_uring: async workers should inherit the user creds") Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 30 11月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
Christophe reports that current master fails building on powerpc with this error: CC fs/io_uring.o fs/io_uring.c: In function ‘loop_rw_iter’: fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’ [-Werror=implicit-function-declaration] iovec.iov_base = kmap(iter->bvec->bv_page) ^ fs/io_uring.c:1628:19: warning: assignment makes pointer from integer without a cast [-Wint-conversion] iovec.iov_base = kmap(iter->bvec->bv_page) ^ fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’ [-Werror=implicit-function-declaration] kunmap(iter->bvec->bv_page); ^ which is caused by a missing highmem.h include. Fix it by including it. Fixes: 311ae9e1 ("io_uring: fix dead-hung for non-iter fixed rw") Reported-by: NChristophe Leroy <christophe.leroy@c-s.fr> Tested-by: NChristophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 29 11月, 2019 1 次提交
-
-
由 Roman Penyaev 提交于
That is a bit weird scenario but I find it interesting to run fio loads using LKL linux, where MMU is disabled. Probably other real archs which run uClinux can also benefit from this patch. Signed-off-by: NRoman Penyaev <rpenyaev@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 11月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
In the quest to bring io_kiocb down to 3 cachelines, this one does the trick. Make the wait_queue_entry for the poll command come out of kmalloc instead of embedding it in struct io_poll_iocb, as the latter is the largest member of io_kiocb. Once we trim this down a bit, we're back at a healthy 192 bytes for struct io_kiocb. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-