- 26 Sep, 2019 1 commit
-
-
Committed by Jens Axboe
For batched IO, it's not uncommon for waiters to ask for more than 1 IO to complete before being woken up. This is a problem with wait_event() since tasks will get woken for every IO that completes, re-check condition, then go back to sleep. For batch counts on the order of what you do for high IOPS, that can result in 10s of extra wakeups for the waiting task.

Add a private wake function that checks for the wake up count criteria being met before calling autoremove_wake_function(). Pavel reports that one test case he has runs 40% faster with proper batching of wakeups.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
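The effect of the private wake function can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct wait_model` and the three functions are hypothetical names invented for illustration, not the kernel's wait queue types, and the model only counts wakeups.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a waiter; not the kernel's structures. */
struct wait_model {
    unsigned completed; /* completions posted so far */
    unsigned to_wait;   /* completions the waiter asked for */
    unsigned wakeups;   /* times the waiting task was actually woken */
};

/* wait_event()-style behaviour: every completion wakes the waiter,
 * which re-checks its condition and may go back to sleep. */
static bool wake_every_time(struct wait_model *w)
{
    w->completed++;
    w->wakeups++;
    return w->completed >= w->to_wait;
}

/* Filtered wake function: check the count first and only wake the
 * waiter once the requested number of completions is available. */
static bool wake_when_batch_ready(struct wait_model *w)
{
    w->completed++;
    if (w->completed < w->to_wait)
        return false;
    w->wakeups++;
    return true;
}

/* Post completions until the waiter is satisfied; return the wakeup count. */
static unsigned count_wakeups(unsigned to_wait, bool batched)
{
    struct wait_model w = { .completed = 0, .to_wait = to_wait, .wakeups = 0 };

    assert(to_wait > 0);
    for (;;) {
        bool done = batched ? wake_when_batch_ready(&w) : wake_every_time(&w);
        if (done)
            break;
    }
    return w.wakeups;
}
```

In this model, a waiter asking for 32 completions is woken 32 times by the unfiltered path and once by the filtered one.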
-
- 24 Sep, 2019 2 commits
-
-
Committed by yangerkun
After 75b28aff ("io_uring: allocate the two rings together"), we compare sq.head with cached_cq_tail to determine whether any cq entry is invalid. We should use cq.head instead.

Fixes: 75b28aff ("io_uring: allocate the two rings together")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
Currently we just -EINVAL a read or write to an fd that isn't backed by ->read_iter() or ->write_iter(). But we can handle them just fine, as long as we punt to async context first.

Implement a simple loop function for doing ->read() or ->write() instead, and ensure we call it appropriately.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 19 Sep, 2019 6 commits
-
-
Committed by Jens Axboe
There have been a few requests for functionality similar to io_getevents() and epoll_wait(), where the user can specify a timeout for waiting on events. I deliberately did not add support for this through the system call initially to avoid overloading the args, but I can see that the use cases for this are valid.

This adds support for IORING_OP_TIMEOUT. If a user wants to get woken when waiting for events, simply submit one of these timeout commands with your wait call (or before). This ensures that the application sleeping on the CQ ring waiting for events will get woken. The timeout command is passed in as a pointer to a struct timespec. Timeouts are relative. The timeout command also includes a way to auto-cancel after N events have passed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
If preempt isn't enabled in the kernel, we can run into hang issues with sqthread submissions. Use cond_resched() to play nice instead of cpu_relax(), if we end up starting the loop and not having any events pending for submissions.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jackie Liu
Sometimes io_get_req will return NULL. We then need to do the correct error handling, otherwise it will cause a kernel NULL pointer dereference.

Fixes: 4fe2c963 ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
If we end up getting woken in poll (due to a signal), then we may need to punt the poll request to an async worker. When we do that, we look up the list to queue at, dereferencing req->submit.sqe, however that is only set for requests we initially decided to queue async.

This fixes a crash with poll command usage and wakeups that need to punt to async context.

Fixes: 54a91f3b ("io_uring: limit parallelism of buffered writes")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jackie Liu
There is a potential dangling pointer problem. We never clear shadow_req, so if there are multiple link lists in this series of sqes, shadow_req is not reallocated and the last one continues to be used. But its memory has already been released by then, which leaves a dangling pointer. Clear it, and make sure that every new link list allocates a new shadow_req.

Fixes: 4fe2c963 ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jackie Liu
Just clean up the code, no functional changes.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 15 Sep, 2019 1 commit
-
-
Committed by Daniel Xu
Some workloads can require far more than 4K outstanding entries. For example memcached can have ~300K sockets over ~40 cores. Bumping the max to 32K seems to work pretty well.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 13 Sep, 2019 2 commits
-
-
Committed by Jens Axboe
The way the logic is set up in io_uring_enter() means that you can't wake up the SQ poller thread while at the same time waiting (or polling) for completions afterwards. There's no reason for that to be the case.

Reported-by: Lewis Baker <lbaker@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We currently merge async work items if we see a strict sequential hit. This helps avoid unnecessary workqueue switches when we don't need them. We can extend this merging to cover cases where it's not a strict sequential hit, but the IO still fits within the same page. If an application is doing multiple requests within the same page, we don't want separate workers waiting on the same page to complete IO. It's much faster to let the first worker bring in the page, then operate on that page from the same worker to complete the next request(s).

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
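The difference between a strict sequential hit and a same-page hit can be sketched as follows. This is an illustrative model, not the kernel's merge logic: `struct prev_io`, the function names, and the 4 KiB page size are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical record of the previously queued IO. */
struct prev_io {
    uint64_t start; /* file offset where the previous IO began */
    uint64_t len;   /* length of the previous IO in bytes */
};

/* Strict sequential hit: the new IO starts exactly where the last one ended. */
static bool strictly_sequential(const struct prev_io *prev, uint64_t pos)
{
    return pos == prev->start + prev->len;
}

/* Relaxed hit: the new IO starts in a page the previous IO already touched. */
static bool within_prev_pages(const struct prev_io *prev, uint64_t pos)
{
    uint64_t first_page, last_page, page;

    assert(prev->len > 0);
    first_page = prev->start >> MODEL_PAGE_SHIFT;
    last_page = (prev->start + prev->len - 1) >> MODEL_PAGE_SHIFT;
    page = pos >> MODEL_PAGE_SHIFT;
    return page >= first_page && page <= last_page;
}

/* Merge into the existing worker if either criterion matches. */
static bool should_merge(const struct prev_io *prev, uint64_t pos)
{
    return strictly_sequential(prev, pos) || within_prev_pages(prev, pos);
}
```

With a previous IO of 512 bytes at offset 4096, a request at offset 6000 is not sequential but lands in the same page, so this model would hand it to the same worker.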
-
- 10 Sep, 2019 5 commits
-
-
Committed by Jens Axboe
All the popular filesystems need to grab the inode lock for buffered writes. With io_uring punting buffered writes to async context, we observe a lot of contention with all workers hammering this mutex.

For buffered writes, we generally don't need a lot of parallelism on the submission side, as the flushing will take care of that for us. Hence we don't need a deep queue on the write side, as long as we can safely punt from the original submission context.

Add a workqueue with a limit of 2 that we can use for buffered writes. This greatly improves the performance and efficiency of higher queue depth buffered async writes with io_uring.

Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
Add a helper for queueing a request for async execution, in preparation for optimizing it. No functional change in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
For some applications that end up using a submit-and-wait type of approach for certain batches of IO, we can make that a bit more efficient by allowing the application to block for the last IO submission. This prevents an async punt when we don't need it, as the application will be blocking for the completion event(s) anyway.

Typical use cases are using the liburing io_uring_submit_and_wait() API, or just using io_uring_enter() doing both submissions and completions. As a specific example, RocksDB doing MultiGet() is sped up quite a bit with this change.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jackie Liu
To support link with drain, we need to do two things.

Given these sqes:

       0     1     2     3     4     5     6
    +-----+-----+-----+-----+-----+-----+-----+
    |  N  |  L  |  L  | L+D |  N  |  N  |  N  |
    +-----+-----+-----+-----+-----+-----+-----+

First, we need to ensure that the io before the link is completed. An easy way is to set the drain flag on the link list's head, so all subsequent io will be inserted into the defer_list.

        +-----+
    (0) |  N  |
        +-----+
           |          (2)         (3)         (4)
        +-----+     +-----+     +-----+     +-----+
    (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
        +-----+     +-----+     +-----+     +-----+
           |
        +-----+
    (5) |  N  |
        +-----+
           |
        +-----+
    (6) |  N  |
        +-----+

Second, ensure that the following io will not be completed first. An easy way is to create a mirror of the drain io and insert it into the defer_list. This way, as long as the drain io is not processed, the following io in the defer_list will not be actively processed.

        +-----+
    (0) |  N  |
        +-----+
           |          (2)         (3)         (4)
        +-----+     +-----+     +-----+     +-----+
    (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
        +-----+     +-----+     +-----+     +-----+
           |
        +-----+
   ('3) |  D  |   <== This is a shadow of (3)
        +-----+
           |
        +-----+
    (5) |  N  |
        +-----+
           |
        +-----+
    (6) |  N  |
        +-----+

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jackie Liu
Sqo_thread will get sqring entries in batches, which will cause ctx->cached_sq_head to be advanced in batches. If one of these sqes has the DRAIN flag set, it will never get a chance to be processed, and sqo_thread will never exit.

Fixes: de0617e4 ("io_uring: add support for marking commands as draining")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 07 Sep, 2019 1 commit
-
-
Committed by Jens Axboe
After commit 75b28aff we can get by with just a single mmap to map both the sq and cq ring. However, userspace doesn't know that.

Add a features variable to io_uring_params, and notify userspace that the kernel has this ability. This can then be used in liburing (or in applications directly) to avoid the second mmap.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 28 Aug, 2019 2 commits
-
-
Committed by Hristo Venev
Both the sq and the cq rings have sizes just over a power of two, and the sq ring is significantly smaller. By bundling them in a single allocation, we get the sq ring for free.

This also means that IORING_OFF_SQ_RING and IORING_OFF_CQ_RING now mean the same thing. If we indicate this to userspace, we can save a mmap call.

Signed-off-by: Hristo Venev <hristo@venev.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by John Hubbard
For pages that were retained via get_user_pages*(), release those pages via the new put_user_page*() routines, instead of via put_page() or release_pages().

This is part of a tree-wide conversion, as described in commit fc1d8e7c ("mm: introduce put_user_page*(), placeholder versions").

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 23 Aug, 2019 1 commit
-
-
Committed by Jens Axboe
The outer poll loop checks for whether we need to reschedule, and returns to userspace if we do. However, it's possible to get stuck in the inner loop as well, if the CPU we are running on needs to reschedule to finish the IO work.

Add the need_resched() check in the inner loop as well. This fixes a potential hang if the kernel is configured with CONFIG_PREEMPT_VOLUNTARY=y.

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 Aug, 2019 2 commits
-
-
Committed by Jens Axboe
We need to check if we have CQEs pending before starting a poll loop, as those could be the events we will be spinning for (and hence we'll find none). This can happen if a CQE triggers an error, or if it is found by e.g. an IRQ before we get a chance to find it through polling.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
If a request issue ends up being punted to async context to avoid blocking, we can get into a situation where the original application enters the poll loop for that very request before it has been issued. This should not be an issue, except that the polling will hold the io_uring uring_ctx mutex for the duration of the poll. When the async worker has actually issued the request, it needs to acquire this mutex to add the request to the poll issued list. Since the application polling is already holding this mutex, the workqueue sleeps on the mutex forever, and the application thus never gets a chance to poll for the very request it was interested in.

Fix this by ensuring that the polling drops the uring_ctx occasionally if it's not making any progress.

Reported-by: Jeffrey M. Birnbaum <jmbnyc@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 16 Aug, 2019 2 commits
-
-
Committed by Jackie Liu
This patch may fix two issues:

First, when IOSQE_IO_DRAIN is set, the following IOs need to be inserted into the defer list to delay execution, but link io will be actively scheduled to run by calling io_queue_sqe.

Second, when multiple LINK_IOs are inserted together into the defer_list, the LINK_IOs no longer keep their order.

    |-------------|
    |   LINK_IO   |      ----> insert to defer_list  -----------
    |-------------|                                            |
    |   LINK_IO   |      ----> insert to defer_list  ----------|
    |-------------|                                            |
    |   LINK_IO   |      ----> insert to defer_list  ----------|
    |-------------|                                            |
    |  NORMAL_IO  |      ----> insert to defer_list  ----------|
    |-------------|                                            |
                                                               |
                              queue_work at same time    <-----|

Fixes: 9e645e11 ("io_uring: add support for sqe links")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Aleix Roca Nonell
Commit bd11b3a3 ("io_uring: don't use iov_iter_advance() for fixed buffers") introduced an optimization to avoid using the slow iov_iter_advance by manually populating the iov_iter iterator in some cases.

However, the computation of the iterator count field was erroneous: The first bvec was always accounted for an extent of page size even if the bvec length was smaller.

In consequence, some I/O operations on fixed buffers were unable to operate on the full extent of the buffer, consistently skipping some bytes at the end of it.

Fixes: bd11b3a3 ("io_uring: don't use iov_iter_advance() for fixed buffers")
Cc: stable@vger.kernel.org
Signed-off-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 31 Jul, 2019 1 commit
-
-
Committed by Jackie Liu
[root@localhost ~]# ./liburing/test/link

QEMU Standard PC report that:

[ 29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622-dirty #86
[ 29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work
[ 29.379929] Call Trace:
[ 29.379953] dump_stack+0xa9/0x10e
[ 29.379970] ? io_sq_wq_submit_work+0xbf4/0xe90
[ 29.379986] print_address_description.cold.6+0x9/0x317
[ 29.379999] ? io_sq_wq_submit_work+0xbf4/0xe90
[ 29.380010] ? io_sq_wq_submit_work+0xbf4/0xe90
[ 29.380026] __kasan_report.cold.7+0x1a/0x34
[ 29.380044] ? io_sq_wq_submit_work+0xbf4/0xe90
[ 29.380061] kasan_report+0xe/0x12
[ 29.380076] io_sq_wq_submit_work+0xbf4/0xe90
[ 29.380104] ? io_sq_thread+0xaf0/0xaf0
[ 29.380152] process_one_work+0xb59/0x19e0
[ 29.380184] ? pwq_dec_nr_in_flight+0x2c0/0x2c0
[ 29.380221] worker_thread+0x8c/0xf40
[ 29.380248] ? __kthread_parkme+0xab/0x110
[ 29.380265] ? process_one_work+0x19e0/0x19e0
[ 29.380278] kthread+0x30b/0x3d0
[ 29.380292] ? kthread_create_on_node+0xe0/0xe0
[ 29.380311] ret_from_fork+0x3a/0x50

[ 29.380635] Allocated by task 209:
[ 29.381255] save_stack+0x19/0x80
[ 29.381268] __kasan_kmalloc.constprop.6+0xc1/0xd0
[ 29.381279] kmem_cache_alloc+0xc0/0x240
[ 29.381289] io_submit_sqe+0x11bc/0x1c70
[ 29.381300] io_ring_submit+0x174/0x3c0
[ 29.381311] __x64_sys_io_uring_enter+0x601/0x780
[ 29.381322] do_syscall_64+0x9f/0x4d0
[ 29.381336] entry_SYSCALL_64_after_hwframe+0x49/0xbe

[ 29.381633] Freed by task 84:
[ 29.382186] save_stack+0x19/0x80
[ 29.382198] __kasan_slab_free+0x11d/0x160
[ 29.382210] kmem_cache_free+0x8c/0x2f0
[ 29.382220] io_put_req+0x22/0x30
[ 29.382230] io_sq_wq_submit_work+0x28b/0xe90
[ 29.382241] process_one_work+0xb59/0x19e0
[ 29.382251] worker_thread+0x8c/0xf40
[ 29.382262] kthread+0x30b/0x3d0
[ 29.382272] ret_from_fork+0x3a/0x50

[ 29.382569] The buggy address belongs to the object at ffff888067172140 which belongs to the cache io_kiocb of size 224
[ 29.384692] The buggy address is located 120 bytes inside of 224-byte region [ffff888067172140, ffff888067172220)
[ 29.386723] The buggy address belongs to the page:
[ 29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0
[ 29.387587] flags: 0x100000000000200(slab)
[ 29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180
[ 29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[ 29.387624] page dumped because: kasan: bad access detected

[ 29.387920] Memory state around the buggy address:
[ 29.388771]  ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[ 29.390062]  ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[ 29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 29.392578]                                         ^
[ 29.393480]  ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
[ 29.394744]  ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 29.396003] ==================================================================
[ 29.397260] Disabling lock debugging due to kernel taint

io_sq_wq_submit_work free and read req again.

Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Cc: linux-block@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: f7b76ac9 ("io_uring: fix counter inc/dec mismatch in async_list")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 26 Jul, 2019 1 commit
-
-
Committed by Jens Axboe
Daniel reports that when testing an http server that uses io_uring to poll for incoming connections, sometimes it hard crashes. This is due to an uninitialized list member for the io_uring request. Normally this doesn't trigger and none of the test cases caught it.

Reported-by: Daniel Kozak <kozzi11@gmail.com>
Tested-by: Daniel Kozak <kozzi11@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 22 Jul, 2019 2 commits
-
-
Committed by Zhengyuan Liu
We are using PAGE_SIZE as the unit to determine if the total len in async_list has exceeded max_pages, which is not fair for smaller io sizes. For example, if we are doing 1k-size io streams, we will never exceed max_pages since len >>= PAGE_SHIFT always gets zero. So use the original byte count to make it more accurate.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
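The truncation described above is easy to reproduce in isolation. The following is a minimal sketch, assuming 4 KiB pages; the helper names are hypothetical and only model the accounting, not the kernel's async_list handling.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Page-unit accounting: each IO is shifted down before being summed,
 * so any IO smaller than a page contributes zero. */
static size_t total_in_pages(size_t io_len, size_t nr_ios)
{
    size_t total = 0;

    assert(io_len > 0);
    for (size_t i = 0; i < nr_ios; i++)
        total += io_len >> MODEL_PAGE_SHIFT;
    return total;
}

/* Byte-unit accounting: sum the original lengths and compare against
 * a limit that is itself expressed in bytes. */
static size_t total_in_bytes(size_t io_len, size_t nr_ios)
{
    size_t total = 0;

    assert(io_len > 0);
    for (size_t i = 0; i < nr_ios; i++)
        total += io_len;
    return total;
}
```

For a stream of one thousand 1 KiB requests, the page-unit total stays at zero, while the byte-unit total reaches 1024000 bytes and can be checked against a byte limit.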
-
Committed by Jens Axboe
Hrvoje reports that when a large fixed buffer is registered and IO is being done to the latter pages of said buffer, the IO submission time is much worse:

reading to the start of the buffer: 11238 ns
reading to the end of the buffer:   1039879 ns

In fact, it's worse by two orders of magnitude. The reason for that is how io_uring figures out how to set up the iov_iter. We point the iter at the first bvec, and then use iov_iter_advance() to fast-forward to the offset within that buffer we need.

However, that is abysmally slow, as it entails iterating the bvecs that we set up as part of buffer registration. There's really no need to use this generic helper, as we know it's a BVEC type iterator, and we also know that each bvec is PAGE_SIZE in size, apart from possibly the first and last. Hence we can just use a shift on the offset to find the right index, and then adjust the iov_iter appropriately. After this fix, the timings are:

reading to the start of the buffer: 10135 ns
reading to the end of the buffer:   1377 ns

Or about a 755x improvement for the tail page.

Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
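The shift-based lookup can be sketched in userspace as below. This is an illustration under assumptions, not the kernel code: segments are modelled as a plain array of lengths, `struct seg_pos` and both function names are hypothetical, and the page size is fixed at 4 KiB.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define MODEL_PAGE_SIZE  ((size_t)1 << MODEL_PAGE_SHIFT)

/* Segment index plus byte offset inside that segment. */
struct seg_pos {
    size_t index;
    size_t offset;
};

/* Generic approach: walk the segments one by one, O(number of segments). */
static struct seg_pos locate_by_walk(const size_t *seg_len, size_t nr_segs,
                                     size_t offset)
{
    struct seg_pos pos = { 0, 0 };

    while (pos.index < nr_segs && offset >= seg_len[pos.index]) {
        offset -= seg_len[pos.index];
        pos.index++;
    }
    pos.offset = offset;
    return pos;
}

/* Shortcut: only the first segment may be shorter than a page, so after
 * skipping it the index is a shift and the remainder is a mask, O(1). */
static struct seg_pos locate_by_shift(size_t first_len, size_t offset)
{
    struct seg_pos pos;

    assert(first_len > 0 && first_len <= MODEL_PAGE_SIZE);
    if (offset < first_len) {
        pos.index = 0;
        pos.offset = offset;
    } else {
        offset -= first_len;
        pos.index = 1 + (offset >> MODEL_PAGE_SHIFT);
        pos.offset = offset & (MODEL_PAGE_SIZE - 1);
    }
    return pos;
}
```

Both functions agree as long as every segment after the first is exactly one page long, which is the property the commit message relies on. The cost of the walk grows with the offset, while the shift version does not.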
-
- 19 Jul, 2019 1 commit
-
-
Committed by Zhengyuan Liu
There is a hang issue while using fio to do some basic tests. The issue can be easily reproduced using the script below:

        while true
        do
                fio --ioengine=io_uring -rw=write -bs=4k -numjobs=1 \
                    -size=1G -iodepth=64 -name=uring --filename=/dev/zero
        done

After several minutes (or more), fio would block at io_uring_enter->io_cqring_wait, waiting for previously committed sqes to be completed, and can't return to user space until we send a SIGTERM to fio. After receiving SIGTERM, fio hangs at io_ring_ctx_wait_and_kill with a backtrace like this:

        [54133.243816] Call Trace:
        [54133.243842] __schedule+0x3a0/0x790
        [54133.243868] schedule+0x38/0xa0
        [54133.243880] schedule_timeout+0x218/0x3b0
        [54133.243891] ? sched_clock+0x9/0x10
        [54133.243903] ? wait_for_completion+0xa3/0x130
        [54133.243916] ? _raw_spin_unlock_irq+0x2c/0x40
        [54133.243930] ? trace_hardirqs_on+0x3f/0xe0
        [54133.243951] wait_for_completion+0xab/0x130
        [54133.243962] ? wake_up_q+0x70/0x70
        [54133.243984] io_ring_ctx_wait_and_kill+0xa0/0x1d0
        [54133.243998] io_uring_release+0x20/0x30
        [54133.244008] __fput+0xcf/0x270
        [54133.244029] ____fput+0xe/0x10
        [54133.244040] task_work_run+0x7f/0xa0
        [54133.244056] do_exit+0x305/0xc40
        [54133.244067] ? get_signal+0x13b/0xbd0
        [54133.244088] do_group_exit+0x50/0xd0
        [54133.244103] get_signal+0x18d/0xbd0
        [54133.244112] ? _raw_spin_unlock_irqrestore+0x36/0x60
        [54133.244142] do_signal+0x34/0x720
        [54133.244171] ? exit_to_usermode_loop+0x7e/0x130
        [54133.244190] exit_to_usermode_loop+0xc0/0x130
        [54133.244209] do_syscall_64+0x16b/0x1d0
        [54133.244221] entry_SYSCALL_64_after_hwframe+0x49/0xbe

The reason is that we had added a req to ctx->pending_async at the very end, but it didn't get a chance to be processed. How could this happen?

        fio#cpu0                                wq#cpu1

        io_add_to_prev_work                     io_sq_wq_submit_work

          atomic_read() <<< 1

                                                  atomic_dec_return() << 1->0
                                                  list_empty();    <<< true;

          list_add_tail()
          atomic_read() << 0 or 1?

As atomic_ops.rst states, atomic_read does not guarantee that the runtime modification by any other thread is visible yet, so we must take care of that with a proper implicit or explicit memory barrier.

This issue was detected with the help of Jackie <liuyun01@kylinos.cn>.

Fixes: 31b51510 ("io_uring: allow workqueue item to handle multiple buffered requests")
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 17 Jul, 2019 1 commit
-
-
Committed by Oleg Nesterov
task->saved_sigmask and ->restore_sigmask are only used in the ret-from-syscall paths. This means that set_user_sigmask() can save ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked was modified.

This way the callers do not need 2 sigset_t's passed to set/restore and restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns into the trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 16 Jul, 2019 2 commits
-
-
Committed by Zhengyuan Liu
We could queue a work item for each req in the defer and link lists without increasing async_list->cnt, so we shouldn't decrease it when exiting from the workqueue either if we didn't process the req in the async list.

Thanks to Jens Axboe <axboe@kernel.dk> for his guidance.

Fixes: 31b51510 ("io_uring: allow workqueue item to handle multiple buffered requests")
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Zhengyuan Liu
sq->cached_sq_head and cq->cached_cq_tail are both unsigned int. If cached_sq_head overflows before cached_cq_tail, then we may miss a barrier req. As cached_cq_tail always follows cached_sq_head, a not-equal (!=) comparison should be enough.

Cc: stable@vger.kernel.org
Fixes: de0617e4 ("io_uring: add support for marking commands as draining")
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
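The wraparound problem can be shown with two 32-bit counters. This is a sketch with hypothetical function names, assuming the deferral test compares a request's sequence number against the completion tail; it is not the kernel's actual check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Ordered comparison: breaks once the submission-side counter wraps
 * while the completion-side counter has not wrapped yet. */
static bool defer_with_greater(uint32_t req_seq, uint32_t cq_tail)
{
    return req_seq > cq_tail;
}

/* Not-equal comparison: correct as long as the completion counter
 * only ever follows the submission counter. */
static bool defer_with_not_equal(uint32_t req_seq, uint32_t cq_tail)
{
    return req_seq != cq_tail;
}

/* The submission counter has wrapped, the completion counter has not. */
static void check_wrapped_case(void)
{
    uint32_t cq_tail = UINT32_MAX - 1;
    uint32_t req_seq = cq_tail + 4; /* unsigned arithmetic wraps to 2 */

    assert(req_seq == 2);
    /* Completions have not caught up, so the request must be deferred. */
    assert(!defer_with_greater(req_seq, cq_tail)); /* barrier missed */
    assert(defer_with_not_equal(req_seq, cq_tail)); /* barrier kept */
}
```

The ordered comparison reports that nothing needs deferring because 2 is numerically smaller than the tail, even though four submissions are still outstanding.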
-
- 10 Jul, 2019 3 commits
-
-
Committed by Jackie Liu
INFO: task syz-executor.5:8634 blocked for more than 143 seconds.
      Not tainted 5.2.0-rc5+ #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor.5 D25632 8634 8224 0x00004004
Call Trace:
 context_switch kernel/sched/core.c:2818 [inline]
 __schedule+0x658/0x9e0 kernel/sched/core.c:3445
 schedule+0x131/0x1d0 kernel/sched/core.c:3509
 schedule_timeout+0x9a/0x2b0 kernel/time/timer.c:1783
 do_wait_for_common+0x35e/0x5a0 kernel/sched/completion.c:83
 __wait_for_common kernel/sched/completion.c:104 [inline]
 wait_for_common kernel/sched/completion.c:115 [inline]
 wait_for_completion+0x47/0x60 kernel/sched/completion.c:136
 kthread_stop+0xb4/0x150 kernel/kthread.c:559
 io_sq_thread_stop fs/io_uring.c:2252 [inline]
 io_finish_async fs/io_uring.c:2259 [inline]
 io_ring_ctx_free fs/io_uring.c:2770 [inline]
 io_ring_ctx_wait_and_kill+0x268/0x880 fs/io_uring.c:2834
 io_uring_release+0x5d/0x70 fs/io_uring.c:2842
 __fput+0x2e4/0x740 fs/file_table.c:280
 ____fput+0x15/0x20 fs/file_table.c:313
 task_work_run+0x17e/0x1b0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:185 [inline]
 exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
 prepare_exit_to_usermode+0x402/0x4f0 arch/x86/entry/common.c:199
 syscall_return_slowpath+0x110/0x440 arch/x86/entry/common.c:279
 do_syscall_64+0x126/0x140 arch/x86/entry/common.c:304
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x412fb1
Code: 80 3b 7c 0f 84 c7 02 00 00 c7 85 d0 00 00 00 00 00 00 00 48 8b 05 cf a6 24 00 49 8b 14 24 41 b9 cb 2a 44 00 48 89 ee 48 89 df <48> 85 c0 4c 0f 45 c8 45 31 c0 31 c9 e8 0e 5b 00 00 85 c0 41 89 c7
RSP: 002b:00007ffe7ee6a180 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000412fb1
RDX: 0000001b2d920000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000001 R08: 00000000f3a3e1f8 R09: 00000000f3a3e1fc
R10: 00007ffe7ee6a260 R11: 0000000000000293 R12: 000000000075c9a0
R13: 000000000075c9a0 R14: 0000000000024c00 R15: 000000000075bf2c

=============================================

The logic is wrong when kthread_park runs ahead of io_sq_thread.

CPU#0                                   CPU#1

io_sq_thread_stop:                      int kthread(void *_create):

kthread_park()
                                        __kthread_parkme(self);  <<< Wrong
kthread_stop()
    << wait for self->exited
    << clear_bit KTHREAD_SHOULD_PARK

                                        ret = threadfn(data);
                                           |
                                           |- io_sq_thread
                                               |- kthread_should_park()  << false
                                               |- schedule() <<< nobody wake up

stuck CPU#0                             stuck CPU#1

So, use a new variable sqo_thread_started to ensure that io_sq_thread runs first, then io_sq_thread_stop.

Reported-by: syzbot+94324416c485d422fe15@syzkaller.appspotmail.com
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
This is done through IORING_OP_RECVMSG. This opcode uses the same sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the msghdr struct in the sqe->addr field as well.

We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't block, and punt to async execution if it would have.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags for the flags argument, and the msghdr struct is passed in the sqe->addr field.

We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't block, and punt to async execution if it would have.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 29 Jun, 2019 2 commits
-
-
Committed by Christoph Hellwig
If we pass pages through an iov_iter we always already have a reference in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take a reference to pages by default for bvec backed iov_iters.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Oleg Nesterov
This is the minimal fix for stable, I'll send cleanups later.

Commit 854a6ed5 ("signal: Add restore_user_sigmask()") introduced the visible change which breaks user-space: a signal temporarily unblocked by set_user_sigmask() can be delivered even if the caller returns success or timeout.

Change restore_user_sigmask() to accept the additional "interrupted" argument which should be used instead of signal_pending() check, and update the callers.

Eric said:

: For clarity. I don't think this is required by posix, or fundamentally to
: remove the races in select. It is what linux has always done and we have
: applications who care so I agree this fix is needed.
:
: Further in any case where the semantic change that this patch rolls back
: (aka where allowing a signal to be delivered and the select like call to
: complete) would be advantage we can do as well if not better by using
: signalfd.
:
: Michael is there any chance we can get this guarantee of the linux
: implementation of pselect and friends clearly documented. The guarantee
: that if the system call completes successfully we are guaranteed that no
: signal that is unblocked by using sigmask will be delivered?

Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
Fixes: 854a6ed5 ("signal: Add restore_user_sigmask()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Eric Wong <e@80x24.org>
Tested-by: Eric Wong <e@80x24.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 24 Jun, 2019 1 commit
-
-
Committed by Jens Axboe
With SQE links, we can create chains of dependent SQEs. One example would be queueing an SQE that's a read from one file descriptor, with the linked SQE being a write to another with the same set of buffers.

An SQE link will not stall the pipeline, it'll just ensure that dependent SQEs aren't issued before the previous link has completed.

Any error at submission or completion time will break the chain of SQEs. For completions, this also includes short reads or writes, as the next SQE could depend on the previous one being fully completed.

Any SQE in a chain that gets canceled due to any of the above errors will get a CQE filled with -ECANCELED as the error value.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 22 Jun, 2019 1 commit
-
-
Committed by Jens Axboe
Stephen reports:

I hit the following General Protection Fault when testing io_uring via the io_uring engine in fio. This was on a VM running 5.2-rc5 and the latest version of fio. The issue occurs for both null_blk and fake NVMe drives. I have not tested bare metal or real NVMe SSDs. The fio script used is given below.

[io_uring]
time_based=1
runtime=60
filename=/dev/nvme2n1 (note /dev/nullb0 also fails)
ioengine=io_uring
bs=4k
rw=readwrite
direct=1
fixedbufs=1
sqthread_poll=1
sqthread_poll_cpu=0

general protection fault: 0000 [#1] SMP PTI
CPU: 0 PID: 872 Comm: io_uring-sq Not tainted 5.2.0-rc5-cpacket-io-uring #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
RIP: 0010:fput_many+0x7/0x90
Code: 01 48 85 ff 74 17 55 48 89 e5 53 48 8b 1f e8 a0 f9 ff ff 48 85 db 48 89 df 75 f0 5b 5d f3 c3 0f 1f 40 00 0f 1f 44 00 00 89 f6 <f0> 48 29 77 38 74 01 c3 55 48 89 e5 53 48 89 fb 65 48 \
RSP: 0018:ffffadeb817ebc50 EFLAGS: 00010246
RAX: 0000000000000004 RBX: ffff8f46ad477480 RCX: 0000000000001805
RDX: 0000000000000000 RSI: 0000000000000001 RDI: f18b51b9a39552b5
RBP: ffffadeb817ebc58 R08: ffff8f46b7a318c0 R09: 000000000000015d
R10: ffffadeb817ebce8 R11: 0000000000000020 R12: ffff8f46ad4cd000
R13: 00000000fffffff7 R14: ffffadeb817ebe30 R15: 0000000000000004
FS: 0000000000000000(0000) GS:ffff8f46b7a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055828f0bbbf0 CR3: 0000000232176004 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ? fput+0x13/0x20
 io_free_req+0x20/0x40
 io_put_req+0x1b/0x20
 io_submit_sqe+0x40a/0x680
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 io_submit_sqes+0xb9/0x160
 ? io_submit_sqes+0xb9/0x160
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __schedule+0x3f2/0x6a0
 ? __switch_to_asm+0x34/0x70
 io_sq_thread+0x1af/0x470
 ? __switch_to_asm+0x34/0x70
 ? wait_woken+0x80/0x80
 ? __switch_to+0x85/0x410
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __schedule+0x3f2/0x6a0
 kthread+0x105/0x140
 ? io_submit_sqes+0x160/0x160
 ? kthread+0x105/0x140
 ? io_submit_sqes+0x160/0x160
 ? kthread_destroy_worker+0x50/0x50
 ret_from_fork+0x35/0x40

which occurs because using a kernel side submission thread isn't valid without using fixed files (registered through io_uring_register()). This causes io_uring to put the request after logging an error, but before the file field is set in the request. If it happens to be non-zero, we attempt to fput() garbage.

Fix this by ensuring that req->file is initialized when the request is allocated.

Cc: stable@vger.kernel.org # 5.1+
Reported-by: Stephen Bates <sbates@raithlin.com>
Tested-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-