- 18 October 2019 (2 commits)
-
-
Submitted by yangerkun

If ctx->cached_sq_head < nxt_sq_head, we should add UINT_MAX to tmp, not tmp_nxt.

Fixes: 5da0fb1a ("io_uring: consider the overflow of sequence for timeout req")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
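A minimal sketch of the wrap-aware comparison this fix is about, assuming 32-bit sequences widened to 64 bits for the compare; the variable names follow the commit message and the function boundary is illustrative, not the kernel's actual io_timeout() code:

```c
/*
 * Widen both sequences to 64 bits; if the current SQ head has wrapped
 * relative to the existing entry's head, push tmp (not tmp_nxt) up by
 * UINT_MAX so the comparison still orders the two requests correctly.
 */
static bool timeout_after(unsigned int cached_sq_head, unsigned int nxt_sq_head,
			  unsigned int seq, unsigned int nxt_seq)
{
	long long tmp = seq;
	long long tmp_nxt = nxt_seq;

	if (cached_sq_head < nxt_sq_head)
		tmp += UINT_MAX;

	return tmp > tmp_nxt;	/* new timeout sorts after the existing one */
}
```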
-
Submitted by Jens Axboe

We've got two issues with the non-regular file handling for non-blocking IO:

1) We don't want to re-do a short read in full for a non-regular file, as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts, we need to punt to async context even if the file is opened as non-blocking. Otherwise the caller always gets -EAGAIN.

Add two new request flags to handle these cases. One is just a cache of the inode S_ISREG() status, the other tells io_uring that we always need to punt this request to async context, even if REQ_F_NOWAIT is set.

Cc: stable@vger.kernel.org
Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
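A hedged sketch of the shape those two flags take. The flag names, bit positions, and the "doesn't support non-blocking attempts" predicate below are simplified placeholders, not a copy of the fs/io_uring.c change:

```c
/* Illustrative flag bits; the real values live in fs/io_uring.c. */
#define REQ_F_ISREG	(1U << 9)	/* cached S_ISREG() result for the file */
#define REQ_F_MUST_PUNT	(1U << 10)	/* always punt to async, even with REQ_F_NOWAIT */

static void io_note_file_type(struct io_kiocb *req, struct file *file)
{
	umode_t mode = file_inode(file)->i_mode;

	if (S_ISREG(mode))
		req->flags |= REQ_F_ISREG;

	/*
	 * Simplified stand-in for "this file can't service a non-blocking
	 * attempt at all": anything that is neither a regular file nor a
	 * block/char device gets punted to async context unconditionally.
	 */
	if (!S_ISREG(mode) && !S_ISBLK(mode) && !S_ISCHR(mode))
		req->flags |= REQ_F_MUST_PUNT;
}
```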
-
- 15 October 2019 (1 commit)
-
-
Submitted by yangerkun

We currently recalculate the sequence of a timeout with 'req->sequence = ctx->cached_sq_head + count - 1', and decide where to insert it in timeout_list by comparing the number of requests still expected to complete. But this does not account for overflow:

1. ctx->cached_sq_head + count - 1 may overflow, so a larger count for a new timeout req can yield a smaller req->sequence.
2. The current cached_sq_head may have overflowed relative to an earlier request, which again leaves the new timeout req with a small req->sequence.

Such overflow misorders timeout_list, which can in turn complete the timeouts in the wrong order. Fix it by reusing req->submit.sequence to store the count, and by changing the insertion-sort logic in io_timeout().

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 October 2019 (1 commit)
-
-
Submitted by Jens Axboe

We have two ways a request can be deferred:

1) It's a regular request that depends on another one
2) It's a timeout that tracks completions

We have a shared helper to determine whether to defer, and that attempts to make the right decision based on the request. But we only have some of this information in the caller. Un-share the two timeout/defer helpers so the caller can use the right one.

Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
Reported-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 10 October 2019 (1 commit)
-
-
Submitted by Jens Axboe

We should not remove the workqueue, we just need to ensure that the workqueues are synced. The workqueues are torn down on ctx removal.

Cc: stable@vger.kernel.org
Fixes: 6b06314c ("io_uring: add file set registration")
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 08 October 2019 (1 commit)
-
-
Submitted by Pavel Begunkov

Any changes interesting to tasks waiting in io_cqring_wait() are committed with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs() also tries to do that, but for no reason, which means spurious wakeups on every io_free_req() and io_uring_enter(). Just use percpu_ref_put() instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 04 October 2019 (1 commit)
-
-
Submitted by Pavel Begunkov

io_queue_link_head() accepts a @force_nonblock flag, but io_ring_submit() passes the opposite value.

Fixes: c5766668 ("io_uring: optimize submit_and_wait API")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 01 October 2019 (1 commit)
-
-
Submitted by Arnd Bergmann

All system calls use struct __kernel_timespec instead of the old struct timespec, but this one was just added with the old-style ABI. Change it now to enforce the use of __kernel_timespec, avoiding ABI confusion and the need for compat handlers on 32-bit architectures.

Any user space caller will have to use __kernel_timespec now, but this is unambiguous and works for any C library regardless of the time_t definition. A nicer way to specify the timeout would have been a less ambiguous 64-bit nanosecond value, but I suppose it's too late now to change that, as this would impact both 32-bit and 64-bit users.

Fixes: 5262f567 ("io_uring: IORING_OP_TIMEOUT support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
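For reference, struct __kernel_timespec uses a 64-bit seconds field on every architecture, which is what makes it safe regardless of the C library's time_t; the layout below matches the UAPI header:

```c
/* include/uapi/linux/time_types.h */
struct __kernel_timespec {
	__kernel_time64_t	tv_sec;		/* seconds, always 64-bit */
	long long		tv_nsec;	/* nanoseconds */
};
```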
-
- 26 September 2019 (1 commit)
-
-
Submitted by Jens Axboe

For batched IO, it's not uncommon for waiters to ask for more than 1 IO to complete before being woken up. This is a problem with wait_event(), since tasks will get woken for every IO that completes, re-check the condition, then go back to sleep. For batch counts on the order of what you do for high IOPS, that can result in 10s of extra wakeups for the waiting task.

Add a private wake function that checks for the wake-up count criteria being met before calling autoremove_wake_function(). Pavel reports that one test case he has runs 40% faster with proper batching of wakeups.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
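A hedged sketch of the shape such a private wake function takes: a wait-queue entry embedded in a small context struct, and a callback that only defers to autoremove_wake_function() once the batch target is met. The struct and field names are illustrative, not the exact io_uring symbols:

```c
struct batch_waiter {
	struct wait_queue_entry	wq;		/* embedded wait-queue entry */
	unsigned int		target;		/* completions wanted before waking */
	atomic_t		*completed;	/* shared completion counter (illustrative) */
};

static inline bool batch_target_met(struct batch_waiter *bw)
{
	return atomic_read(bw->completed) >= bw->target;
}

static int batch_wake_function(struct wait_queue_entry *curr, unsigned int mode,
			       int wake_flags, void *key)
{
	struct batch_waiter *bw = container_of(curr, struct batch_waiter, wq);

	/* Not enough completions yet: stay on the wait queue, skip the wakeup. */
	if (!batch_target_met(bw))
		return -1;

	/* Target reached: wake the task and remove it from the queue. */
	return autoremove_wake_function(curr, mode, wake_flags, key);
}
```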
-
- 25 September 2019 (1 commit)
-
-
Submitted by Matthew Wilcox (Oracle)

Patch series "Make working with compound pages easier", v2. These three patches add three helpers and convert the appropriate places to use them.

This patch (of 3): It's unnecessarily hard to find out the size of a potentially huge page. Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).

Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 24 September 2019 (2 commits)
-
-
Submitted by yangerkun

After 75b28aff ("io_uring: allocate the two rings together"), we compare sq.head with cached_cq_tail to determine whether there are any invalid cq entries. We should actually use cq.head.

Fixes: 75b28aff ("io_uring: allocate the two rings together")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jens Axboe

We currently just return -EINVAL for a read or write to an fd that isn't backed by ->read_iter() or ->write_iter(). But we can handle them just fine, as long as we punt to async context first. Implement a simple loop function for doing ->read() or ->write() instead, and ensure we call it appropriately.

Reported-by: 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
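A hedged sketch of what such a loop looks like: walk the iterator segment by segment and feed each one to the file's classic ->read()/->write() hook. The function name and the short-IO/error handling here are simplified stand-ins for the real helper:

```c
static ssize_t loop_rw(struct kiocb *kiocb, struct iov_iter *iter, bool write)
{
	struct file *file = kiocb->ki_filp;
	ssize_t ret = 0;

	while (iov_iter_count(iter)) {
		struct iovec iovec = iov_iter_iovec(iter);
		ssize_t nr;

		if (write)
			nr = file->f_op->write(file, iovec.iov_base,
					       iovec.iov_len, &kiocb->ki_pos);
		else
			nr = file->f_op->read(file, iovec.iov_base,
					      iovec.iov_len, &kiocb->ki_pos);

		if (nr < 0) {
			if (!ret)
				ret = nr;
			break;
		}
		ret += nr;
		if (nr != iovec.iov_len)
			break;		/* short IO: stop and report what we got */
		iov_iter_advance(iter, nr);
	}

	return ret;
}
```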
-
- 19 September 2019 (6 commits)
-
-
Submitted by Jens Axboe

There have been a few requests for functionality similar to io_getevents() and epoll_wait(), where the user can specify a timeout for waiting on events. I deliberately did not add support for this through the system call initially to avoid overloading its arguments, but I can see that the use cases for it are valid.

This adds support for IORING_OP_TIMEOUT. If a user wants to get woken when waiting for events, simply submit one of these timeout commands with your wait call (or before it). This ensures that the application sleeping on the CQ ring waiting for events will get woken. The timeout command is passed in as a pointer to a struct timespec, and timeouts are relative. The timeout command also includes a way to auto-cancel after N events have passed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
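A hedged userspace example using liburing's io_uring_prep_timeout() helper, which wraps IORING_OP_TIMEOUT: the timeout fires after 100 ms, or cancels itself once 8 completions have posted, whichever comes first. Ring setup is assumed and error handling is trimmed:

```c
#include <errno.h>
#include <liburing.h>

int wait_batch_or_timeout(struct io_uring *ring)
{
	struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;

	if (!sqe)
		return -EBUSY;

	/* Relative timeout; count = 8 auto-cancels once 8 CQEs have posted. */
	io_uring_prep_timeout(sqe, &ts, 8, 0);
	io_uring_submit(ring);

	/* Woken either by real completions or by the timeout's own CQE. */
	return io_uring_wait_cqe(ring, &cqe);
}
```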
-
Submitted by Jens Axboe

If preempt isn't enabled in the kernel, we can run into hang issues with sqthread submissions. Use cond_resched() to play nice instead of cpu_relax(), if we end up starting the loop and not having any events pending for submission.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jackie Liu

Sometimes io_get_req() will return NULL; we then need to do the correct error handling, otherwise we trigger a kernel NULL pointer dereference.

Fixes: 4fe2c963 ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jens Axboe

If we end up getting woken in poll (due to a signal), then we may need to punt the poll request to an async worker. When we do that, we look up the list to queue at, dereferencing req->submit.sqe; however, that is only set for requests we initially decided to queue async. This fixes a crash with poll command usage and wakeups that need to punt to async context.

Fixes: 54a91f3b ("io_uring: limit parallelism of buffered writes")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jackie Liu

There is a potential dangling pointer problem: we never clear shadow_req, so if there are multiple link lists in this series of sqes, shadow_req is not reallocated and we keep using the last one. But its memory has already been released, leaving a dangling pointer. Clear it and make sure that every new link list allocates a new shadow_req.

Fixes: 4fe2c963 ("io_uring: add support for link with drain")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jackie Liu

Just clean up the code, no functional changes.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 15 September 2019 (1 commit)
-
-
Submitted by Daniel Xu

Some workloads can require far more than 4K outstanding entries. For example, memcached can have ~300K sockets over ~40 cores. Bumping the max to 32K seems to work pretty well.

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 13 September 2019 (2 commits)
-
-
Submitted by Jens Axboe

The way the logic is set up in io_uring_enter() means that you can't wake up the SQ poller thread while at the same time waiting (or polling) for completions afterwards. There's no reason for that to be the case.

Reported-by: Lewis Baker <lbaker@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
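After this change, a single io_uring_enter() call can both kick the SQPOLL thread and block for completions. A hedged example using the raw syscall; the thin wrapper is illustrative, and the flag names come from the UAPI header:

```c
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Wake the SQPOLL thread and wait for completions in one system call. */
static int kick_and_wait(int ring_fd, unsigned int min_complete)
{
	return syscall(__NR_io_uring_enter, ring_fd,
		       0 /* to_submit: SQPOLL thread picks up the SQEs */,
		       min_complete,
		       IORING_ENTER_SQ_WAKEUP | IORING_ENTER_GETEVENTS,
		       NULL /* sigmask */, 0 /* sigmask size */);
}
```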
-
Submitted by Jens Axboe

We currently merge async work items if we see a strict sequential hit. This helps avoid unnecessary workqueue switches when we don't need them. We can extend this merging to cover cases where it's not a strict sequential hit, but the IO still fits within the same page. If an application is doing multiple requests within the same page, we don't want separate workers waiting on the same page to complete IO. It's much faster to let the first worker bring in the page, then operate on that page from the same worker to complete the next request(s).

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 10 September 2019 (5 commits)
-
-
Submitted by Jens Axboe

All the popular filesystems need to grab the inode lock for buffered writes. With io_uring punting buffered writes to async context, we observe a lot of contention with all workers hammering this mutex.

For buffered writes, we generally don't need a lot of parallelism on the submission side, as the flushing will take care of that for us. Hence we don't need a deep queue on the write side, as long as we can safely punt from the original submission context. Add a workqueue with a limit of 2 that we can use for buffered writes. This greatly improves the performance and efficiency of higher-queue-depth buffered async writes with io_uring.

Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
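The concurrency cap comes from the max_active argument of alloc_workqueue(). A hedged sketch of allocating such a write-side queue next to the regular one; the member names, queue names, and flags are illustrative rather than the exact io_uring code:

```c
static int io_init_workqueues(struct io_ring_ctx *ctx)
{
	/* Regular async queue: effectively unbounded concurrency. */
	ctx->sqo_wq[0] = alloc_workqueue("io_ring-wq",
					 WQ_UNBOUND | WQ_FREEZABLE, 0);

	/*
	 * Buffered-write queue: cap concurrency at 2 so workers don't pile
	 * up on the inode lock; writeback provides the real parallelism.
	 */
	ctx->sqo_wq[1] = alloc_workqueue("io_ring-write-wq",
					 WQ_UNBOUND | WQ_FREEZABLE, 2);

	if (!ctx->sqo_wq[0] || !ctx->sqo_wq[1])
		return -ENOMEM;
	return 0;
}
```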
-
Submitted by Jens Axboe

Add a helper for queueing a request for async execution, in preparation for optimizing it. No functional change in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jens Axboe

For some applications that end up using a submit-and-wait type of approach for certain batches of IO, we can make that a bit more efficient by allowing the application to block for the last IO submission. This prevents an async punt when we don't need it, as the application will be blocking for the completion event(s) anyway.

Typical use cases are the liburing io_uring_submit_and_wait() API, or just io_uring_enter() doing both submissions and completions. As a specific example, RocksDB doing MultiGet() is sped up quite a bit with this change.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
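A hedged liburing example of the pattern this optimizes: queue a small batch and block for its completions in one call. Ring setup and error handling are trimmed:

```c
#include <liburing.h>

/* Submit the SQEs already prepared on the ring, then wait for all of them. */
static int submit_batch_and_wait(struct io_uring *ring, unsigned int nr)
{
	struct io_uring_cqe *cqe;
	unsigned int i;
	int ret;

	/* One syscall: submits the pending SQEs and waits for nr completions. */
	ret = io_uring_submit_and_wait(ring, nr);
	if (ret < 0)
		return ret;

	for (i = 0; i < nr; i++) {
		if (io_uring_wait_cqe(ring, &cqe))	/* CQEs are ready; won't block */
			break;
		io_uring_cqe_seen(ring, cqe);
	}
	return 0;
}
```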
-
Submitted by Jackie Liu

To support link with drain, we need to do two things. Given these sqes:

       0     1     2     3     4     5     6
    +-----+-----+-----+-----+-----+-----+-----+
    |  N  |  L  |  L  | L+D |  N  |  N  |  N  |
    +-----+-----+-----+-----+-----+-----+-----+

First, we need to ensure that the IO before the link is completed. An easy way is to set the drain flag on the link list's head, so all subsequent IO will be inserted into the defer_list:

        +-----+
    (0) |  N  |
        +-----+
           |          (2)         (3)         (4)
        +-----+     +-----+     +-----+     +-----+
    (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
        +-----+     +-----+     +-----+     +-----+
           |
        +-----+
    (5) |  N  |
        +-----+
           |
        +-----+
    (6) |  N  |
        +-----+

Second, ensure that the following IO will not be completed first. An easy way is to create a mirror of the drain IO and insert it into the defer_list; as long as the drain IO has not been processed, the following IO in the defer_list will not be actively processed:

        +-----+
    (0) |  N  |
        +-----+
           |          (2)         (3)         (4)
        +-----+     +-----+     +-----+     +-----+
    (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
        +-----+     +-----+     +-----+     +-----+
           |
        +-----+
   ('3) |  D  | <== This is a shadow of (3)
        +-----+
           |
        +-----+
    (5) |  N  |
        +-----+
           |
        +-----+
    (6) |  N  |
        +-----+

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jackie Liu

sqo_thread fetches sqring entries in batches, which advances ctx->cached_sq_head in batches. If one of these sqes has the DRAIN flag set, it never gets a chance to be processed, and in the end sqo_thread will not exit.

Fixes: de0617e4 ("io_uring: add support for marking commands as draining")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 07 September 2019 (1 commit)
-
-
Submitted by Jens Axboe

After commit 75b28aff we can get by with just a single mmap to map both the sq and cq ring. However, userspace doesn't know that. Add a features variable to io_uring_params, and notify userspace that the kernel has this ability. This can then be used in liburing (or in applications directly) to avoid the second mmap.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
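A hedged userspace sketch of how the feature bit is consumed during ring setup: check io_uring_params.features after io_uring_setup() and skip the second mmap when IORING_FEAT_SINGLE_MMAP is advertised. The raw-syscall wrapper is illustrative and error handling is trimmed:

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

void *map_rings(unsigned int entries)
{
	struct io_uring_params p;
	size_t sq_sz, cq_sz;
	void *sq_ptr, *cq_ptr;
	int fd;

	memset(&p, 0, sizeof(p));
	fd = syscall(__NR_io_uring_setup, entries, &p);

	sq_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned int);
	cq_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);

	/* With a single shared mapping, size it for whichever ring is larger. */
	if (p.features & IORING_FEAT_SINGLE_MMAP && cq_sz > sq_sz)
		sq_sz = cq_sz;

	sq_ptr = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE,
		      MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);

	if (p.features & IORING_FEAT_SINGLE_MMAP)
		cq_ptr = sq_ptr;	/* both rings live in the one mapping */
	else
		cq_ptr = mmap(NULL, cq_sz, PROT_READ | PROT_WRITE,
			      MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);

	(void)cq_ptr;	/* ring pointer setup continues from here */
	return sq_ptr;
}
```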
-
- 28 August 2019 (2 commits)
-
-
Submitted by Hristo Venev

Both the sq and the cq rings have sizes just over a power of two, and the sq ring is significantly smaller. By bundling them in a single allocation, we get the sq ring for free.

This also means that IORING_OFF_SQ_RING and IORING_OFF_CQ_RING now mean the same thing. If we indicate this to userspace, we can save an mmap call.

Signed-off-by: Hristo Venev <hristo@venev.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by John Hubbard

For pages that were retained via get_user_pages*(), release those pages via the new put_user_page*() routines, instead of via put_page() or release_pages(). This is part of a tree-wide conversion, as described in commit fc1d8e7c ("mm: introduce put_user_page*(), placeholder versions").

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
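The shape of such a conversion, as a hedged sketch over a pinned-buffer bvec array; the struct and its fields stand in for io_uring's registered-buffer bookkeeping and are not copied from the source:

```c
#include <linux/mm.h>
#include <linux/bvec.h>

/* Illustrative stand-in for io_uring's per-buffer bookkeeping. */
struct mapped_ubuf {
	struct bio_vec	*bvec;
	unsigned int	nr_bvecs;
};

static void unpin_registered_buffer(struct mapped_ubuf *imu)
{
	unsigned int j;

	for (j = 0; j < imu->nr_bvecs; j++)
		/* Was put_page(); GUP-pinned pages now go back via put_user_page(). */
		put_user_page(imu->bvec[j].bv_page);
}
```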
-
- 23 August 2019 (1 commit)
-
-
Submitted by Jens Axboe

The outer poll loop checks whether we need to reschedule, and returns to userspace if we do. However, it's possible to get stuck in the inner loop as well, if the CPU we are running on needs to reschedule to finish the IO work. Add the need_resched() check in the inner loop as well. This fixes a potential hang if the kernel is configured with CONFIG_PREEMPT_VOLUNTARY=y.

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 August 2019 (2 commits)
-
-
Submitted by Jens Axboe

We need to check whether we have CQEs pending before starting a poll loop, as those could be the very events we will be spinning for (and hence we'll find none). This can happen if a CQE triggers an error, or if it is found by e.g. an IRQ before we get a chance to find it through polling.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jens Axboe

If a request issue ends up being punted to async context to avoid blocking, we can get into a situation where the original application enters the poll loop for that very request before it has been issued. This should not be an issue, except that the polling will hold the io_uring uring_ctx mutex for the duration of the poll. When the async worker has actually issued the request, it needs to acquire this mutex to add the request to the poll issued list. Since the application polling is already holding this mutex, the workqueue sleeps on the mutex forever, and the application thus never gets a chance to poll for the very request it was interested in.

Fix this by ensuring that the polling drops the uring_ctx mutex occasionally if it's not making any progress.

Reported-by: Jeffrey M. Birnbaum <jmbnyc@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 16 August 2019 (2 commits)
-
-
Submitted by Jackie Liu

This patch may fix two issues:

First, when IOSQE_IO_DRAIN is set, the subsequent IOs need to be inserted into the defer_list to delay execution, but a linked IO will still be actively scheduled to run by calling io_queue_sqe.

Second, when multiple LINK_IOs are inserted together with the defer_list, the LINK_IOs no longer keep their order:

    |-------------|
    |   LINK_IO   |  ----> insert to defer_list  -----------
    |-------------|                                         |
    |   LINK_IO   |  ----> insert to defer_list  ----------|
    |-------------|                                         |
    |   LINK_IO   |  ----> insert to defer_list  ----------|
    |-------------|                                         |
    |  NORMAL_IO  |  ----> insert to defer_list  ----------|
    |-------------|                                         |
                                                            |
                         queue_work at same time      <-----|

Fixes: 9e645e11 ("io_uring: add support for sqe links")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Aleix Roca Nonell

Commit bd11b3a3 ("io_uring: don't use iov_iter_advance() for fixed buffers") introduced an optimization to avoid using the slow iov_iter_advance() by manually populating the iov_iter iterator in some cases. However, the computation of the iterator count field was erroneous: the first bvec was always accounted for an extent of page size even if the bvec length was smaller.

In consequence, some I/O operations on fixed buffers were unable to operate on the full extent of the buffer, consistently skipping some bytes at the end of it.

Fixes: bd11b3a3 ("io_uring: don't use iov_iter_advance() for fixed buffers")
Cc: stable@vger.kernel.org
Signed-off-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 31 July 2019 (1 commit)
-
-
Submitted by Jackie Liu

[root@localhost ~]# ./liburing/test/link

QEMU Standard PC reports:

    [   29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622-dirty #86
    [   29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    [   29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work
    [   29.379929] Call Trace:
    [   29.379953]  dump_stack+0xa9/0x10e
    [   29.379970]  ? io_sq_wq_submit_work+0xbf4/0xe90
    [   29.379986]  print_address_description.cold.6+0x9/0x317
    [   29.379999]  ? io_sq_wq_submit_work+0xbf4/0xe90
    [   29.380010]  ? io_sq_wq_submit_work+0xbf4/0xe90
    [   29.380026]  __kasan_report.cold.7+0x1a/0x34
    [   29.380044]  ? io_sq_wq_submit_work+0xbf4/0xe90
    [   29.380061]  kasan_report+0xe/0x12
    [   29.380076]  io_sq_wq_submit_work+0xbf4/0xe90
    [   29.380104]  ? io_sq_thread+0xaf0/0xaf0
    [   29.380152]  process_one_work+0xb59/0x19e0
    [   29.380184]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
    [   29.380221]  worker_thread+0x8c/0xf40
    [   29.380248]  ? __kthread_parkme+0xab/0x110
    [   29.380265]  ? process_one_work+0x19e0/0x19e0
    [   29.380278]  kthread+0x30b/0x3d0
    [   29.380292]  ? kthread_create_on_node+0xe0/0xe0
    [   29.380311]  ret_from_fork+0x3a/0x50

    [   29.380635] Allocated by task 209:
    [   29.381255]  save_stack+0x19/0x80
    [   29.381268]  __kasan_kmalloc.constprop.6+0xc1/0xd0
    [   29.381279]  kmem_cache_alloc+0xc0/0x240
    [   29.381289]  io_submit_sqe+0x11bc/0x1c70
    [   29.381300]  io_ring_submit+0x174/0x3c0
    [   29.381311]  __x64_sys_io_uring_enter+0x601/0x780
    [   29.381322]  do_syscall_64+0x9f/0x4d0
    [   29.381336]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

    [   29.381633] Freed by task 84:
    [   29.382186]  save_stack+0x19/0x80
    [   29.382198]  __kasan_slab_free+0x11d/0x160
    [   29.382210]  kmem_cache_free+0x8c/0x2f0
    [   29.382220]  io_put_req+0x22/0x30
    [   29.382230]  io_sq_wq_submit_work+0x28b/0xe90
    [   29.382241]  process_one_work+0xb59/0x19e0
    [   29.382251]  worker_thread+0x8c/0xf40
    [   29.382262]  kthread+0x30b/0x3d0
    [   29.382272]  ret_from_fork+0x3a/0x50

    [   29.382569] The buggy address belongs to the object at ffff888067172140 which belongs to the cache io_kiocb of size 224
    [   29.384692] The buggy address is located 120 bytes inside of 224-byte region [ffff888067172140, ffff888067172220)
    [   29.386723] The buggy address belongs to the page:
    [   29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0
    [   29.387587] flags: 0x100000000000200(slab)
    [   29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180
    [   29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
    [   29.387624] page dumped because: kasan: bad access detected

    [   29.387920] Memory state around the buggy address:
    [   29.388771]  ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
    [   29.390062]  ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    [   29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [   29.392578]                    ^
    [   29.393480]  ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
    [   29.394744]  ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [   29.396003] ==================================================================
    [   29.397260] Disabling lock debugging due to kernel taint

io_sq_wq_submit_work frees the request and then reads it again.

Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Cc: linux-block@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: f7b76ac9 ("io_uring: fix counter inc/dec mismatch in async_list")
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 26 July 2019 (1 commit)
-
-
Submitted by Jens Axboe

Daniel reports that when testing an http server that uses io_uring to poll for incoming connections, it sometimes hard crashes. This is due to an uninitialized list member for the io_uring request. Normally this doesn't trigger, and none of the test cases caught it.

Reported-by: Daniel Kozak <kozzi11@gmail.com>
Tested-by: Daniel Kozak <kozzi11@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 22 July 2019 (2 commits)
-
-
Submitted by Zhengyuan Liu

We are using PAGE_SIZE as the unit to determine whether the total len in the async_list has exceeded max_pages, which is unfair to smaller IO sizes. For example, with 1k-size IO streams we will never exceed max_pages, since len >>= PAGE_SHIFT always yields zero. So use the original byte count to make it more accurate.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jens Axboe

Hrvoje reports that when a large fixed buffer is registered and IO is being done to the latter pages of said buffer, the IO submission time is much worse:

    reading to the start of the buffer: 11238 ns
    reading to the end of the buffer:   1039879 ns

In fact, it's worse by two orders of magnitude. The reason for that is how io_uring figures out how to set up the iov_iter. We point the iter at the first bvec, and then use iov_iter_advance() to fast-forward to the offset within that buffer we need. However, that is abysmally slow, as it entails iterating the bvecs that we set up as part of buffer registration.

There's really no need to use this generic helper, as we know it's a BVEC type iterator, and we also know that each bvec is PAGE_SIZE in size, apart from possibly the first and last. Hence we can just use a shift on the offset to find the right index, and then adjust the iov_iter appropriately. After this fix, the timings are:

    reading to the start of the buffer: 10135 ns
    reading to the end of the buffer:   1377 ns

Or about a 755x improvement for the tail page.

Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
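A hedged sketch of the shift-based fast-forward the commit describes, assuming a BVEC iterator whose segments are all PAGE_SIZE except possibly the first and last; the function boundary and field handling are simplified relative to the real code:

```c
/* Skip `offset` bytes into a registered (bvec-backed) buffer. */
static void bvec_iter_fast_forward(struct iov_iter *iter,
				   const struct bio_vec *bvec, size_t offset)
{
	if (offset <= bvec->bv_len) {
		/* Still within the first segment: the generic path is fine. */
		iov_iter_advance(iter, offset);
	} else {
		unsigned long seg_skip;

		/* Account for the (possibly short) first segment... */
		offset -= bvec->bv_len;
		/* ...then every following full page is exactly one segment. */
		seg_skip = 1 + (offset >> PAGE_SHIFT);

		iter->bvec = bvec + seg_skip;
		iter->nr_segs -= seg_skip;
		iter->count -= bvec->bv_len + offset;
		iter->iov_offset = offset & ~PAGE_MASK;
	}
}
```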
-
- 19 July 2019 (1 commit)
-
-
Submitted by Zhengyuan Liu

There is a hang issue while using fio to do some basic testing. The issue can be easily reproduced using the script below:

    while true
    do
        fio --ioengine=io_uring -rw=write -bs=4k -numjobs=1 \
            -size=1G -iodepth=64 -name=uring --filename=/dev/zero
    done

After several minutes (or more), fio blocks at io_uring_enter->io_cqring_wait, waiting for previously committed sqes to be completed, and can't return to userspace until we send it a SIGTERM. After receiving SIGTERM, fio hangs at io_ring_ctx_wait_and_kill with a backtrace like this:

    [54133.243816] Call Trace:
    [54133.243842]  __schedule+0x3a0/0x790
    [54133.243868]  schedule+0x38/0xa0
    [54133.243880]  schedule_timeout+0x218/0x3b0
    [54133.243891]  ? sched_clock+0x9/0x10
    [54133.243903]  ? wait_for_completion+0xa3/0x130
    [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
    [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
    [54133.243951]  wait_for_completion+0xab/0x130
    [54133.243962]  ? wake_up_q+0x70/0x70
    [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
    [54133.243998]  io_uring_release+0x20/0x30
    [54133.244008]  __fput+0xcf/0x270
    [54133.244029]  ____fput+0xe/0x10
    [54133.244040]  task_work_run+0x7f/0xa0
    [54133.244056]  do_exit+0x305/0xc40
    [54133.244067]  ? get_signal+0x13b/0xbd0
    [54133.244088]  do_group_exit+0x50/0xd0
    [54133.244103]  get_signal+0x18d/0xbd0
    [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
    [54133.244142]  do_signal+0x34/0x720
    [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
    [54133.244190]  exit_to_usermode_loop+0xc0/0x130
    [54133.244209]  do_syscall_64+0x16b/0x1d0
    [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The reason is that we had added a req to ctx->pending_async at the very end, but it didn't get a chance to be processed. How could this happen?

    fio#cpu0                            wq#cpu1

    io_add_to_prev_work                 io_sq_wq_submit_work

      atomic_read() <<< 1

                                          atomic_dec_return() << 1->0
                                          list_empty();        <<< true

      list_add_tail()
      atomic_read() << 0 or 1?

As atomic_ops.rst states, atomic_read() does not guarantee that a runtime modification by another thread is visible yet, so we must take care of that with a proper implicit or explicit memory barrier.

This issue was detected with the help of Jackie Liu <liuyun01@kylinos.cn>.

Fixes: 31b51510 ("io_uring: allow workqueue item to handle multiple buffered requests")
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
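A hedged sketch of the barrier pairing the race diagram calls for: make the list insertion visible before re-reading the worker count on the producer side, and rely on the fully ordered decrement before the list check on the consumer side. The struct and function names follow the diagram above, not the exact io_uring source, and the surrounding retry logic is simplified:

```c
struct async_list {
	spinlock_t		lock;
	atomic_t		cnt;	/* live workers servicing this list */
	struct list_head	list;
};

/* Producer: try to hand the request to an already-running worker. */
static bool add_to_prev_work(struct async_list *al, struct list_head *req)
{
	spin_lock(&al->lock);
	list_add_tail(req, &al->list);
	spin_unlock(&al->lock);

	/* Make the insertion visible before re-reading the worker count. */
	smp_mb();
	return atomic_read(&al->cnt) != 0;	/* false: caller must queue work itself */
}

/* Consumer: a worker deciding whether it may exit. */
static bool worker_may_exit(struct async_list *al)
{
	/*
	 * Value-returning atomics are fully ordered, so the list check
	 * below cannot be reordered before the decrement.
	 */
	if (atomic_dec_return(&al->cnt) != 0)
		return false;

	return list_empty(&al->list);	/* non-empty: pick that work back up */
}
```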
-
- 17 July 2019 (1 commit)
-
-
Submitted by Oleg Nesterov

task->saved_sigmask and ->restore_sigmask are only used in the ret-from-syscall paths. This means that set_user_sigmask() can save ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked was modified.

This way the callers do not need two sigset_t's passed to set/restore, and restore_user_sigmask(), renamed to restore_saved_sigmask_unless(), turns into a trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-