- 18 3月, 2020 40 次提交
-
-
由 Jackie Liu 提交于
commit 5f5ad9ced33621d353be6429c3900f8a526fcae8 upstream. There is a potential dangling pointer problem. we never clean shadow_req, if there are multiple link lists in this series of sqes, then the shadow_req will not reallocate, and continue to use the last one. but in the previous, his memory has been released, thus forming a dangling pointer. let's clean up him and make sure that every new link list can reapply for a new shadow_req. Fixes: 4fe2c963154c ("io_uring: add support for link with drain") Signed-off-by: NJackie Liu <liuyun01@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jackie Liu 提交于
commit 954dab193d19cbbff8f83b58c9360bf00ddb273c upstream. Just clean up the code, no function changes. Signed-off-by: NJackie Liu <liuyun01@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Daniel Xu 提交于
commit 5277deaab9f98229bdfb8d1e30019b6c25052708 upstream. Some workloads can require far more than 4K oustanding entries. For example memcached can have ~300K sockets over ~40 cores. Bumping the max to 32K seems to work pretty well. Reported-by: NDan Melnic <dmm@fb.com> Signed-off-by: NDaniel Xu <dxu@dxuuu.xyz> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit b2a9eadab85730935f5a6fe19f3f61faaaced601 upstream. The way the logic is setup in io_uring_enter() means that you can't wake up the SQ poller thread while at the same time waiting (or polling) for completions afterwards. There's no reason for that to be the case. Reported-by: NLewis Baker <lbaker@fb.com> Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 6d5d5ac522b20b65167dafe0656b7cad05ec48b3 upstream. We currently merge async work items if we see a strict sequential hit. This helps avoid unnecessary workqueue switches when we don't need them. We can extend this merging to cover cases where it's not a strict sequential hit, but the IO still fits within the same page. If an application is doing multiple requests within the same page, we don't want separate workers waiting on the same page to complete IO. It's much faster to let the first worker bring in the page, then operate on that page from the same worker to complete the next request(s). Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 54a91f3bb9b96ed86bc12b2f7e06b3fce8e86503 upstream. All the popular filesystems need to grab the inode lock for buffered writes. With io_uring punting buffered writes to async context, we observe a lot of contention with all workers hamming this mutex. For buffered writes, we generally don't need a lot of parallelism on the submission side, as the flushing will take care of that for us. Hence we don't need a deep queue on the write side, as long as we can safely punt from the original submission context. Add a workqueue with a limit of 2 that we can use for buffered writes. This greatly improves the performance and efficiency of higher queue depth buffered async writes with io_uring. Reported-by: NAndres Freund <andres@anarazel.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 18d9be1a970c3704366df902b00871bea88d9f14 upstream. Add a helper for queueing a request for async execution, in preparation for optimizing it. No functional change in this patch. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit c576666863b788c2d7e8ab4ef4edd0e9059cb47b upstream. For some applications that end up using a submit-and-wait type of approach for certain batches of IO, we can make that a bit more efficient by allowing the application to block for the last IO submission. This prevents an async when we don't need it, as the application will be blocking for the completion event(s) anyway. Typical use cases are using the liburing io_uring_submit_and_wait() API, or just using io_uring_enter() doing both submissions and completions. As a specific example, RocksDB doing MultiGet() is sped up quite a bit with this change. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jackie Liu 提交于
commit 4fe2c963154c31227bec2f2d690e01f9cab383ea upstream. To support the link with drain, we need to do two parts. There is an sqes: 0 1 2 3 4 5 6 +-----+-----+-----+-----+-----+-----+-----+ | N | L | L | L+D | N | N | N | +-----+-----+-----+-----+-----+-----+-----+ First, we need to ensure that the io before the link is completed, there is a easy way is set drain flag to the link list's head, so all subsequent io will be inserted into the defer_list. +-----+ (0) | N | +-----+ | (2) (3) (4) +-----+ +-----+ +-----+ +-----+ (1) | L+D | --> | L | --> | L+D | --> | N | +-----+ +-----+ +-----+ +-----+ | +-----+ (5) | N | +-----+ | +-----+ (6) | N | +-----+ Second, ensure that the following IO will not be completed first, an easy way is to create a mirror of drain io and insert it into defer_list, in this way, as long as drain io is not processed, the following io in the defer_list will not be actively process. +-----+ (0) | N | +-----+ | (2) (3) (4) +-----+ +-----+ +-----+ +-----+ (1) | L+D | --> | L | --> | L+D | --> | N | +-----+ +-----+ +-----+ +-----+ | +-----+ ('3) | D | <== This is a shadow of (3) +-----+ | +-----+ (5) | N | +-----+ | +-----+ (6) | N | +-----+ Signed-off-by: NJackie Liu <liuyun01@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jackie Liu 提交于
commit 8776f3fa15a5cd213c4dfab7ddaf557983374ea6 upstream. Sqo_thread will get sqring in batches, which will cause ctx->cached_sq_head to be added in batches. if one of these sqes is set with the DRAIN flag, then he will never get a chance to process, and finally sqo_thread will not exit. Fixes: de0617e4671 ("io_uring: add support for marking commands as draining") Signed-off-by: NJackie Liu <liuyun01@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit ac90f249e15cd2a850daa9e36e15f81ce1ff6550 upstream. After commit 75b28affdd6a we can get by with just a single mmap to map both the sq and cq ring. However, userspace doesn't know that. Add a features variable to io_uring_params, and notify userspace that the kernel has this ability. This can then be used in liburing (or in applications directly) to avoid the second mmap. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Hristo Venev 提交于
commit 75b28affdd6aed1c68073ef53907c7bd822aff84 upstream. Both the sq and the cq rings have sizes just over a power of two, and the sq ring is significantly smaller. By bundling them in a single alllocation, we get the sq ring for free. This also means that IORING_OFF_SQ_RING and IORING_OFF_CQ_RING now mean the same thing. If we indicate this to userspace, we can save a mmap call. Signed-off-by: NHristo Venev <hristo@venev.name> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Xiaoguang Wang 提交于
Fix below build errors, 'wait_sum' and 'iowait_sum' need CONFIG_SCHEDSTATS to be configured. fs/jbd2/transaction.c: In function 'new_handle': fs/jbd2/transaction.c:406:51: error: 'struct sched_statistics' has no member named 'wait_sum' handle->h_sched_wait_sum = current->se.statistics.wait_sum; fs/jbd2/transaction.c:407:48: error: 'struct sched_statistics' has no member named 'iowait_sum' handle->h_io_wait_sum = current->se.statistics.iowait_sum; fs/jbd2/transaction.c: In function 'jbd2_journal_stop': fs/jbd2/transaction.c:1790:38: error: 'struct sched_statistics' has no member named 'wait_sum' sched_wait = current->se.statistics.wait_sum - fs/jbd2/transaction.c:1792:35: error: 'struct sched_statistics' has no member named 'iowait_sum' io_wait = current->se.statistics.iowait_sum - Fixes: 67393bb9f538 ("alinux: jbd2: track slow handle which is preventing transaction committing") Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Cambda Zhu 提交于
commit 4fad78ad6422d9bca62135bbed8b6abc4cbb85b8 upstream This patch fixes the calculation of queue when we restore flow director filters after resetting adapter. In ixgbe_fdir_filter_restore(), filter's vf may be zero which makes the queue outside of the rx_ring array. The calculation is changed to the same as ixgbe_add_ethtool_fdir_entry(). Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com> Tested-by: NAndrew Bowers <andrewx.bowers@intel.com> Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
由 Eric Dumazet 提交于
[ Upstream commit 2bec445f9bf35e52e395b971df48d3e1e5dc704a ] Latest commit 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq") apparently allowed syzbot to trigger various crashes in TCP stack [1] I believe this commit only made things easier for syzbot to find its way into triggering use-after-frees. But really the bugs could lead to bad TCP behavior or even plain crashes even for non malicious peers. I have audited all calls to tcp_rtx_queue_unlink() and tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated if we are removing from rtx queue the skb that tp->highest_sack points to. These updates were missing in three locations : 1) tcp_clean_rtx_queue() [This one seems quite serious, I have no idea why this was not caught earlier] 2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations] 3) tcp_send_synack() [Probably not a big deal for normal operations] [1] BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline] BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline] BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891 Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16 CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506 kasan_report+0x12/0x20 mm/kasan/common.c:639 __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134 tcp_highest_sack_seq include/net/tcp.h:1864 [inline] tcp_highest_sack_seq include/net/tcp.h:1856 [inline] tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891 tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline] tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847 tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710 tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706 tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619 tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001 ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428 NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538 __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148 __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262 process_backlog+0x206/0x750 net/core/dev.c:6093 napi_poll net/core/dev.c:6530 [inline] net_rx_action+0x508/0x1120 net/core/dev.c:6598 __do_softirq+0x262/0x98c kernel/softirq.c:292 run_ksoftirqd kernel/softirq.c:603 [inline] run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595 smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165 kthread+0x361/0x430 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Allocated by task 10091: save_stack+0x23/0x90 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] __kasan_kmalloc mm/kasan/common.c:513 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521 slab_post_alloc_hook mm/slab.h:584 [inline] slab_alloc_node mm/slab.c:3263 [inline] kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198 alloc_skb_fclone include/linux/skbuff.h:1099 [inline] sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline] sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852 tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432 inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:672 __sys_sendto+0x262/0x380 net/socket.c:1998 __do_sys_sendto net/socket.c:2010 [inline] __se_sys_sendto net/socket.c:2006 [inline] __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 10095: save_stack+0x23/0x90 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] kasan_set_free_info mm/kasan/common.c:335 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474 kasan_slab_free+0xe/0x10 mm/kasan/common.c:483 __cache_free mm/slab.c:3426 [inline] kmem_cache_free+0x86/0x320 mm/slab.c:3694 kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645 __kfree_skb+0x1e/0x30 net/core/skbuff.c:681 sk_eat_skb include/net/sock.h:2453 [inline] tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166 inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838 sock_recvmsg_nosec net/socket.c:886 [inline] sock_recvmsg net/socket.c:904 [inline] sock_recvmsg+0xce/0x110 net/socket.c:900 __sys_recvfrom+0x1ff/0x350 net/socket.c:2055 __do_sys_recvfrom net/socket.c:2073 [inline] __se_sys_recvfrom net/socket.c:2069 [inline] __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8880a488d040 which belongs to the cache skbuff_fclone_cache of size 456 The buggy address is located 40 bytes inside of 456-byte region [ffff8880a488d040, ffff8880a488d208) The buggy address belongs to the page: page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0 raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000 raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq") Fixes: 50895b9d ("tcp: highest_sack fix") Fixes: 737ff314 ("tcp: use sequence distance to detect reordering") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Cambda Zhu <cambda@linux.alibaba.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
由 Sam Protsenko 提交于
commit 94e297c50b529f5d01cfd1dbc808d61e95180ab7 upstream. ctags indexing ("make tags" command) throws this warning: ctags: Warning: include/linux/notifier.h:125: null expansion of name pattern "\1" This is the result of DEFINE_PER_CPU() macro expansion. Fix that by getting rid of line break. Similar fix was already done in commit 25528213 ("tags: Fix DEFINE_PER_CPU expansions"), but this one probably wasn't noticed. Link: http://lkml.kernel.org/r/20181030202808.28027-1-semen.protsenko@linaro.org Fixes: 9c80172b ("kernel/SRCU: provide a static initializer") Signed-off-by: NSam Protsenko <semen.protsenko@linaro.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com> Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Joseph Qi 提交于
To fix the following build warning: mm/memcontrol.c: In function ‘mem_cgroup_move_account’: mm/memcontrol.c:5604:6: warning: unused variable ‘nid’ [-Wunused-variable] int nid = page_to_nid(page); ^ Fixes: 96298509 ("mm: thp: don't need care deferred split queue in memcg charge move path") Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NYang Shi <yang.shi@linux.alibaba.com>
-
由 Wei Yang 提交于
commit fac0516b5534897bf4c4a88daa06a8cfa5611b23 upstream If compound is true, this means it is a PMD mapped THP. Which implies the page is not linked to any defer list. So the first code chunk will not be executed. Also with this reason, it would not be proper to add this page to a defer list. So the second code chunk is not correct. Based on this, we should remove the defer list related code. [yang.shi@linux.alibaba.com: better patch title] Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Signed-off-by: NWei Yang <richardw.yang@linux.intel.com> Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NYang Shi <yang.shi@linux.alibaba.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> [Fixed conflicts with our 4.19 kernel] Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 08f5439f1df25a6cf6cf4c72cf6c13025599ce67 upstream. The outer poll loop checks for whether we need to reschedule, and returns to userspace if we do. However, it's possible to get stuck in the inner loop as well, if the CPU we are running on needs to reschedule to finish the IO work. Add the need_resched() check in the inner loop as well. This fixes a potential hang if the kernel is configured with CONFIG_PREEMPT_VOLUNTARY=y. Reported-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Tested-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit a3a0e43fd77013819e4b6f55e37e0efe8e35d805 upstream. We need to check if we have CQEs pending before starting a poll loop, as those could be the events we will be spinning for (and hence we'll find none). This can happen if a CQE triggers an error, or if it is found by eg an IRQ before we get a chance to find it through polling. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 500f9fbadef86466a435726192f4ca4df7d94236 upstream. If a request issue ends up being punted to async context to avoid blocking, we can get into a situation where the original application enters the poll loop for that very request before it has been issued. This should not be an issue, except that the polling will hold the io_uring uring_ctx mutex for the duration of the poll. When the async worker has actually issued the request, it needs to acquire this mutex to add the request to the poll issued list. Since the application polling is already holding this mutex, the workqueue sleeps on the mutex forever, and the application thus never gets a chance to poll for the very request it was interested in. Fix this by ensuring that the polling drops the uring_ctx occasionally if it's not making any progress. Reported-by: NJeffrey M. Birnbaum <jmbnyc@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jackie Liu 提交于
commit a982eeb09b6030e567b8b815277c8c9197168040 upstream. This patch may fix two issues: First, when IOSQE_IO_DRAIN set, the next IOs need to be inserted into defer list to delay execution, but link io will be actively scheduled to run by calling io_queue_sqe. Second, when multiple LINK_IOs are inserted together with defer_list, the LINK_IO is no longer keep order. |-------------| | LINK_IO | ----> insert to defer_list ----------- |-------------| | | LINK_IO | ----> insert to defer_list ----------| |-------------| | | LINK_IO | ----> insert to defer_list ----------| |-------------| | | NORMAL_IO | ----> insert to defer_list ----------| |-------------| | | queue_work at same time <-----| Fixes: 9e645e1105c ("io_uring: add support for sqe links") Signed-off-by: NJackie Liu <liuyun01@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Aleix Roca Nonell 提交于
commit 99c79f6692ccdc42e04deea8a36e22bb48168a62 upstream. Commit bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed buffers") introduced an optimization to avoid using the slow iov_iter_advance by manually populating the iov_iter iterator in some cases. However, the computation of the iterator count field was erroneous: The first bvec was always accounted for an extent of page size even if the bvec length was smaller. In consequence, some I/O operations on fixed buffers were unable to operate on the full extent of the buffer, consistently skipping some bytes at the end of it. Fixes: bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed buffers") Cc: stable@vger.kernel.org Signed-off-by: NAleix Roca Nonell <aleix.rocanonell@bsc.es> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jackie Liu 提交于
commit d0ee879187df966ef638031b5f5183078d672141 upstream. [root@localhost ~]# ./liburing/test/link QEMU Standard PC report that: [ 29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622f1d2-dirty #86 [ 29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work [ 29.379929] Call Trace: [ 29.379953] dump_stack+0xa9/0x10e [ 29.379970] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.379986] print_address_description.cold.6+0x9/0x317 [ 29.379999] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380010] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380026] __kasan_report.cold.7+0x1a/0x34 [ 29.380044] ? io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380061] kasan_report+0xe/0x12 [ 29.380076] io_sq_wq_submit_work+0xbf4/0xe90 [ 29.380104] ? io_sq_thread+0xaf0/0xaf0 [ 29.380152] process_one_work+0xb59/0x19e0 [ 29.380184] ? pwq_dec_nr_in_flight+0x2c0/0x2c0 [ 29.380221] worker_thread+0x8c/0xf40 [ 29.380248] ? __kthread_parkme+0xab/0x110 [ 29.380265] ? process_one_work+0x19e0/0x19e0 [ 29.380278] kthread+0x30b/0x3d0 [ 29.380292] ? kthread_create_on_node+0xe0/0xe0 [ 29.380311] ret_from_fork+0x3a/0x50 [ 29.380635] Allocated by task 209: [ 29.381255] save_stack+0x19/0x80 [ 29.381268] __kasan_kmalloc.constprop.6+0xc1/0xd0 [ 29.381279] kmem_cache_alloc+0xc0/0x240 [ 29.381289] io_submit_sqe+0x11bc/0x1c70 [ 29.381300] io_ring_submit+0x174/0x3c0 [ 29.381311] __x64_sys_io_uring_enter+0x601/0x780 [ 29.381322] do_syscall_64+0x9f/0x4d0 [ 29.381336] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 29.381633] Freed by task 84: [ 29.382186] save_stack+0x19/0x80 [ 29.382198] __kasan_slab_free+0x11d/0x160 [ 29.382210] kmem_cache_free+0x8c/0x2f0 [ 29.382220] io_put_req+0x22/0x30 [ 29.382230] io_sq_wq_submit_work+0x28b/0xe90 [ 29.382241] process_one_work+0xb59/0x19e0 [ 29.382251] worker_thread+0x8c/0xf40 [ 29.382262] kthread+0x30b/0x3d0 [ 29.382272] ret_from_fork+0x3a/0x50 [ 29.382569] The buggy address belongs to the object at ffff888067172140 which belongs to the cache io_kiocb of size 224 [ 29.384692] The buggy address is located 120 bytes inside of 224-byte region [ffff888067172140, ffff888067172220) [ 29.386723] The buggy address belongs to the page: [ 29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0 [ 29.387587] flags: 0x100000000000200(slab) [ 29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180 [ 29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000 [ 29.387624] page dumped because: kasan: bad access detected [ 29.387920] Memory state around the buggy address: [ 29.388771] ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ 29.390062] ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 29.392578] ^ [ 29.393480] ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc [ 29.394744] ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 29.396003] ================================================================== [ 29.397260] Disabling lock debugging due to kernel taint io_sq_wq_submit_work free and read req again. Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Cc: linux-block@vger.kernel.org Cc: stable@vger.kernel.org Fixes: f7b76ac9d17e ("io_uring: fix counter inc/dec mismatch in async_list") Signed-off-by: NJackie Liu <liuyun01@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 36703247d5f52a679df9da51192b6950fe81689f upstream. Daniel reports that when testing an http server that uses io_uring to poll for incoming connections, sometimes it hard crashes. This is due to an uninitialized list member for the io_uring request. Normally this doesn't trigger and none of the test cases caught it. Reported-by: NDaniel Kozak <kozzi11@gmail.com> Tested-by: NDaniel Kozak <kozzi11@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Zhengyuan Liu 提交于
commit 9310a7ba6de8cce6209e3e8a3cdf733f824cdd9b upstream. We are using PAGE_SIZE as the unit to determine if the total len in async_list has exceeded max_pages, it's not fair for smaller io sizes. For example, if we are doing 1k-size io streams, we will never exceed max_pages since len >>= PAGE_SHIFT always gets zero. So use original bytes to make it more accurate. Signed-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit bd11b3a391e3df6fa958facbe4b3f9f4cca9bd49 upstream. Hrvoje reports that when a large fixed buffer is registered and IO is being done to the latter pages of said buffer, the IO submission time is much worse: reading to the start of the buffer: 11238 ns reading to the end of the buffer: 1039879 ns In fact, it's worse by two orders of magnitude. The reason for that is how io_uring figures out how to setup the iov_iter. We point the iter at the first bvec, and then use iov_iter_advance() to fast-forward to the offset within that buffer we need. However, that is abysmally slow, as it entails iterating the bvecs that we setup as part of buffer registration. There's really no need to use this generic helper, as we know it's a BVEC type iterator, and we also know that each bvec is PAGE_SIZE in size, apart from possibly the first and last. Hence we can just use a shift on the offset to find the right index, and then adjust the iov_iter appropriately. After this fix, the timings are: reading to the start of the buffer: 10135 ns reading to the end of the buffer: 1377 ns Or about an 755x improvement for the tail page. Reported-by: NHrvoje Zeba <zeba.hrvoje@gmail.com> Tested-by: NHrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Zhengyuan Liu 提交于
commit c0e48f9dea9129aa11bec3ed13803bcc26e96e49 upstream. There is a hang issue while using fio to do some basic test. The issue can be easily reproduced using the below script: while true do fio --ioengine=io_uring -rw=write -bs=4k -numjobs=1 \ -size=1G -iodepth=64 -name=uring --filename=/dev/zero done After several minutes (or more), fio would block at io_uring_enter->io_cqring_wait in order to waiting for previously committed sqes to be completed and can't return to user anymore until we send a SIGTERM to fio. After receiving SIGTERM, fio hangs at io_ring_ctx_wait_and_kill with a backtrace like this: [54133.243816] Call Trace: [54133.243842] __schedule+0x3a0/0x790 [54133.243868] schedule+0x38/0xa0 [54133.243880] schedule_timeout+0x218/0x3b0 [54133.243891] ? sched_clock+0x9/0x10 [54133.243903] ? wait_for_completion+0xa3/0x130 [54133.243916] ? _raw_spin_unlock_irq+0x2c/0x40 [54133.243930] ? trace_hardirqs_on+0x3f/0xe0 [54133.243951] wait_for_completion+0xab/0x130 [54133.243962] ? wake_up_q+0x70/0x70 [54133.243984] io_ring_ctx_wait_and_kill+0xa0/0x1d0 [54133.243998] io_uring_release+0x20/0x30 [54133.244008] __fput+0xcf/0x270 [54133.244029] ____fput+0xe/0x10 [54133.244040] task_work_run+0x7f/0xa0 [54133.244056] do_exit+0x305/0xc40 [54133.244067] ? get_signal+0x13b/0xbd0 [54133.244088] do_group_exit+0x50/0xd0 [54133.244103] get_signal+0x18d/0xbd0 [54133.244112] ? _raw_spin_unlock_irqrestore+0x36/0x60 [54133.244142] do_signal+0x34/0x720 [54133.244171] ? exit_to_usermode_loop+0x7e/0x130 [54133.244190] exit_to_usermode_loop+0xc0/0x130 [54133.244209] do_syscall_64+0x16b/0x1d0 [54133.244221] entry_SYSCALL_64_after_hwframe+0x49/0xbe The reason is that we had added a req to ctx->pending_async at the very end, but it didn't get a chance to be processed. How could this happen? fio#cpu0 wq#cpu1 io_add_to_prev_work io_sq_wq_submit_work atomic_read() <<< 1 atomic_dec_return() << 1->0 list_empty(); <<< true; list_add_tail() atomic_read() << 0 or 1? As atomic_ops.rst states, atomic_read does not guarantee that the runtime modification by any other thread is visible yet, so we must take care of that with a proper implicit or explicit memory barrier. This issue was detected with the help of Jackie's <liuyun01@kylinos.cn> Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests") Signed-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Oleg Nesterov 提交于
commit b772434be0891ed1081a08ae7cfd4666728f8e82 upstream. task->saved_sigmask and ->restore_sigmask are only used in the ret-from- syscall paths. This means that set_user_sigmask() can save ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked was modified. This way the callers do not need 2 sigset_t's passed to set/restore and restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns into the trivial helper which just calls restore_saved_sigmask(). Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.comSigned-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric Wong <e@80x24.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David Laight <David.Laight@aculab.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Zhengyuan Liu 提交于
commit f7b76ac9d17e16e44feebb6d2749fec92bfd6dd4 upstream. We could queue a work for each req in defer and link list without increasing async_list->cnt, so we shouldn't decrease it while exiting from workqueue as well if we didn't process the req in async list. Thanks to Jens Axboe <axboe@kernel.dk> for his guidance. Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests") Signed-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Zhengyuan Liu 提交于
commit dbd0f6d6c2a11eb9c31ca9cd454f95bb5713e92e upstream. sq->cached_sq_head and cq->cached_cq_tail are both unsigned int. If cached_sq_head overflows before cached_cq_tail, then we may miss a barrier req. As cached_cq_tail always follows cached_sq_head, the NQ should be enough. Cc: stable@vger.kernel.org Fixes: de0617e46717 ("io_uring: add support for marking commands as draining") Signed-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jackie Liu 提交于
commit a4c0b3decb33fb4a2b5ecc6234a50680f0b21e7d upstream. INFO: task syz-executor.5:8634 blocked for more than 143 seconds. Not tainted 5.2.0-rc5+ #3 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor.5 D25632 8634 8224 0x00004004 Call Trace: context_switch kernel/sched/core.c:2818 [inline] __schedule+0x658/0x9e0 kernel/sched/core.c:3445 schedule+0x131/0x1d0 kernel/sched/core.c:3509 schedule_timeout+0x9a/0x2b0 kernel/time/timer.c:1783 do_wait_for_common+0x35e/0x5a0 kernel/sched/completion.c:83 __wait_for_common kernel/sched/completion.c:104 [inline] wait_for_common kernel/sched/completion.c:115 [inline] wait_for_completion+0x47/0x60 kernel/sched/completion.c:136 kthread_stop+0xb4/0x150 kernel/kthread.c:559 io_sq_thread_stop fs/io_uring.c:2252 [inline] io_finish_async fs/io_uring.c:2259 [inline] io_ring_ctx_free fs/io_uring.c:2770 [inline] io_ring_ctx_wait_and_kill+0x268/0x880 fs/io_uring.c:2834 io_uring_release+0x5d/0x70 fs/io_uring.c:2842 __fput+0x2e4/0x740 fs/file_table.c:280 ____fput+0x15/0x20 fs/file_table.c:313 task_work_run+0x17e/0x1b0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x402/0x4f0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x110/0x440 arch/x86/entry/common.c:279 do_syscall_64+0x126/0x140 arch/x86/entry/common.c:304 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x412fb1 Code: 80 3b 7c 0f 84 c7 02 00 00 c7 85 d0 00 00 00 00 00 00 00 48 8b 05 cf a6 24 00 49 8b 14 24 41 b9 cb 2a 44 00 48 89 ee 48 89 df <48> 85 c0 4c 0f 45 c8 45 31 c0 31 c9 e8 0e 5b 00 00 85 c0 41 89 c7 RSP: 002b:00007ffe7ee6a180 EFLAGS: 00000293 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000412fb1 RDX: 0000001b2d920000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 0000000000000001 R08: 00000000f3a3e1f8 R09: 00000000f3a3e1fc R10: 00007ffe7ee6a260 R11: 0000000000000293 R12: 000000000075c9a0 R13: 000000000075c9a0 R14: 0000000000024c00 R15: 000000000075bf2c ============================================= There is an wrong logic, when kthread_park running in front of io_sq_thread. CPU#0 CPU#1 io_sq_thread_stop: int kthread(void *_create): kthread_park() __kthread_parkme(self); <<< Wrong kthread_stop() << wait for self->exited << clear_bit KTHREAD_SHOULD_PARK ret = threadfn(data); | |- io_sq_thread |- kthread_should_park() << false |- schedule() <<< nobody wake up stuck CPU#0 stuck CPU#1 So, use a new variable sqo_thread_started to ensure that io_sq_thread run first, then io_sq_thread_stop. Reported-by: syzbot+94324416c485d422fe15@syzkaller.appspotmail.com Suggested-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJackie Liu <liuyun01@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit aa1fa28fc73ea6b740ee7b62bf3b07141883dbb8 upstream. This is done through IORING_OP_RECVMSG. This opcode uses the same sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the msghdr struct in the sqe->addr field as well. We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't block, and punt to async execution if it would have. Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 0fa03c624d8fc9932d0f27c39a9deca6a37e0e17 upstream. This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags for the flags argument, and the msghdr struct is passed in the sqe->addr field. We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't block, and punt to async execution if it would have. Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
Cherry-pick from commit b620743077e291ae7d0debd21f50413a8c266229 upstream. If we pass pages through an iov_iter we always already have a reference in the caller. Thus remove the ITER_BVEC_FLAG_NO_REF and don't take reference to pages by default for bvec backed iov_iters. [Joseph] Resolve conflicts since we don't have: 81ba6abd2bcd "block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF" 7321ecbfc7cf "block: change how we get page references in bio_iov_iter_get_pages" Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Oleg Nesterov 提交于
commit 97abc889ee296faf95ca0e978340fb7b942a3e32 upstream. This is the minimal fix for stable, I'll send cleanups later. Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced the visible change which breaks user-space: a signal temporary unblocked by set_user_sigmask() can be delivered even if the caller returns success or timeout. Change restore_user_sigmask() to accept the additional "interrupted" argument which should be used instead of signal_pending() check, and update the callers. Eric said: : For clarity. I don't think this is required by posix, or fundamentally to : remove the races in select. It is what linux has always done and we have : applications who care so I agree this fix is needed. : : Further in any case where the semantic change that this patch rolls back : (aka where allowing a signal to be delivered and the select like call to : complete) would be advantage we can do as well if not better by using : signalfd. : : Michael is there any chance we can get this guarantee of the linux : implementation of pselect and friends clearly documented. The guarantee : that if the system call completes successfully we are guaranteed that no : signal that is unblocked by using sigmask will be delivered? Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()") Signed-off-by: NOleg Nesterov <oleg@redhat.com> Reported-by: NEric Wong <e@80x24.org> Tested-by: NEric Wong <e@80x24.org> Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com> Acked-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NDeepa Dinamani <deepa.kernel@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Laight <David.Laight@ACULAB.COM> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 9e645e1105ca60fbbc6bddf2fd5ef7e57ed3dca8 upstream. With SQE links, we can create chains of dependent SQEs. One example would be queueing an SQE that's a read from one file descriptor, with the linked SQE being a write to another with the same set of buffers. An SQE link will not stall the pipeline, it'll just ensure that dependent SQEs aren't issued before the previous link has completed. Any error at submission or completion time will break the chain of SQEs. For completions, this also includes short reads or writes, as the next SQE could depend on the previous one being fully completed. Any SQE in a chain that gets canceled due to any of the above errors, will get an CQE fill with -ECANCELED as the error value. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 60c112b0ada09826cc4ae6a4e55df677f76f1313 upstream. Stephen reports: I hit the following General Protection Fault when testing io_uring via the io_uring engine in fio. This was on a VM running 5.2-rc5 and the latest version of fio. The issue occurs for both null_blk and fake NVMe drives. I have not tested bare metal or real NVMe SSDs. The fio script used is given below. [io_uring] time_based=1 runtime=60 filename=/dev/nvme2n1 (note /dev/nullb0 also fails) ioengine=io_uring bs=4k rw=readwrite direct=1 fixedbufs=1 sqthread_poll=1 sqthread_poll_cpu=0 general protection fault: 0000 [#1] SMP PTI CPU: 0 PID: 872 Comm: io_uring-sq Not tainted 5.2.0-rc5-cpacket-io-uring #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 RIP: 0010:fput_many+0x7/0x90 Code: 01 48 85 ff 74 17 55 48 89 e5 53 48 8b 1f e8 a0 f9 ff ff 48 85 db 48 89 df 75 f0 5b 5d f3 c3 0f 1f 40 00 0f 1f 44 00 00 89 f6 <f0> 48 29 77 38 74 01 c3 55 48 89 e5 53 48 89 fb 65 48 \ RSP: 0018:ffffadeb817ebc50 EFLAGS: 00010246 RAX: 0000000000000004 RBX: ffff8f46ad477480 RCX: 0000000000001805 RDX: 0000000000000000 RSI: 0000000000000001 RDI: f18b51b9a39552b5 RBP: ffffadeb817ebc58 R08: ffff8f46b7a318c0 R09: 000000000000015d R10: ffffadeb817ebce8 R11: 0000000000000020 R12: ffff8f46ad4cd000 R13: 00000000fffffff7 R14: ffffadeb817ebe30 R15: 0000000000000004 FS: 0000000000000000(0000) GS:ffff8f46b7a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055828f0bbbf0 CR3: 0000000232176004 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? fput+0x13/0x20 io_free_req+0x20/0x40 io_put_req+0x1b/0x20 io_submit_sqe+0x40a/0x680 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 io_submit_sqes+0xb9/0x160 ? io_submit_sqes+0xb9/0x160 ? __switch_to_asm+0x40/0x70 ? __switch_to_asm+0x34/0x70 ? __schedule+0x3f2/0x6a0 ? __switch_to_asm+0x34/0x70 io_sq_thread+0x1af/0x470 ? __switch_to_asm+0x34/0x70 ? wait_woken+0x80/0x80 ? __switch_to+0x85/0x410 ? __switch_to_asm+0x40/0x70 ? __switch_to_asm+0x34/0x70 ? __schedule+0x3f2/0x6a0 kthread+0x105/0x140 ? io_submit_sqes+0x160/0x160 ? kthread+0x105/0x140 ? io_submit_sqes+0x160/0x160 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x35/0x40 which occurs because using a kernel side submission thread isn't valid without using fixed files (registered through io_uring_register()). This causes io_uring to put the request after logging an error, but before the file field is set in the request. If it happens to be non-zero, we attempt to fput() garbage. Fix this by ensuring that req->file is initialized when the request is allocated. Cc: stable@vger.kernel.org # 5.1+ Reported-by: NStephen Bates <sbates@raithlin.com> Tested-by: NStephen Bates <sbates@raithlin.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Eric Biggers 提交于
commit 355e8d26f719c207aa2e00e6f3cfab3acf21769b upstream. Opening and closing an io_uring instance leaks a UNIX domain socket inode. This is because the ->file of the io_uring instance's internal UNIX domain socket is set to point to the io_uring file, but then sock_release() sees the non-NULL ->file and assumes the inode reference is held by the file so doesn't call iput(). That's not the case here, since the reference is still meant to be held by the socket; the actual inode of the io_uring file is different. Fix this leak by NULL-ing out ->file before releasing the socket. Reported-by: syzbot+111cb28d9f583693aefa@syzkaller.appspotmail.com Fixes: 2b188cc1bb85 ("Add io_uring IO interface") Cc: <stable@vger.kernel.org> # v5.1+ Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 9d93a3f5a0c0d0f79aebc597d47c7cedc852aeb5 upstream. We can encounter a short read when we're doing buffered reads and the data is partially cached. Right now we just return the short read, but that forces the application to read that CQE, then issue another SQE to finish the read. That read will not be cached, and hence will result in an async punt. It's more efficient to do that async punt from within the kernel, as that will the not need two round trips more to the kernel. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-