1. 18 Mar, 2020 40 commits
    • tcp: do not leave dangling pointers in tp->highest_sack · c637b6c2
      Eric Dumazet committed
      [ Upstream commit 2bec445f9bf35e52e395b971df48d3e1e5dc704a ]
      
      Latest commit 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
      apparently allowed syzbot to trigger various crashes in TCP stack [1]
      
      I believe this commit only made things easier for syzbot to find
      its way into triggering use-after-frees. But really the bugs
      could lead to bad TCP behavior or even plain crashes even for
      non malicious peers.
      
      I have audited all calls to tcp_rtx_queue_unlink() and
      tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated
      if we are removing from rtx queue the skb that tp->highest_sack points to.
      
      These updates were missing in three locations :
      
      1) tcp_clean_rtx_queue() [This one seems quite serious,
                                I have no idea why this was not caught earlier]
      
      2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations]
      
      3) tcp_send_synack()     [Probably not a big deal for normal operations]
      
      [1]
      BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
      BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
      BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
      Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16
      
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
       __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
       kasan_report+0x12/0x20 mm/kasan/common.c:639
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
       tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
       tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
       tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
       tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline]
       tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847
       tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710
       tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706
       tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619
       tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001
       ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538
       __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148
       __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262
       process_backlog+0x206/0x750 net/core/dev.c:6093
       napi_poll net/core/dev.c:6530 [inline]
       net_rx_action+0x508/0x1120 net/core/dev.c:6598
       __do_softirq+0x262/0x98c kernel/softirq.c:292
       run_ksoftirqd kernel/softirq.c:603 [inline]
       run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
       smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
       kthread+0x361/0x430 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      Allocated by task 10091:
       save_stack+0x23/0x90 mm/kasan/common.c:72
       set_track mm/kasan/common.c:80 [inline]
       __kasan_kmalloc mm/kasan/common.c:513 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
       slab_post_alloc_hook mm/slab.h:584 [inline]
       slab_alloc_node mm/slab.c:3263 [inline]
       kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575
       __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198
       alloc_skb_fclone include/linux/skbuff.h:1099 [inline]
       sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline]
       sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852
       tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282
       tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432
       inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:672
       __sys_sendto+0x262/0x380 net/socket.c:1998
       __do_sys_sendto net/socket.c:2010 [inline]
       __se_sys_sendto net/socket.c:2006 [inline]
       __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 10095:
       save_stack+0x23/0x90 mm/kasan/common.c:72
       set_track mm/kasan/common.c:80 [inline]
       kasan_set_free_info mm/kasan/common.c:335 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
       __cache_free mm/slab.c:3426 [inline]
       kmem_cache_free+0x86/0x320 mm/slab.c:3694
       kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645
       __kfree_skb+0x1e/0x30 net/core/skbuff.c:681
       sk_eat_skb include/net/sock.h:2453 [inline]
       tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166
       inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:886 [inline]
       sock_recvmsg net/socket.c:904 [inline]
       sock_recvmsg+0xce/0x110 net/socket.c:900
       __sys_recvfrom+0x1ff/0x350 net/socket.c:2055
       __do_sys_recvfrom net/socket.c:2073 [inline]
       __se_sys_recvfrom net/socket.c:2069 [inline]
       __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff8880a488d040
       which belongs to the cache skbuff_fclone_cache of size 456
      The buggy address is located 40 bytes inside of
       456-byte region [ffff8880a488d040, ffff8880a488d208)
      The buggy address belongs to the page:
      page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0
      raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000
      raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                                ^
       ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
      Fixes: 50895b9d ("tcp: highest_sack fix")
      Fixes: 737ff314 ("tcp: use sequence distance to detect reordering")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cambda Zhu <cambda@linux.alibaba.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Acked-by: Dust Li <dust.li@linux.alibaba.com>
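
The rule described in the commit above (update the cached pointer whenever the entry it points to is removed from the queue) can be sketched in plain userspace C. This is a simplified illustration, not the kernel code: `struct node`, `struct queue`, `queue_push()` and `queue_unlink_and_free()` are hypothetical names, and the `highest` field only plays the role that tp->highest_sack plays in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an entry of the retransmit queue. */
struct node {
    struct node *prev, *next;
    unsigned int seq;
};

/* Hypothetical stand-in for the socket state: a queue plus a cached pointer. */
struct queue {
    struct node *head, *tail;
    struct node *highest;   /* plays the role of tp->highest_sack */
};

/* Append a new entry at the tail and return it. */
static struct node *queue_push(struct queue *q, unsigned int seq)
{
    struct node *n = calloc(1, sizeof(*n));

    assert(n != NULL);
    n->seq = seq;
    n->prev = q->tail;
    if (q->tail)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    return n;
}

/* Unlink and free n, moving the cached pointer first so it never dangles. */
static void queue_unlink_and_free(struct queue *q, struct node *n)
{
    if (q->highest == n)
        q->highest = n->next;   /* NULL when n was the last entry */

    if (n->prev)
        n->prev->next = n->next;
    else
        q->head = n->next;
    if (n->next)
        n->next->prev = n->prev;
    else
        q->tail = n->prev;
    free(n);
}
```

Without the first `if` in `queue_unlink_and_free()`, `highest` would keep pointing at freed memory, which is the kind of access the KASAN report above flags.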
    • include/linux/notifier.h: SRCU: fix ctags · ffeba5d0
      Sam Protsenko committed
      commit 94e297c50b529f5d01cfd1dbc808d61e95180ab7 upstream.
      
      ctags indexing ("make tags" command) throws this warning:
      
          ctags: Warning: include/linux/notifier.h:125:
          null expansion of name pattern "\1"
      
      This is the result of DEFINE_PER_CPU() macro expansion.  Fix that by
      getting rid of line break.
      
      Similar fix was already done in commit 25528213 ("tags: Fix
      DEFINE_PER_CPU expansions"), but this one probably wasn't noticed.
      
      Link: http://lkml.kernel.org/r/20181030202808.28027-1-semen.protsenko@linaro.org
      Fixes: 9c80172b ("kernel/SRCU: provide a static initializer")
      Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
      Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • alinux: mm: remove unused variable · 9ea9e641
      Joseph Qi committed
      To fix the following build warning:
      mm/memcontrol.c: In function ‘mem_cgroup_move_account’:
      mm/memcontrol.c:5604:6: warning: unused variable ‘nid’ [-Wunused-variable]
        int nid = page_to_nid(page);
            ^
      
      Fixes: 96298509 ("mm: thp: don't need care deferred split queue in memcg charge move path")
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
    • mm: thp: don't need care deferred split queue in memcg charge move path · a5c7cdab
      Wei Yang committed
      commit fac0516b5534897bf4c4a88daa06a8cfa5611b23 upstream
      
      If compound is true, this means it is a PMD mapped THP, which implies
      the page is not linked to any defer list. So the first code chunk will
      not be executed.

      For the same reason, it would not be proper to add this page to a
      defer list. So the second code chunk is not correct.
      
      Based on this, we should remove the defer list related code.
      
      [yang.shi@linux.alibaba.com: better patch title]
      Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com
      Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>    [5.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [Fixed conflicts with our 4.19 kernel]
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • io_uring: add need_resched() check in inner poll loop · 40d7dab8
      Jens Axboe committed
      commit 08f5439f1df25a6cf6cf4c72cf6c13025599ce67 upstream.
      
      The outer poll loop checks for whether we need to reschedule, and
      returns to userspace if we do. However, it's possible to get stuck
      in the inner loop as well, if the CPU we are running on needs to
      reschedule to finish the IO work.
      
      Add the need_resched() check in the inner loop as well. This fixes
      a potential hang if the kernel is configured with
      CONFIG_PREEMPT_VOLUNTARY=y.

      Reported-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • io_uring: don't enter poll loop if we have CQEs pending · 9108b6e4
      Jens Axboe committed
      commit a3a0e43fd77013819e4b6f55e37e0efe8e35d805 upstream.
      
      We need to check if we have CQEs pending before starting a poll loop,
      as those could be the events we will be spinning for (and hence we'll
      find none). This can happen if a CQE triggers an error, or if it is
      found by e.g. an IRQ before we get a chance to find it through polling.

      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • io_uring: fix potential hang with polled IO · 527a6504
      Jens Axboe committed
      commit 500f9fbadef86466a435726192f4ca4df7d94236 upstream.
      
      If a request issue ends up being punted to async context to avoid
      blocking, we can get into a situation where the original application
      enters the poll loop for that very request before it has been issued.
      This should not be an issue, except that the polling will hold the
      io_uring uring_ctx mutex for the duration of the poll. When the async
      worker has actually issued the request, it needs to acquire this mutex
      to add the request to the poll issued list. Since the application
      polling is already holding this mutex, the workqueue sleeps on the
      mutex forever, and the application thus never gets a chance to poll for
      the very request it was interested in.
      
      Fix this by ensuring that the polling drops the uring_ctx occasionally
      if it's not making any progress.

      Reported-by: Jeffrey M. Birnbaum <jmbnyc@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list · dcf45e51
      Jackie Liu committed
      commit a982eeb09b6030e567b8b815277c8c9197168040 upstream.
      
      This patch may fix two issues:
      
      First, when IOSQE_IO_DRAIN is set, the following IOs need to be inserted
      into the defer list to delay execution, but linked IO is actively
      scheduled to run by calling io_queue_sqe.

      Second, when multiple LINK_IOs are inserted into the defer_list together,
      the LINK_IOs no longer keep their order.
      
         |-------------|
         |   LINK_IO   |      ----> insert to defer_list  -----------
         |-------------|                                            |
         |   LINK_IO   |      ----> insert to defer_list  ----------|
         |-------------|                                            |
         |   LINK_IO   |      ----> insert to defer_list  ----------|
         |-------------|                                            |
         |   NORMAL_IO |      ----> insert to defer_list  ----------|
         |-------------|                                            |
                                                                    |
                                    queue_work at same time   <-----|
      
      Fixes: 9e645e1105c ("io_uring: add support for sqe links")
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • io_uring: fix manual setup of iov_iter for fixed buffers · 64033a9b
      Aleix Roca Nonell committed
      commit 99c79f6692ccdc42e04deea8a36e22bb48168a62 upstream.
      
      Commit bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed
      buffers") introduced an optimization to avoid using the slow
      iov_iter_advance by manually populating the iov_iter iterator in some
      cases.
      
      However, the computation of the iterator count field was erroneous: The
      first bvec was always accounted for an extent of page size even if the
      bvec length was smaller.
      
      In consequence, some I/O operations on fixed buffers were unable to
      operate on the full extent of the buffer, consistently skipping some
      bytes at the end of it.
      
      Fixes: bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed buffers")
      Cc: stable@vger.kernel.org
      Signed-off-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
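
The arithmetic behind the commit above can be illustrated with a small sketch. It is a reconstruction from the description only (the first segment always charged as a full page), not the kernel code; the function names and the 4 KiB page size are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12                             /* assumes 4 KiB pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Bytes left after skipping `offset` bytes of a buffer whose first segment
 * holds `first_len` bytes, with the first segment always charged as a full
 * page (the mistake described in the commit message).
 * Only meaningful for first_len <= offset <= total.
 */
static size_t remaining_first_as_page(size_t total, size_t first_len,
                                      size_t offset)
{
    assert(offset >= first_len && offset <= total);

    size_t past_first = offset - first_len;
    size_t seg_skip = 1 + (past_first >> SKETCH_PAGE_SHIFT);
    size_t in_seg = past_first & (SKETCH_PAGE_SIZE - 1);

    return total - ((seg_skip << SKETCH_PAGE_SHIFT) + in_seg);
}

/* Same computation, charging the first segment with its real length. */
static size_t remaining_exact(size_t total, size_t first_len, size_t offset)
{
    assert(offset >= first_len && offset <= total);

    size_t past_first = offset - first_len;

    return total - (first_len + past_first);
}
```

When the first segment is shorter than a page, the first variant comes out short by exactly `SKETCH_PAGE_SIZE - first_len` bytes, which matches the symptom of I/O consistently missing some bytes at the end of the buffer.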
    • io_uring: fix KASAN use after free in io_sq_wq_submit_work · a4709e68
      Jackie Liu committed
      commit d0ee879187df966ef638031b5f5183078d672141 upstream.
      
      [root@localhost ~]# ./liburing/test/link
      
      QEMU Standard PC reports:
      
      [   29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622f1d2-dirty #86
      [   29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      [   29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work
      [   29.379929] Call Trace:
      [   29.379953]  dump_stack+0xa9/0x10e
      [   29.379970]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.379986]  print_address_description.cold.6+0x9/0x317
      [   29.379999]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380010]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380026]  __kasan_report.cold.7+0x1a/0x34
      [   29.380044]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380061]  kasan_report+0xe/0x12
      [   29.380076]  io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380104]  ? io_sq_thread+0xaf0/0xaf0
      [   29.380152]  process_one_work+0xb59/0x19e0
      [   29.380184]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
      [   29.380221]  worker_thread+0x8c/0xf40
      [   29.380248]  ? __kthread_parkme+0xab/0x110
      [   29.380265]  ? process_one_work+0x19e0/0x19e0
      [   29.380278]  kthread+0x30b/0x3d0
      [   29.380292]  ? kthread_create_on_node+0xe0/0xe0
      [   29.380311]  ret_from_fork+0x3a/0x50
      
      [   29.380635] Allocated by task 209:
      [   29.381255]  save_stack+0x19/0x80
      [   29.381268]  __kasan_kmalloc.constprop.6+0xc1/0xd0
      [   29.381279]  kmem_cache_alloc+0xc0/0x240
      [   29.381289]  io_submit_sqe+0x11bc/0x1c70
      [   29.381300]  io_ring_submit+0x174/0x3c0
      [   29.381311]  __x64_sys_io_uring_enter+0x601/0x780
      [   29.381322]  do_syscall_64+0x9f/0x4d0
      [   29.381336]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [   29.381633] Freed by task 84:
      [   29.382186]  save_stack+0x19/0x80
      [   29.382198]  __kasan_slab_free+0x11d/0x160
      [   29.382210]  kmem_cache_free+0x8c/0x2f0
      [   29.382220]  io_put_req+0x22/0x30
      [   29.382230]  io_sq_wq_submit_work+0x28b/0xe90
      [   29.382241]  process_one_work+0xb59/0x19e0
      [   29.382251]  worker_thread+0x8c/0xf40
      [   29.382262]  kthread+0x30b/0x3d0
      [   29.382272]  ret_from_fork+0x3a/0x50
      
      [   29.382569] The buggy address belongs to the object at ffff888067172140
                      which belongs to the cache io_kiocb of size 224
      [   29.384692] The buggy address is located 120 bytes inside of
                      224-byte region [ffff888067172140, ffff888067172220)
      [   29.386723] The buggy address belongs to the page:
      [   29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0
      [   29.387587] flags: 0x100000000000200(slab)
      [   29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180
      [   29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      [   29.387624] page dumped because: kasan: bad access detected
      
      [   29.387920] Memory state around the buggy address:
      [   29.388771]  ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [   29.390062]  ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
      [   29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [   29.392578]                                         ^
      [   29.393480]  ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.394744]  ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.396003] ==================================================================
      [   29.397260] Disabling lock debugging due to kernel taint
      
      io_sq_wq_submit_work frees req and then reads it again.
      
      Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Cc: linux-block@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: f7b76ac9d17e ("io_uring: fix counter inc/dec mismatch in async_list")
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • io_uring: ensure ->list is initialized for poll commands · f3b90301
      Jens Axboe committed
      commit 36703247d5f52a679df9da51192b6950fe81689f upstream.
      
      Daniel reports that when testing an http server that uses io_uring
      to poll for incoming connections, sometimes it hard crashes. This is
      due to an uninitialized list member for the io_uring request. Normally
      this doesn't trigger and none of the test cases caught it.

      Reported-by: Daniel Kozak <kozzi11@gmail.com>
      Tested-by: Daniel Kozak <kozzi11@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • io_uring: track io length in async_list based on bytes · e280026f
      Zhengyuan Liu committed
      commit 9310a7ba6de8cce6209e3e8a3cdf733f824cdd9b upstream.
      
      We are using PAGE_SIZE as the unit to determine if the total len in
      async_list has exceeded max_pages, which is not fair for smaller io sizes.
      For example, if we are doing 1k-size io streams, we will never exceed
      max_pages since len >>= PAGE_SHIFT always gets zero. So use original
      bytes to make it more accurate.

      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
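
The rounding effect described in the commit above is easy to reproduce in isolation. The sketch below is an illustration only: the function names are hypothetical, 4 KiB pages are assumed, and nothing here is taken from the io_uring source.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12   /* assumes 4 KiB pages */

/* Accumulate per-request lengths in pages: a 1k request contributes 0. */
static size_t total_in_pages(size_t io_len, size_t nr_ios)
{
    size_t total = 0;

    for (size_t i = 0; i < nr_ios; i++)
        total += io_len >> SKETCH_PAGE_SHIFT;
    return total;
}

/* Accumulate per-request lengths in bytes: small requests still add up. */
static size_t total_in_bytes(size_t io_len, size_t nr_ios)
{
    size_t total = 0;

    for (size_t i = 0; i < nr_ios; i++)
        total += io_len;
    return total;
}
```

With 1k requests the page-based total stays at zero no matter how many requests are queued, so a limit expressed in pages is never reached, while the byte-based total grows as expected.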
    • io_uring: don't use iov_iter_advance() for fixed buffers · 6117e3bf
      Jens Axboe committed
      commit bd11b3a391e3df6fa958facbe4b3f9f4cca9bd49 upstream.
      
      Hrvoje reports that when a large fixed buffer is registered and IO is
      being done to the latter pages of said buffer, the IO submission time
      is much worse:
      
      reading to the start of the buffer: 11238 ns
      reading to the end of the buffer:   1039879 ns
      
      In fact, it's worse by two orders of magnitude. The reason for that is
      how io_uring figures out how to set up the iov_iter. We point the iter
      at the first bvec, and then use iov_iter_advance() to fast-forward to
      the offset within that buffer we need.

      However, that is abysmally slow, as it entails iterating the bvecs
      that we set up as part of buffer registration. There's really no need
      to use this generic helper, as we know it's a BVEC type iterator, and
      we also know that each bvec is PAGE_SIZE in size, apart from possibly
      the first and last. Hence we can just use a shift on the offset to
      find the right index, and then adjust the iov_iter appropriately.

      After this fix, the timings are:
      
      reading to the start of the buffer: 10135 ns
      reading to the end of the buffer:   1377 ns
      
      Or about a 755x improvement for the tail page.

      Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
      Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
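
The idea in the commit above (replace a segment-by-segment walk with a shift on the offset) can be sketched outside the kernel. The names `seg_pos`, `locate_by_walk()` and `locate_by_shift()` are hypothetical, 4 KiB pages are assumed, and the shortcut relies on the property stated in the commit message: every segment is one page long except possibly the first and the last.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12                             /* assumes 4 KiB pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Position inside a segmented buffer: segment index and offset within it. */
struct seg_pos {
    size_t index;
    size_t offset;
};

/* Generic approach: walk the segments one by one, O(number of segments). */
static struct seg_pos locate_by_walk(const size_t *lens, size_t nr,
                                     size_t offset)
{
    struct seg_pos pos = { 0, offset };

    while (pos.index < nr && pos.offset >= lens[pos.index]) {
        pos.offset -= lens[pos.index];
        pos.index++;
    }
    return pos;
}

/* Shortcut: skip the first segment, then index by shifting, O(1). */
static struct seg_pos locate_by_shift(size_t first_len, size_t offset)
{
    struct seg_pos pos = { 0, offset };

    if (offset < first_len)
        return pos;                          /* still inside the first segment */

    offset -= first_len;
    pos.index = 1 + (offset >> SKETCH_PAGE_SHIFT);
    pos.offset = offset & (SKETCH_PAGE_SIZE - 1);
    return pos;
}
```

Both functions return the same position for any offset inside the buffer; only the cost differs, which is the gap the timings in the commit message show for reads near the end of a large buffer.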
    • io_uring: add a memory barrier before atomic_read · 5e331d91
      Zhengyuan Liu committed
      commit c0e48f9dea9129aa11bec3ed13803bcc26e96e49 upstream.
      
      There is a hang issue while using fio to do some basic test. The issue
      can be easily reproduced using the below script:
      
              while true
              do
                      fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
                           -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
              done
      
      After several minutes (or more), fio would block at
      io_uring_enter->io_cqring_wait, waiting for previously committed
      sqes to be completed, and cannot return to userspace until
      we send a SIGTERM to fio. After receiving SIGTERM, fio hangs at
      io_ring_ctx_wait_and_kill with a backtrace like this:
      
              [54133.243816] Call Trace:
              [54133.243842]  __schedule+0x3a0/0x790
              [54133.243868]  schedule+0x38/0xa0
              [54133.243880]  schedule_timeout+0x218/0x3b0
              [54133.243891]  ? sched_clock+0x9/0x10
              [54133.243903]  ? wait_for_completion+0xa3/0x130
              [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
              [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
              [54133.243951]  wait_for_completion+0xab/0x130
              [54133.243962]  ? wake_up_q+0x70/0x70
              [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
              [54133.243998]  io_uring_release+0x20/0x30
              [54133.244008]  __fput+0xcf/0x270
              [54133.244029]  ____fput+0xe/0x10
              [54133.244040]  task_work_run+0x7f/0xa0
              [54133.244056]  do_exit+0x305/0xc40
              [54133.244067]  ? get_signal+0x13b/0xbd0
              [54133.244088]  do_group_exit+0x50/0xd0
              [54133.244103]  get_signal+0x18d/0xbd0
              [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
              [54133.244142]  do_signal+0x34/0x720
              [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
              [54133.244190]  exit_to_usermode_loop+0xc0/0x130
              [54133.244209]  do_syscall_64+0x16b/0x1d0
              [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that we had added a req to ctx->pending_async at the very
      end, but it didn't get a chance to be processed. How could this happen?
      
              fio#cpu0                                        wq#cpu1
      
              io_add_to_prev_work                    io_sq_wq_submit_work
      
                atomic_read() <<< 1
      
                                                        atomic_dec_return() << 1->0
                                                        list_empty();    <<< true;
      
                list_add_tail()
                atomic_read() << 0 or 1?
      
      As atomic_ops.rst states, atomic_read does not guarantee that the
      runtime modification by any other thread is visible yet, so we must take
      care of that with a proper implicit or explicit memory barrier.
      
      This issue was detected with the help of Jackie <liuyun01@kylinos.cn>.
      
      Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests")
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • signal: simplify set_user_sigmask/restore_user_sigmask · f12f9562
      Oleg Nesterov committed
      commit b772434be0891ed1081a08ae7cfd4666728f8e82 upstream.
      
      task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
      syscall paths.  This means that set_user_sigmask() can save ->blocked in
      ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
      was modified.
      
      This way the callers do not need 2 sigset_t's passed to set/restore and
      restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
      into the trivial helper which just calls restore_saved_sigmask().
      
      Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Eric Wong <e@80x24.org>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David Laight <David.Laight@aculab.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • io_uring: fix counter inc/dec mismatch in async_list · eaebfba5
      Zhengyuan Liu committed
      commit f7b76ac9d17e16e44feebb6d2749fec92bfd6dd4 upstream.
      
      We could queue a work for each req in defer and link list without
      increasing async_list->cnt, so we shouldn't decrease it while exiting
      from workqueue as well if we didn't process the req in async list.
      
      Thanks to Jens Axboe <axboe@kernel.dk> for his guidance.
      
      Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests")
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
    • io_uring: fix the sequence comparison in io_sequence_defer · 8bc3afd8
      Zhengyuan Liu committed
      commit dbd0f6d6c2a11eb9c31ca9cd454f95bb5713e92e upstream.
      
      sq->cached_sq_head and cq->cached_cq_tail are both unsigned int. If
      cached_sq_head overflows before cached_cq_tail, then we may miss a
      barrier req. As cached_cq_tail always follows cached_sq_head, a
      not-equal (NQ) comparison should be enough.
      
      Cc: stable@vger.kernel.org
      Fixes: de0617e46717 ("io_uring: add support for marking commands as draining")
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
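
The wraparound case in the commit above can be shown with two unsigned counters. This is a simplified sketch based only on the description: the function names are hypothetical and the real check in io_sequence_defer may involve additional terms.

```c
#include <assert.h>
#include <limits.h>

/* Greater-than test: breaks once the request sequence has wrapped around. */
static int should_defer_gt(unsigned int req_seq, unsigned int cq_tail)
{
    return req_seq > cq_tail;
}

/* Not-equal test: still correct after wraparound, as the tail only follows. */
static int should_defer_ne(unsigned int req_seq, unsigned int cq_tail)
{
    return req_seq != cq_tail;
}
```

Take a completion tail at `UINT_MAX - 1` and a request whose sequence has already wrapped to 2. The request is still ahead of the tail and must be deferred, but the greater-than test returns 0 while the not-equal test returns 1. Before any wraparound the two tests agree.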
    • io_uring: fix io_sq_thread_stop running in front of io_sq_thread · a8026189
      Jackie Liu committed
      commit a4c0b3decb33fb4a2b5ecc6234a50680f0b21e7d upstream.
      
      INFO: task syz-executor.5:8634 blocked for more than 143 seconds.
             Not tainted 5.2.0-rc5+ #3
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor.5  D25632  8634   8224 0x00004004
      Call Trace:
        context_switch kernel/sched/core.c:2818 [inline]
        __schedule+0x658/0x9e0 kernel/sched/core.c:3445
        schedule+0x131/0x1d0 kernel/sched/core.c:3509
        schedule_timeout+0x9a/0x2b0 kernel/time/timer.c:1783
        do_wait_for_common+0x35e/0x5a0 kernel/sched/completion.c:83
        __wait_for_common kernel/sched/completion.c:104 [inline]
        wait_for_common kernel/sched/completion.c:115 [inline]
        wait_for_completion+0x47/0x60 kernel/sched/completion.c:136
        kthread_stop+0xb4/0x150 kernel/kthread.c:559
        io_sq_thread_stop fs/io_uring.c:2252 [inline]
        io_finish_async fs/io_uring.c:2259 [inline]
        io_ring_ctx_free fs/io_uring.c:2770 [inline]
        io_ring_ctx_wait_and_kill+0x268/0x880 fs/io_uring.c:2834
        io_uring_release+0x5d/0x70 fs/io_uring.c:2842
        __fput+0x2e4/0x740 fs/file_table.c:280
        ____fput+0x15/0x20 fs/file_table.c:313
        task_work_run+0x17e/0x1b0 kernel/task_work.c:113
        tracehook_notify_resume include/linux/tracehook.h:185 [inline]
        exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
        prepare_exit_to_usermode+0x402/0x4f0 arch/x86/entry/common.c:199
        syscall_return_slowpath+0x110/0x440 arch/x86/entry/common.c:279
        do_syscall_64+0x126/0x140 arch/x86/entry/common.c:304
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x412fb1
      Code: 80 3b 7c 0f 84 c7 02 00 00 c7 85 d0 00 00 00 00 00 00 00 48 8b 05 cf
      a6 24 00 49 8b 14 24 41 b9 cb 2a 44 00 48 89 ee 48 89 df <48> 85 c0 4c 0f
      45 c8 45 31 c0 31 c9 e8 0e 5b 00 00 85 c0 41 89 c7
      RSP: 002b:00007ffe7ee6a180 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
      RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000412fb1
      RDX: 0000001b2d920000 RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 0000000000000001 R08: 00000000f3a3e1f8 R09: 00000000f3a3e1fc
      R10: 00007ffe7ee6a260 R11: 0000000000000293 R12: 000000000075c9a0
      R13: 000000000075c9a0 R14: 0000000000024c00 R15: 000000000075bf2c
      
      =============================================
      
      The logic goes wrong when kthread_park() runs before io_sq_thread()
      has started.
      
      CPU#0					CPU#1
      
      io_sq_thread_stop:			int kthread(void *_create):
      
      kthread_park()
      					__kthread_parkme(self);	 <<< Wrong
      kthread_stop()
          << wait for self->exited
          << clear_bit KTHREAD_SHOULD_PARK
      
      					ret = threadfn(data);
      					   |
      					   |- io_sq_thread
      					       |- kthread_should_park()	<< false
      					       |- schedule() <<< nobody wake up
      
      stuck CPU#0				stuck CPU#1
      
      So, use a new variable sqo_thread_started to ensure that io_sq_thread
      runs first, and io_sq_thread_stop only afterwards.
      
      Reported-by: syzbot+94324416c485d422fe15@syzkaller.appspotmail.com
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a8026189
    • J
      io_uring: add support for recvmsg() · 3962b3d0
      Jens Axboe committed
      commit aa1fa28fc73ea6b740ee7b62bf3b07141883dbb8 upstream.
      
      This is done through IORING_OP_RECVMSG. This opcode uses the same
      sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
      msghdr struct in the sqe->addr field as well.
      
      We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3962b3d0
    • J
      io_uring: add support for sendmsg() · 0cb8acf9
      Jens Axboe committed
      commit 0fa03c624d8fc9932d0f27c39a9deca6a37e0e17 upstream.
      
      This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
      for the flags argument, and the msghdr struct is passed in the
      sqe->addr field.
      
      We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0cb8acf9
    • C
      block: never take page references for ITER_BVEC · 709d159e
      Christoph Hellwig committed
      Cherry-pick from commit b620743077e291ae7d0debd21f50413a8c266229 upstream.
      
      If we pass pages through an iov_iter we always already have a reference
      in the caller.  Thus remove the ITER_BVEC_FLAG_NO_REF and don't take
      reference to pages by default for bvec backed iov_iters.
      
      [Joseph] Resolve conflicts since we don't have:
      81ba6abd2bcd "block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF"
      7321ecbfc7cf "block: change how we get page references in bio_iov_iter_get_pages"
      Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      709d159e
    • O
      signal: remove the wrong signal_pending() check in restore_user_sigmask() · a48e4674
      Oleg Nesterov committed
      commit 97abc889ee296faf95ca0e978340fb7b942a3e32 upstream.
      
      This is the minimal fix for stable, I'll send cleanups later.
      
      Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
      the visible change which breaks user-space: a signal temporarily unblocked
      by set_user_sigmask() can be delivered even if the caller returns
      success or timeout.
      
      Change restore_user_sigmask() to accept the additional "interrupted"
      argument which should be used instead of signal_pending() check, and
      update the callers.
      
      Eric said:
      
      : For clarity.  I don't think this is required by posix, or fundamentally to
      : remove the races in select.  It is what linux has always done and we have
      : applications who care so I agree this fix is needed.
      :
      : Further in any case where the semantic change that this patch rolls back
      : (aka where allowing a signal to be delivered and the select like call to
      : complete) would be advantage we can do as well if not better by using
      : signalfd.
      :
      : Michael is there any chance we can get this guarantee of the linux
      : implementation of pselect and friends clearly documented.  The guarantee
      : that if the system call completes successfully we are guaranteed that no
      : signal that is unblocked by using sigmask will be delivered?
      
      Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
      Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Eric Wong <e@80x24.org>
      Tested-by: Eric Wong <e@80x24.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a48e4674
    • J
      io_uring: add support for sqe links · fda445b3
      Jens Axboe committed
      commit 9e645e1105ca60fbbc6bddf2fd5ef7e57ed3dca8 upstream.
      
      With SQE links, we can create chains of dependent SQEs. One example
      would be queueing an SQE that's a read from one file descriptor, with
      the linked SQE being a write to another with the same set of buffers.
      
      An SQE link will not stall the pipeline, it'll just ensure that
      dependent SQEs aren't issued before the previous link has completed.
      
      Any error at submission or completion time will break the chain of SQEs.
      For completions, this also includes short reads or writes, as the next
      SQE could depend on the previous one being fully completed.
      
      Any SQE in a chain that gets canceled due to any of the above errors
      will get a CQE filled with -ECANCELED as the error value.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      fda445b3
    • J
      io_uring: ensure req->file is cleared on allocation · deec7877
      Jens Axboe committed
      commit 60c112b0ada09826cc4ae6a4e55df677f76f1313 upstream.
      
      Stephen reports:
      
      I hit the following General Protection Fault when testing io_uring via
      the io_uring engine in fio. This was on a VM running 5.2-rc5 and the
      latest version of fio. The issue occurs for both null_blk and fake NVMe
      drives. I have not tested bare metal or real NVMe SSDs. The fio script
      used is given below.
      
      [io_uring]
      time_based=1
      runtime=60
      filename=/dev/nvme2n1 (note /dev/nullb0 also fails)
      ioengine=io_uring
      bs=4k
      rw=readwrite
      direct=1
      fixedbufs=1
      sqthread_poll=1
      sqthread_poll_cpu=0
      
      general protection fault: 0000 [#1] SMP PTI
      CPU: 0 PID: 872 Comm: io_uring-sq Not tainted 5.2.0-rc5-cpacket-io-uring #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      RIP: 0010:fput_many+0x7/0x90
      Code: 01 48 85 ff 74 17 55 48 89 e5 53 48 8b 1f e8 a0 f9 ff ff 48 85 db 48 89 df 75 f0 5b 5d f3 c3 0f 1f 40 00 0f 1f 44 00 00 89 f6 <f0> 48 29 77 38 74 01 c3 55 48 89 e5 53 48 89 fb 65 48 \
      
      RSP: 0018:ffffadeb817ebc50 EFLAGS: 00010246
      RAX: 0000000000000004 RBX: ffff8f46ad477480 RCX: 0000000000001805
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: f18b51b9a39552b5
      RBP: ffffadeb817ebc58 R08: ffff8f46b7a318c0 R09: 000000000000015d
      R10: ffffadeb817ebce8 R11: 0000000000000020 R12: ffff8f46ad4cd000
      R13: 00000000fffffff7 R14: ffffadeb817ebe30 R15: 0000000000000004
      FS:  0000000000000000(0000) GS:ffff8f46b7a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055828f0bbbf0 CR3: 0000000232176004 CR4: 00000000003606f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       ? fput+0x13/0x20
       io_free_req+0x20/0x40
       io_put_req+0x1b/0x20
       io_submit_sqe+0x40a/0x680
       ? __switch_to_asm+0x34/0x70
       ? __switch_to_asm+0x40/0x70
       io_submit_sqes+0xb9/0x160
       ? io_submit_sqes+0xb9/0x160
       ? __switch_to_asm+0x40/0x70
       ? __switch_to_asm+0x34/0x70
       ? __schedule+0x3f2/0x6a0
       ? __switch_to_asm+0x34/0x70
       io_sq_thread+0x1af/0x470
       ? __switch_to_asm+0x34/0x70
       ? wait_woken+0x80/0x80
       ? __switch_to+0x85/0x410
       ? __switch_to_asm+0x40/0x70
       ? __switch_to_asm+0x34/0x70
       ? __schedule+0x3f2/0x6a0
       kthread+0x105/0x140
       ? io_submit_sqes+0x160/0x160
       ? kthread+0x105/0x140
       ? io_submit_sqes+0x160/0x160
       ? kthread_destroy_worker+0x50/0x50
       ret_from_fork+0x35/0x40
      
      which occurs because using a kernel side submission thread isn't valid
      without using fixed files (registered through io_uring_register()). This
      causes io_uring to put the request after logging an error, but before
      the file field is set in the request. If it happens to be non-zero, we
      attempt to fput() garbage.
      
      Fix this by ensuring that req->file is initialized when the request is
      allocated.
      
      Cc: stable@vger.kernel.org # 5.1+
      Reported-by: Stephen Bates <sbates@raithlin.com>
      Tested-by: Stephen Bates <sbates@raithlin.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      deec7877
    • E
      io_uring: fix memory leak of UNIX domain socket inode · dc61e2f4
      Eric Biggers committed
      commit 355e8d26f719c207aa2e00e6f3cfab3acf21769b upstream.
      
      Opening and closing an io_uring instance leaks a UNIX domain socket
      inode.  This is because the ->file of the io_uring instance's internal
      UNIX domain socket is set to point to the io_uring file, but then
      sock_release() sees the non-NULL ->file and assumes the inode reference
      is held by the file so doesn't call iput().  That's not the case here,
      since the reference is still meant to be held by the socket; the actual
      inode of the io_uring file is different.
      
      Fix this leak by NULL-ing out ->file before releasing the socket.
      
      Reported-by: syzbot+111cb28d9f583693aefa@syzkaller.appspotmail.com
      Fixes: 2b188cc1bb85 ("Add io_uring IO interface")
      Cc: <stable@vger.kernel.org> # v5.1+
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      dc61e2f4
    • J
      io_uring: punt short reads to async context · c599edd9
      Jens Axboe committed
      commit 9d93a3f5a0c0d0f79aebc597d47c7cedc852aeb5 upstream.
      
      We can encounter a short read when we're doing buffered reads and the
      data is partially cached. Right now we just return the short read, but
      that forces the application to read that CQE, then issue another SQE
      to finish the read. That read will not be cached, and hence will result
      in an async punt.
      
      It's more efficient to do that async punt from within the kernel, as
      that avoids two more round trips to the kernel.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      c599edd9
    • J
      uio: make import_iovec()/compat_import_iovec() return bytes on success · 0c13034a
      Jens Axboe committed
      commit 87e5e6dab6c2a21fab2620f37786276d202e2ce0 upstream.
      
      Currently these functions return < 0 on error, and 0 for success.
      Change that so that we return < 0 on error, but number of bytes
      for success.
      
      Some callers already treat the return value that way, others need a
      slight tweak.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0c13034a
    • P
      io_uring: Fix __io_uring_register() false success · cb67bab8
      Pavel Begunkov committed
      commit a278682dad37fd2f8d2f30d8e84e376a856ab472 upstream.
      
      If io_copy_iov() fails, it will break the loop and report success,
      even though the operation only partially completed.
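
      A generic sketch of this bug pattern, not the actual
      __io_uring_register() code. copy_item() and both register_all_*()
      functions are hypothetical names used only for illustration:

```c
#include <errno.h>

/* Hypothetical per-item step: fails for negative inputs. */
static int copy_item(int v)
{
    return v < 0 ? -EFAULT : 0;
}

/* Buggy pattern: the loop stops on the first failure, but the error
 * is dropped and the function still reports success. */
static int register_all_buggy(const int *items, int n)
{
    for (int i = 0; i < n; i++) {
        if (copy_item(items[i]))
            break;
    }
    return 0;
}

/* Fixed pattern: the error code that ended the loop is what the
 * caller gets back. */
static int register_all_fixed(const int *items, int n)
{
    int ret = 0;

    for (int i = 0; i < n; i++) {
        ret = copy_item(items[i]);
        if (ret)
            break;
    }
    return ret;
}
```

      For the input {1, -1, 2} the buggy variant returns 0 after processing
      only the first item, while the fixed variant returns -EFAULT.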
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      cb67bab8
    • J
      tools/io_uring: sync with liburing · a520af94
      Jens Axboe committed
      commit 004d564f908790efe815a6510a542ac1227ef2a2 upstream.
      
      Various fixes and changes have been applied to liburing since we
      copied some select bits to the kernel testing/examples part, sync
      up with liburing to get those changes.
      
      Most notable is the change that split the CQE reading into the peek
      and seen event, instead of being just a single function. Also fixes
      an unsigned wrap issue in io_uring_submit(), leak of 'fd' in setup
      if we fail, and various other little issues.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a520af94
    • J
      tools/io_uring: fix Makefile for pthread library link · 16770031
      Jens Axboe committed
      commit 486f069253c3c738dec62daeb16f7232b2cca065 upstream.
      
      Currently fails with:
      
      io_uring-bench.o: In function `main':
      /home/axboe/git/linux-block/tools/io_uring/io_uring-bench.c:560: undefined reference to `pthread_create'
      /home/axboe/git/linux-block/tools/io_uring/io_uring-bench.c:588: undefined reference to `pthread_join'
      collect2: error: ld returned 1 exit status
      Makefile:11: recipe for target 'io_uring-bench' failed
      make: *** [io_uring-bench] Error 1
      
      Move -lpthread to the end.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      16770031
    • J
      blk-mq: fix NULL pointer dereference in case no poll implementation · 1e41c505
      Joseph Qi committed
      For some drivers, such as virtio-blk, the poll function is not implemented
      yet. Before commit 529262d5 ("block: remove ->poll_fn"), q->poll_fn
      was NULL in that case, so blk_poll() would not actually poll.
      So add a check for this to avoid a NULL pointer dereference when calling
      q->mq_ops->poll.
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      1e41c505
    • J
      io_uring: use wait_event_interruptible for cq_wait conditional wait · fce831f9
      Jackie Liu committed
      commit fdb288a679cdf6a71f3c1ae6f348ba4dae742681 upstream.
      
      The previous patch ensured that io_cqring_events contains the
      smp_rmb memory barrier. Now we can use wait_event_interruptible
      to keep the code simple.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      fce831f9
    • J
      io_uring: adjust smp_rmb inside io_cqring_events · e4fd982c
      Jackie Liu committed
      commit dc6ce4bc2b355a47f225a0205046b3ebf29a7f72 upstream.
      
      Since smp_rmb is required whenever io_cqring_events is used,
      keep smp_rmb inside the function io_cqring_events.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e4fd982c
    • R
      io_uring: fix infinite wait in kthread_park() on io_finish_async() · a35d7922
      Roman Penyaev committed
      commit 2bbcd6d3b36a75a19be4917807f54ae32dd26aba upstream.
      
      This fixes a couple of races which lead to an infinite wait for park
      completion, with the following backtraces:
      
        [20801.303319] Call Trace:
        [20801.303321]  ? __schedule+0x284/0x650
        [20801.303323]  schedule+0x33/0xc0
        [20801.303324]  schedule_timeout+0x1bc/0x210
        [20801.303326]  ? schedule+0x3d/0xc0
        [20801.303327]  ? schedule_timeout+0x1bc/0x210
        [20801.303329]  ? preempt_count_add+0x79/0xb0
        [20801.303330]  wait_for_completion+0xa5/0x120
        [20801.303331]  ? wake_up_q+0x70/0x70
        [20801.303333]  kthread_park+0x48/0x80
        [20801.303335]  io_finish_async+0x2c/0x70
        [20801.303336]  io_ring_ctx_wait_and_kill+0x95/0x180
        [20801.303338]  io_uring_release+0x1c/0x20
        [20801.303339]  __fput+0xad/0x210
        [20801.303341]  task_work_run+0x8f/0xb0
        [20801.303342]  exit_to_usermode_loop+0xa0/0xb0
        [20801.303343]  do_syscall_64+0xe0/0x100
        [20801.303349]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [20801.303380] Call Trace:
        [20801.303383]  ? __schedule+0x284/0x650
        [20801.303384]  schedule+0x33/0xc0
        [20801.303386]  io_sq_thread+0x38a/0x410
        [20801.303388]  ? __switch_to_asm+0x40/0x70
        [20801.303390]  ? wait_woken+0x80/0x80
        [20801.303392]  ? _raw_spin_lock_irqsave+0x17/0x40
        [20801.303394]  ? io_submit_sqes+0x120/0x120
        [20801.303395]  kthread+0x112/0x130
        [20801.303396]  ? kthread_create_on_node+0x60/0x60
        [20801.303398]  ret_from_fork+0x35/0x40
      
       o kthread_park() waits for park completion, so the io_sq_thread() loop
         should check kthread_should_park() along with kthread_should_stop(),
         otherwise if kthread_park() is called before prepare_to_wait()
         the following schedule() never returns:
      
         CPU#0                    CPU#1
      
         io_sq_thread_stop():     io_sq_thread():
      
                                     while(!kthread_should_stop() && !ctx->sqo_stop) {
      
            ctx->sqo_stop = 1;
            kthread_park()
      
      	                            prepare_to_wait();
                                          if (kthread_should_stop()) {
      				    }
                                          schedule();   <<< nobody checks park flag,
      				                  <<< so schedule and never return
      
       o if the flag ctx->sqo_stop is observed by the io_sq_thread() loop
         it is quite possible that the kthread_should_park() check and the
         following kthread_parkme() are never called, because kthread_park()
         has not yet been called; a few moments later it is called and
         waits there for park completion, which never happens, because
         the kthread has already exited:
      
         CPU#0                    CPU#1
      
         io_sq_thread_stop():     io_sq_thread():
      
            ctx->sqo_stop = 1;
                                     while(!kthread_should_stop() && !ctx->sqo_stop) {
                                         <<< observe sqo_stop and exit the loop
      			       }
      
      			       if (kthread_should_park())
      			           kthread_parkme();  <<< never called, since was
      					              <<< never parked
      
            kthread_park()           <<< waits forever for park completion
      
      In the current patch we quit the loop only on the kthread_should_park()
      check (kthread_park() is synchronous, so kthread_should_stop() is
      never observed), and we abandon the ->sqo_stop flag, since it is racy.
      At the end of io_sq_thread() we unconditionally call kthread_parkme(),
      since we've exited the loop by the park flag.
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a35d7922
    • J
      io_uring: remove 'ev_flags' argument · 8c599bf7
      Jens Axboe committed
      commit c71ffb673cd9bb2ddc575ede9055f265b2535690 upstream.
      
      We always pass in 0 for the cqe flags argument, since the support for
      "this read hit page cache" hint was dropped.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      8c599bf7
    • J
      io_uring: fix failure to verify SQ_AFF cpu · b9dfcf6a
      Jens Axboe committed
      commit 44a9bd18a0f06bba19d155aeaa11e2edce898293 upstream.
      
      The test case we have is rightfully failing with the current kernel:
      
      io_uring_setup(1, 0x7ffe2cafebe0), flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 4
      expected -1, got 3
      
      This is in a vm, and CPU3 is the last valid one, hence asking for 4
      should fail the setup with -EINVAL, not succeed. The problem is that
      we're using array_index_nospec() with nr_cpu_ids as the index, hence we
      wrap and end up using CPU0 instead of CPU4. This makes the setup
      succeed where it should be failing.
      
      We don't need to use array_index_nospec() as we're not indexing any
      array with this. Instead just compare with nr_cpu_ids directly. This
      is fine as we're checking with cpu_online() afterwards.
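
      A simplified userspace model of the wrap described above. It assumes
      the clamp behaves like array_index_nospec(), which yields the index
      when it is below the bound and 0 otherwise (the kernel macro does this
      branch-free with a mask). All function names here are hypothetical,
      and cpu_online_model() simply treats CPUs 0..nr_cpus-1 as online:

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified model of the clamp: in-range indexes pass through,
 * out-of-range indexes become 0. */
static unsigned int clamp_index_model(unsigned int idx, unsigned int size)
{
    return idx < size ? idx : 0;
}

/* Hypothetical stand-in: every CPU below nr_cpus counts as online. */
static bool cpu_online_model(unsigned int cpu, unsigned int nr_cpus)
{
    return cpu < nr_cpus;
}

/* Buggy check: CPU 4 of 4 is clamped to CPU 0, which is online,
 * so the invalid request is accepted. */
static int check_cpu_buggy(unsigned int cpu, unsigned int nr_cpus)
{
    unsigned int c = clamp_index_model(cpu, nr_cpus);

    return cpu_online_model(c, nr_cpus) ? 0 : -EINVAL;
}

/* Fixed check: compare against the bound directly, then test online. */
static int check_cpu_fixed(unsigned int cpu, unsigned int nr_cpus)
{
    if (cpu >= nr_cpus)
        return -EINVAL;
    return cpu_online_model(cpu, nr_cpus) ? 0 : -EINVAL;
}
```

      With four CPUs, asking for CPU 4 succeeds in the buggy variant and
      fails with -EINVAL in the fixed one, matching the test case output
      quoted above.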
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b9dfcf6a
    • S
      io_uring: fix race condition reading SQE data · 8f2cc0e9
      Stefan Bühler committed
      commit e2033e33cb3821c26d4f9e70677910827d3b7885 upstream.
      
      When punting to workers the SQE gets copied after the initial try.
      There is a race condition between reading SQE data for the initial try
      and copying it for punting it to the workers.
      
      For example io_rw_done calls kiocb->ki_complete even if it was prepared
      for IORING_OP_FSYNC (and would be NULL).
      
      The easiest solution for now is to always prepare again in the worker.
      
      req->file is safe to prepare though as long as it is checked before use.
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      8f2cc0e9
    • S
      io_uring: use cpu_online() to check p->sq_thread_cpu instead of cpu_possible() · b5ac46ff
      Shenghui Wang committed
      commit 7889f44dd9cee15aff1c3f7daf81ca4dfed48fc7 upstream.
      
      This issue is found by running liburing/test/io_uring_setup test.
      
      When the test runs, the test case "attempt to bind to invalid cpu" does
      not pass, with messages like:
         io_uring_setup(1, 0xbfc2f7c8), \
      flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, \
      resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, \
      sq_thread_cpu: 2
         expected -1, got 3
         FAIL
      
      On my system, there is:
         CPU(s) possible : 0-3
         CPU(s) online   : 0-1
         CPU(s) offline  : 2-3
         CPU(s) present  : 0-1
      
      The sq_thread_cpu 2 is offline on my system, so the bind should fail.
      But cpu_possible() will pass the check. We shouldn't be able to bind
      to an offline cpu. Use cpu_online() to do the check.
      
      After the change, the test case runs as expected: EINVAL is returned
      for an offline cpu.
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b5ac46ff
    • C
      io_uring: fix shadowed variable ret return code being not checked · 91ee101a
      Colin Ian King committed
      commit efeb862bd5bc001636e690debf6f9fbba98e5bfd upstream.
      
      Currently variable ret is declared in a while-loop code block that
      shadows another variable ret. When an error occurs in the while-loop,
      the error return in ret is not set in the outer code block, so the
      error check on ret always checks the wrong ret variable. The result
      is a check that is always true and a premature return.
      
      Fix this by removing the declaration of the inner while-loop variable
      ret so that shadowing does not occur.
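
      A generic illustration of the shadowing problem, not the io_uring code
      itself. step(), run_shadowed() and run_unshadowed() are hypothetical
      names, and the outer ret is assumed to start at 0:

```c
#include <errno.h>

/* Hypothetical step that fails for negative inputs. */
static int step(int v)
{
    return v < 0 ? -EINVAL : 0;
}

/* Shadowed: the inner 'ret' hides the outer one, so the error set
 * inside the loop never reaches the value that is returned. */
static int run_shadowed(const int *vals, int n)
{
    int ret = 0;

    for (int i = 0; i < n; i++) {
        int ret = step(vals[i]);   /* shadows the outer ret */
        if (ret)
            break;
    }
    return ret;                    /* still the outer ret, always 0 */
}

/* Unshadowed: dropping the inner declaration makes the loop assign
 * to the one and only 'ret'. */
static int run_unshadowed(const int *vals, int n)
{
    int ret = 0;

    for (int i = 0; i < n; i++) {
        ret = step(vals[i]);
        if (ret)
            break;
    }
    return ret;
}
```

      Compilers can flag this class of bug when built with -Wshadow.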
      
      Addresses-Coverity: ("'Constant' variable guards dead code")
      Fixes: 6b06314c47e1 ("io_uring: add file set registration")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      91ee101a
    • S
      req->error only used for iopoll · f44bc1f1
      Stefan Bühler committed
      commit 5dcf877fb13f3c6a8ba0777ef766c4af32df725d upstream.
      
      No need to set it in io_poll_add; io_poll_complete doesn't use it to set
      the result in the CQE.
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f44bc1f1