1. 11 2月, 2020 30 次提交
  2. 04 2月, 2020 10 次提交
    • J
      io_uring: add need_resched() check in inner poll loop · a28be734
      Jens Axboe 提交于
      commit 08f5439f1df25a6cf6cf4c72cf6c13025599ce67 upstream.
      
      The outer poll loop checks for whether we need to reschedule, and
      returns to userspace if we do. However, it's possible to get stuck
      in the inner loop as well, if the CPU we are running on needs to
      reschedule to finish the IO work.
      
      Add the need_resched() check in the inner loop as well. This fixes
      a potential hang if the kernel is configured with
      CONFIG_PREEMPT_VOLUNTARY=y.
      Reported-by: NSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Tested-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a28be734
    • J
      io_uring: don't enter poll loop if we have CQEs pending · ac7d1176
      Jens Axboe 提交于
      commit a3a0e43fd77013819e4b6f55e37e0efe8e35d805 upstream.
      
      We need to check if we have CQEs pending before starting a poll loop,
      as those could be the events we will be spinning for (and hence we'll
      find none). This can happen if a CQE triggers an error, or if it is
      found by eg an IRQ before we get a chance to find it through polling.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      ac7d1176
    • J
      io_uring: fix potential hang with polled IO · bea30534
      Jens Axboe 提交于
      commit 500f9fbadef86466a435726192f4ca4df7d94236 upstream.
      
      If a request issue ends up being punted to async context to avoid
      blocking, we can get into a situation where the original application
      enters the poll loop for that very request before it has been issued.
      This should not be an issue, except that the polling will hold the
      io_uring uring_ctx mutex for the duration of the poll. When the async
      worker has actually issued the request, it needs to acquire this mutex
      to add the request to the poll issued list. Since the application
      polling is already holding this mutex, the workqueue sleeps on the
      mutex forever, and the application thus never gets a chance to poll for
      the very request it was interested in.
      
      Fix this by ensuring that the polling drops the uring_ctx occasionally
      if it's not making any progress.
      Reported-by: NJeffrey M. Birnbaum <jmbnyc@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      bea30534
    • J
      io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list · 340538dc
      Jackie Liu 提交于
      commit a982eeb09b6030e567b8b815277c8c9197168040 upstream.
      
      This patch may fix two issues:
      
      First, when IOSQE_IO_DRAIN set, the next IOs need to be inserted into
      defer list to delay execution, but link io will be actively scheduled to
      run by calling io_queue_sqe.
      
      Second, when multiple LINK_IOs are inserted together with defer_list,
      the LINK_IO is no longer keep order.
      
         |-------------|
         |   LINK_IO   |      ----> insert to defer_list  -----------
         |-------------|                                            |
         |   LINK_IO   |      ----> insert to defer_list  ----------|
         |-------------|                                            |
         |   LINK_IO   |      ----> insert to defer_list  ----------|
         |-------------|                                            |
         |   NORMAL_IO |      ----> insert to defer_list  ----------|
         |-------------|                                            |
                                                                    |
                                    queue_work at same time   <-----|
      
      Fixes: 9e645e1105c ("io_uring: add support for sqe links")
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      340538dc
    • A
      io_uring: fix manual setup of iov_iter for fixed buffers · 19dfcec1
      Aleix Roca Nonell 提交于
      commit 99c79f6692ccdc42e04deea8a36e22bb48168a62 upstream.
      
      Commit bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed
      buffers") introduced an optimization to avoid using the slow
      iov_iter_advance by manually populating the iov_iter iterator in some
      cases.
      
      However, the computation of the iterator count field was erroneous: The
      first bvec was always accounted for an extent of page size even if the
      bvec length was smaller.
      
      In consequence, some I/O operations on fixed buffers were unable to
      operate on the full extent of the buffer, consistently skipping some
      bytes at the end of it.
      
      Fixes: bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed buffers")
      Cc: stable@vger.kernel.org
      Signed-off-by: NAleix Roca Nonell <aleix.rocanonell@bsc.es>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      19dfcec1
    • J
      io_uring: fix KASAN use after free in io_sq_wq_submit_work · 5707a64b
      Jackie Liu 提交于
      commit d0ee879187df966ef638031b5f5183078d672141 upstream.
      
      [root@localhost ~]# ./liburing/test/link
      
      QEMU Standard PC report that:
      
      [   29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622f1d2-dirty #86
      [   29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      [   29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work
      [   29.379929] Call Trace:
      [   29.379953]  dump_stack+0xa9/0x10e
      [   29.379970]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.379986]  print_address_description.cold.6+0x9/0x317
      [   29.379999]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380010]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380026]  __kasan_report.cold.7+0x1a/0x34
      [   29.380044]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380061]  kasan_report+0xe/0x12
      [   29.380076]  io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380104]  ? io_sq_thread+0xaf0/0xaf0
      [   29.380152]  process_one_work+0xb59/0x19e0
      [   29.380184]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
      [   29.380221]  worker_thread+0x8c/0xf40
      [   29.380248]  ? __kthread_parkme+0xab/0x110
      [   29.380265]  ? process_one_work+0x19e0/0x19e0
      [   29.380278]  kthread+0x30b/0x3d0
      [   29.380292]  ? kthread_create_on_node+0xe0/0xe0
      [   29.380311]  ret_from_fork+0x3a/0x50
      
      [   29.380635] Allocated by task 209:
      [   29.381255]  save_stack+0x19/0x80
      [   29.381268]  __kasan_kmalloc.constprop.6+0xc1/0xd0
      [   29.381279]  kmem_cache_alloc+0xc0/0x240
      [   29.381289]  io_submit_sqe+0x11bc/0x1c70
      [   29.381300]  io_ring_submit+0x174/0x3c0
      [   29.381311]  __x64_sys_io_uring_enter+0x601/0x780
      [   29.381322]  do_syscall_64+0x9f/0x4d0
      [   29.381336]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [   29.381633] Freed by task 84:
      [   29.382186]  save_stack+0x19/0x80
      [   29.382198]  __kasan_slab_free+0x11d/0x160
      [   29.382210]  kmem_cache_free+0x8c/0x2f0
      [   29.382220]  io_put_req+0x22/0x30
      [   29.382230]  io_sq_wq_submit_work+0x28b/0xe90
      [   29.382241]  process_one_work+0xb59/0x19e0
      [   29.382251]  worker_thread+0x8c/0xf40
      [   29.382262]  kthread+0x30b/0x3d0
      [   29.382272]  ret_from_fork+0x3a/0x50
      
      [   29.382569] The buggy address belongs to the object at ffff888067172140
                      which belongs to the cache io_kiocb of size 224
      [   29.384692] The buggy address is located 120 bytes inside of
                      224-byte region [ffff888067172140, ffff888067172220)
      [   29.386723] The buggy address belongs to the page:
      [   29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0
      [   29.387587] flags: 0x100000000000200(slab)
      [   29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180
      [   29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      [   29.387624] page dumped because: kasan: bad access detected
      
      [   29.387920] Memory state around the buggy address:
      [   29.388771]  ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [   29.390062]  ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
      [   29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [   29.392578]                                         ^
      [   29.393480]  ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.394744]  ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.396003] ==================================================================
      [   29.397260] Disabling lock debugging due to kernel taint
      
      io_sq_wq_submit_work free and read req again.
      
      Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Cc: linux-block@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: f7b76ac9d17e ("io_uring: fix counter inc/dec mismatch in async_list")
      Signed-off-by: NJackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5707a64b
    • J
      io_uring: ensure ->list is initialized for poll commands · 2d4ff975
      Jens Axboe 提交于
      commit 36703247d5f52a679df9da51192b6950fe81689f upstream.
      
      Daniel reports that when testing an http server that uses io_uring
      to poll for incoming connections, sometimes it hard crashes. This is
      due to an uninitialized list member for the io_uring request. Normally
      this doesn't trigger and none of the test cases caught it.
      Reported-by: NDaniel Kozak <kozzi11@gmail.com>
      Tested-by: NDaniel Kozak <kozzi11@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2d4ff975
    • Z
      io_uring: track io length in async_list based on bytes · 70b10c40
      Zhengyuan Liu 提交于
      commit 9310a7ba6de8cce6209e3e8a3cdf733f824cdd9b upstream.
      
      We are using PAGE_SIZE as the unit to determine if the total len in
      async_list has exceeded max_pages, it's not fair for smaller io sizes.
      For example, if we are doing 1k-size io streams, we will never exceed
      max_pages since len >>= PAGE_SHIFT always gets zero. So use original
      bytes to make it more accurate.
      Signed-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      70b10c40
    • J
      io_uring: don't use iov_iter_advance() for fixed buffers · bc9ceabb
      Jens Axboe 提交于
      commit bd11b3a391e3df6fa958facbe4b3f9f4cca9bd49 upstream.
      
      Hrvoje reports that when a large fixed buffer is registered and IO is
      being done to the latter pages of said buffer, the IO submission time
      is much worse:
      
      reading to the start of the buffer: 11238 ns
      reading to the end of the buffer:   1039879 ns
      
      In fact, it's worse by two orders of magnitude. The reason for that is
      how io_uring figures out how to setup the iov_iter. We point the iter
      at the first bvec, and then use iov_iter_advance() to fast-forward to
      the offset within that buffer we need.
      
      However, that is abysmally slow, as it entails iterating the bvecs
      that we setup as part of buffer registration. There's really no need
      to use this generic helper, as we know it's a BVEC type iterator, and
      we also know that each bvec is PAGE_SIZE in size, apart from possibly
      the first and last. Hence we can just use a shift on the offset to
      find the right index, and then adjust the iov_iter appropriately.
      After this fix, the timings are:
      
      reading to the start of the buffer: 10135 ns
      reading to the end of the buffer:   1377 ns
      
      Or about an 755x improvement for the tail page.
      Reported-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
      Tested-by: NHrvoje Zeba <zeba.hrvoje@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      bc9ceabb
    • Z
      io_uring: add a memory barrier before atomic_read · 6a9af3c7
      Zhengyuan Liu 提交于
      commit c0e48f9dea9129aa11bec3ed13803bcc26e96e49 upstream.
      
      There is a hang issue while using fio to do some basic test. The issue
      can be easily reproduced using the below script:
      
              while true
              do
                      fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
                           -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
              done
      
      After several minutes (or more), fio would block at
      io_uring_enter->io_cqring_wait in order to waiting for previously
      committed sqes to be completed and can't return to user anymore until
      we send a SIGTERM to fio. After receiving SIGTERM, fio hangs at
      io_ring_ctx_wait_and_kill with a backtrace like this:
      
              [54133.243816] Call Trace:
              [54133.243842]  __schedule+0x3a0/0x790
              [54133.243868]  schedule+0x38/0xa0
              [54133.243880]  schedule_timeout+0x218/0x3b0
              [54133.243891]  ? sched_clock+0x9/0x10
              [54133.243903]  ? wait_for_completion+0xa3/0x130
              [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
              [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
              [54133.243951]  wait_for_completion+0xab/0x130
              [54133.243962]  ? wake_up_q+0x70/0x70
              [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
              [54133.243998]  io_uring_release+0x20/0x30
              [54133.244008]  __fput+0xcf/0x270
              [54133.244029]  ____fput+0xe/0x10
              [54133.244040]  task_work_run+0x7f/0xa0
              [54133.244056]  do_exit+0x305/0xc40
              [54133.244067]  ? get_signal+0x13b/0xbd0
              [54133.244088]  do_group_exit+0x50/0xd0
              [54133.244103]  get_signal+0x18d/0xbd0
              [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
              [54133.244142]  do_signal+0x34/0x720
              [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
              [54133.244190]  exit_to_usermode_loop+0xc0/0x130
              [54133.244209]  do_syscall_64+0x16b/0x1d0
              [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that we had added a req to ctx->pending_async at the very
      end, but it didn't get a chance to be processed. How could this happen?
      
              fio#cpu0                                        wq#cpu1
      
              io_add_to_prev_work                    io_sq_wq_submit_work
      
                atomic_read() <<< 1
      
                                                        atomic_dec_return() << 1->0
                                                        list_empty();    <<< true;
      
                list_add_tail()
                atomic_read() << 0 or 1?
      
      As atomic_ops.rst states, atomic_read does not guarantee that the
      runtime modification by any other thread is visible yet, so we must take
      care of that with a proper implicit or explicit memory barrier.
      
      This issue was detected with the help of Jackie's <liuyun01@kylinos.cn>
      
      Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests")
      Signed-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      6a9af3c7