1. 13 Sep, 2019 1 commit
    • io_uring: extend async work merging · 6d5d5ac5
      Jens Axboe committed
      We currently merge async work items if we see a strict sequential hit.
      This helps avoid unnecessary workqueue switches when we don't need
      them. We can extend this merging to cover cases where it's not a strict
      sequential hit, but the IO still fits within the same page. If an
      application is doing multiple requests within the same page, we don't
      want separate workers waiting on the same page to complete IO. It's much
      faster to let the first worker bring in the page, then operate on that
      page from the same worker to complete the next request(s).
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
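The merging rule described in this commit can be illustrated with a small user-space sketch. It is not the kernel code: `struct pending_io`, `should_merge()` and the 4096-byte page size are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL  /* assumed page size for this sketch */

/* Hypothetical record of the async work item a worker is already handling. */
struct pending_io {
    uint64_t start;  /* byte offset where the pending IO starts */
    uint64_t len;    /* length of the pending IO in bytes */
};

/* Round an offset down or up to a page boundary. */
static uint64_t page_floor(uint64_t off)
{
    return off & ~(SKETCH_PAGE_SIZE - 1);
}

static uint64_t page_ceil(uint64_t off)
{
    return page_floor(off + SKETCH_PAGE_SIZE - 1);
}

/* Old rule: merge only when the new IO starts exactly where the last ended. */
static bool merge_strict(const struct pending_io *p, uint64_t pos)
{
    return pos == p->start + p->len;
}

/* Extended rule: also merge when pos lies in a page the pending IO touches. */
static bool should_merge(const struct pending_io *p, uint64_t pos)
{
    assert(p->len > 0);
    if (merge_strict(p, pos))
        return true;
    return pos >= page_floor(p->start) && pos < page_ceil(p->start + p->len);
}
```

With a pending 512-byte IO at offset 4096, a request at offset 6000 is not sequential but still lands in the same page, so the sketch hands it to the same worker.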
  2. 10 Sep, 2019 5 commits
    • io_uring: limit parallelism of buffered writes · 54a91f3b
      Jens Axboe committed
      All the popular filesystems need to grab the inode lock for buffered
      writes. With io_uring punting buffered writes to async context, we
      observe a lot of contention with all workers hammering this mutex.
      
      For buffered writes, we generally don't need a lot of parallelism on
      the submission side, as the flushing will take care of that for us.
      Hence we don't need a deep queue on the write side, as long as we
      can safely punt from the original submission context.
      
      Add a workqueue with a limit of 2 that we can use for buffered writes.
      This greatly improves the performance and efficiency of higher queue
      depth buffered async writes with io_uring.
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add io_queue_async_work() helper · 18d9be1a
      Jens Axboe committed
      Add a helper for queueing a request for async execution, in preparation
      for optimizing it.
      
      No functional change in this patch.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: optimize submit_and_wait API · c5766668
      Jens Axboe committed
      For some applications that end up using a submit-and-wait type of
      approach for certain batches of IO, we can make that a bit more
      efficient by allowing the application to block for the last IO
      submission. This prevents an async punt when we don't need it, as the
      application will be blocking for the completion event(s) anyway.
      
      Typical use cases are using the liburing
      io_uring_submit_and_wait() API, or just using io_uring_enter()
      doing both submissions and completions. As a specific example,
      RocksDB doing MultiGet() is sped up quite a bit with this
      change.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for link with drain · 4fe2c963
      Jackie Liu committed
      Supporting link with drain takes two steps.
      
      Consider the following SQEs:
      
          0     1     2     3     4     5     6
       +-----+-----+-----+-----+-----+-----+-----+
       |  N  |  L  |  L  | L+D |  N  |  N  |  N  |
       +-----+-----+-----+-----+-----+-----+-----+
      
      First, we need to ensure that the IO before the link has completed.
      An easy way is to set the drain flag on the head of the link list, so
      that all subsequent IO is inserted into the defer_list.
      
      	+-----+
          (0) |  N  |
      	+-----+
                 |          (2)         (3)         (4)
      	+-----+     +-----+     +-----+     +-----+
          (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
      	+-----+     +-----+     +-----+     +-----+
                 |
      	+-----+
          (5) |  N  |
      	+-----+
                 |
      	+-----+
          (6) |  N  |
      	+-----+
      
      Second, we need to ensure that the following IO will not complete
      first. An easy way is to create a mirror of the drain IO and insert it
      into the defer_list. That way, as long as the drain IO has not been
      processed, the following IO in the defer_list will not be processed.
      
      	+-----+
          (0) |  N  |
      	+-----+
                 |          (2)         (3)         (4)
      	+-----+     +-----+     +-----+     +-----+
          (1) | L+D | --> |  L  | --> | L+D | --> |  N  |
      	+-----+     +-----+     +-----+     +-----+
                 |
      	+-----+
         ('3) |  D  |   <== This is a shadow of (3)
      	+-----+
                 |
      	+-----+
          (5) |  N  |
      	+-----+
                 |
      	+-----+
          (6) |  N  |
      	+-----+
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
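The first step of this commit (moving the drain requirement to the head of the link list) can be sketched in user space as follows. The flag values `SK_DRAIN` and `SK_LINK` and the function name are hypothetical stand-ins, and the shadow request from the second step is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Flag bits local to this sketch; the values are illustrative only. */
#define SK_DRAIN (1u << 0)
#define SK_LINK  (1u << 1)

/*
 * Walk the submission array. A chain starts at the first SQE that has
 * SK_LINK set and ends at the first following SQE without it. If any
 * member of the chain asks for drain, the head of the chain gets the
 * drain flag as well.
 */
static void propagate_drain_to_head(unsigned *flags, size_t n)
{
    size_t i = 0;

    assert(flags != NULL || n == 0);
    while (i < n) {
        if (!(flags[i] & SK_LINK)) {
            i++;
            continue;
        }
        size_t head = i;
        unsigned drain = 0;
        /* Members: every linked SQE plus the one that ends the chain. */
        while (i < n) {
            drain |= flags[i] & SK_DRAIN;
            if (!(flags[i++] & SK_LINK))
                break;
        }
        flags[head] |= drain;
    }
}
```

Applied to the seven SQEs from the diagram (N, L, L, L+D, N, N, N), the sketch turns SQE 1 into L+D and leaves the others unchanged, which matches the first picture.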
    • io_uring: fix wrong sequence setting logic · 8776f3fa
      Jackie Liu committed
      The sqo_thread fetches SQ ring entries in batches, so
      ctx->cached_sq_head also advances in batches. If one of these
      SQEs has the DRAIN flag set, it never gets a chance to be
      processed, and the sqo_thread never exits.
      
      Fixes: de0617e4 ("io_uring: add support for marking commands as draining")
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 07 Sep, 2019 1 commit
    • io_uring: expose single mmap capability · ac90f249
      Jens Axboe committed
      After commit 75b28aff we can get by with just a single mmap to
      map both the sq and cq ring. However, userspace doesn't know that.
      
      Add a features variable to io_uring_params, and notify userspace
      that the kernel has this ability. This can then be used in liburing
      (or in applications directly) to avoid the second mmap.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
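How userspace might act on the new features field can be sketched like this. The kernel header names the bit IORING_FEAT_SINGLE_MMAP; the sketch uses a local stand-in `SK_FEAT_SINGLE_MMAP` so that it compiles without kernel headers, and `struct ring_map_plan` and `plan_ring_mmaps()` are hypothetical helpers, not liburing API.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the feature bit reported in io_uring_params.features. */
#define SK_FEAT_SINGLE_MMAP (1u << 0)

/* Hypothetical plan for mapping the SQ and CQ rings. */
struct ring_map_plan {
    int    nr_mmaps;  /* mmap() calls needed for the two rings */
    size_t sq_len;    /* length to map for the SQ ring */
    size_t cq_len;    /* length to map for the CQ ring */
};

static struct ring_map_plan plan_ring_mmaps(unsigned features,
                                            size_t sq_len, size_t cq_len)
{
    struct ring_map_plan plan = { 2, sq_len, cq_len };

    assert(sq_len > 0 && cq_len > 0);
    if (features & SK_FEAT_SINGLE_MMAP) {
        /* One mapping covers both rings, so it must fit the larger one. */
        size_t len = sq_len > cq_len ? sq_len : cq_len;

        plan.nr_mmaps = 1;
        plan.sq_len = len;
        plan.cq_len = len;
    }
    return plan;
}
```

On a kernel that does not set the bit, the plan stays at two separate mappings, so older kernels keep working unchanged.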
  4. 28 Aug, 2019 2 commits
  5. 23 Aug, 2019 1 commit
  6. 21 Aug, 2019 2 commits
    • io_uring: don't enter poll loop if we have CQEs pending · a3a0e43f
      Jens Axboe committed
      We need to check if we have CQEs pending before starting a poll loop,
      as those could be the events we will be spinning for (and hence we'll
      find none). This can happen if a CQE triggers an error, or if it is
      found by eg an IRQ before we get a chance to find it through polling.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix potential hang with polled IO · 500f9fba
      Jens Axboe committed
      If a request issue ends up being punted to async context to avoid
      blocking, we can get into a situation where the original application
      enters the poll loop for that very request before it has been issued.
      This should not be an issue, except that the polling will hold the
      io_uring uring_ctx mutex for the duration of the poll. When the async
      worker has actually issued the request, it needs to acquire this mutex
      to add the request to the poll issued list. Since the application
      polling is already holding this mutex, the workqueue sleeps on the
      mutex forever, and the application thus never gets a chance to poll for
      the very request it was interested in.
      
      Fix this by ensuring that the polling drops the uring_ctx occasionally
      if it's not making any progress.
      Reported-by: Jeffrey M. Birnbaum <jmbnyc@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 16 Aug, 2019 2 commits
    • io_uring: fix an issue when IOSQE_IO_LINK is inserted into defer list · a982eeb0
      Jackie Liu committed
      This patch may fix two issues:
      
      First, when IOSQE_IO_DRAIN is set, the following IOs need to be inserted
      into the defer list to delay execution, but link IO is actively scheduled
      to run by calling io_queue_sqe.
      
      Second, when multiple LINK_IOs are inserted into the defer_list together,
      the LINK_IOs no longer keep their order.
      
         |-------------|
         |   LINK_IO   |      ----> insert to defer_list  -----------
         |-------------|                                            |
         |   LINK_IO   |      ----> insert to defer_list  ----------|
         |-------------|                                            |
         |   LINK_IO   |      ----> insert to defer_list  ----------|
         |-------------|                                            |
         |   NORMAL_IO |      ----> insert to defer_list  ----------|
         |-------------|                                            |
                                                                    |
                                    queue_work at same time   <-----|
      
      Fixes: 9e645e11 ("io_uring: add support for sqe links")
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix manual setup of iov_iter for fixed buffers · 99c79f66
      Aleix Roca Nonell committed
      Commit bd11b3a3 ("io_uring: don't use iov_iter_advance() for fixed
      buffers") introduced an optimization to avoid using the slow
      iov_iter_advance by manually populating the iov_iter iterator in some
      cases.
      
      However, the computation of the iterator count field was erroneous: The
      first bvec was always accounted for an extent of page size even if the
      bvec length was smaller.
      
      In consequence, some I/O operations on fixed buffers were unable to
      operate on the full extent of the buffer, consistently skipping some
      bytes at the end of it.
      
      Fixes: bd11b3a3 ("io_uring: don't use iov_iter_advance() for fixed buffers")
      Cc: stable@vger.kernel.org
      Signed-off-by: Aleix Roca Nonell <aleix.rocanonell@bsc.es>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 31 Jul, 2019 1 commit
    • io_uring: fix KASAN use after free in io_sq_wq_submit_work · d0ee8791
      Jackie Liu committed
      [root@localhost ~]# ./liburing/test/link
      
      QEMU Standard PC reports:
      
      [   29.379892] CPU: 0 PID: 84 Comm: kworker/u2:2 Not tainted 5.3.0-rc2-00051-g4010b622-dirty #86
      [   29.379902] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      [   29.379913] Workqueue: io_ring-wq io_sq_wq_submit_work
      [   29.379929] Call Trace:
      [   29.379953]  dump_stack+0xa9/0x10e
      [   29.379970]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.379986]  print_address_description.cold.6+0x9/0x317
      [   29.379999]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380010]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380026]  __kasan_report.cold.7+0x1a/0x34
      [   29.380044]  ? io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380061]  kasan_report+0xe/0x12
      [   29.380076]  io_sq_wq_submit_work+0xbf4/0xe90
      [   29.380104]  ? io_sq_thread+0xaf0/0xaf0
      [   29.380152]  process_one_work+0xb59/0x19e0
      [   29.380184]  ? pwq_dec_nr_in_flight+0x2c0/0x2c0
      [   29.380221]  worker_thread+0x8c/0xf40
      [   29.380248]  ? __kthread_parkme+0xab/0x110
      [   29.380265]  ? process_one_work+0x19e0/0x19e0
      [   29.380278]  kthread+0x30b/0x3d0
      [   29.380292]  ? kthread_create_on_node+0xe0/0xe0
      [   29.380311]  ret_from_fork+0x3a/0x50
      
      [   29.380635] Allocated by task 209:
      [   29.381255]  save_stack+0x19/0x80
      [   29.381268]  __kasan_kmalloc.constprop.6+0xc1/0xd0
      [   29.381279]  kmem_cache_alloc+0xc0/0x240
      [   29.381289]  io_submit_sqe+0x11bc/0x1c70
      [   29.381300]  io_ring_submit+0x174/0x3c0
      [   29.381311]  __x64_sys_io_uring_enter+0x601/0x780
      [   29.381322]  do_syscall_64+0x9f/0x4d0
      [   29.381336]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [   29.381633] Freed by task 84:
      [   29.382186]  save_stack+0x19/0x80
      [   29.382198]  __kasan_slab_free+0x11d/0x160
      [   29.382210]  kmem_cache_free+0x8c/0x2f0
      [   29.382220]  io_put_req+0x22/0x30
      [   29.382230]  io_sq_wq_submit_work+0x28b/0xe90
      [   29.382241]  process_one_work+0xb59/0x19e0
      [   29.382251]  worker_thread+0x8c/0xf40
      [   29.382262]  kthread+0x30b/0x3d0
      [   29.382272]  ret_from_fork+0x3a/0x50
      
      [   29.382569] The buggy address belongs to the object at ffff888067172140
                      which belongs to the cache io_kiocb of size 224
      [   29.384692] The buggy address is located 120 bytes inside of
                      224-byte region [ffff888067172140, ffff888067172220)
      [   29.386723] The buggy address belongs to the page:
      [   29.387575] page:ffffea00019c5c80 refcount:1 mapcount:0 mapping:ffff88806ace5180 index:0x0
      [   29.387587] flags: 0x100000000000200(slab)
      [   29.387603] raw: 0100000000000200 dead000000000100 dead000000000122 ffff88806ace5180
      [   29.387617] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
      [   29.387624] page dumped because: kasan: bad access detected
      
      [   29.387920] Memory state around the buggy address:
      [   29.388771]  ffff888067172080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
      [   29.390062]  ffff888067172100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
      [   29.391325] >ffff888067172180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [   29.392578]                                         ^
      [   29.393480]  ffff888067172200: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.394744]  ffff888067172280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   29.396003] ==================================================================
      [   29.397260] Disabling lock debugging due to kernel taint
      
      io_sq_wq_submit_work frees req and then reads it again.
      
      Cc: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Cc: linux-block@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: f7b76ac9 ("io_uring: fix counter inc/dec mismatch in async_list")
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 26 Jul, 2019 1 commit
  10. 22 Jul, 2019 2 commits
    • io_uring: track io length in async_list based on bytes · 9310a7ba
      Zhengyuan Liu committed
      We are using PAGE_SIZE as the unit to determine if the total len in
      async_list has exceeded max_pages, which is not fair for smaller IO
      sizes. For example, if we are doing 1k-size IO streams, we will never
      exceed max_pages since len >>= PAGE_SHIFT always yields zero. So track
      the length in bytes to make it more accurate.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
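The accounting problem from this commit is easy to reproduce in isolation. The sketch below assumes a 4 KiB page (`SK_PAGE_SHIFT` of 12); both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12  /* assumed 4 KiB pages */

/* Old accounting: each IO is converted to whole pages before summing. */
static size_t account_in_pages(size_t io_len, int nr_ios)
{
    size_t total = 0;

    assert(nr_ios >= 0);
    for (int i = 0; i < nr_ios; i++)
        total += io_len >> SK_PAGE_SHIFT;  /* sub-page IO contributes 0 */
    return total;
}

/* New accounting: sum the raw byte counts and compare against a byte limit. */
static size_t account_in_bytes(size_t io_len, int nr_ios)
{
    size_t total = 0;

    assert(nr_ios >= 0);
    for (int i = 0; i < nr_ios; i++)
        total += io_len;
    return total;
}
```

A thousand 1 KiB requests add up to zero pages under the old scheme, while the byte total correctly corresponds to 250 pages of data.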
    • io_uring: don't use iov_iter_advance() for fixed buffers · bd11b3a3
      Jens Axboe committed
      Hrvoje reports that when a large fixed buffer is registered and IO is
      being done to the latter pages of said buffer, the IO submission time
      is much worse:
      
      reading to the start of the buffer: 11238 ns
      reading to the end of the buffer:   1039879 ns
      
      In fact, it's worse by two orders of magnitude. The reason for that is
      how io_uring figures out how to setup the iov_iter. We point the iter
      at the first bvec, and then use iov_iter_advance() to fast-forward to
      the offset within that buffer we need.
      
      However, that is abysmally slow, as it entails iterating the bvecs
      that we setup as part of buffer registration. There's really no need
      to use this generic helper, as we know it's a BVEC type iterator, and
      we also know that each bvec is PAGE_SIZE in size, apart from possibly
      the first and last. Hence we can just use a shift on the offset to
      find the right index, and then adjust the iov_iter appropriately.
      After this fix, the timings are:
      
      reading to the start of the buffer: 10135 ns
      reading to the end of the buffer:   1377 ns
      
      Or about a 755x improvement for the tail page.
      Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
      Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
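The shift-based lookup can be compared against a linear walk in a small sketch. It assumes 4 KiB pages and a first segment that may be shorter than a page, with every later segment being exactly one page; `struct seg_pos`, `locate()` and `locate_linear()` are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  ((size_t)1 << SK_PAGE_SHIFT)

/* Position inside a segmented buffer: segment index plus offset within it. */
struct seg_pos {
    size_t index;
    size_t offset;
};

/* Constant-time lookup: skip the first segment, then shift and mask. */
static struct seg_pos locate(size_t first_len, size_t offset)
{
    struct seg_pos pos = { 0, offset };

    assert(first_len > 0 && first_len <= SK_PAGE_SIZE);
    if (offset >= first_len) {
        offset -= first_len;
        pos.index = 1 + (offset >> SK_PAGE_SHIFT);
        pos.offset = offset & (SK_PAGE_SIZE - 1);
    }
    return pos;
}

/* Reference: walk segment by segment, as a generic advance helper would. */
static struct seg_pos locate_linear(size_t first_len, size_t offset)
{
    struct seg_pos pos = { 0, offset };
    size_t seg_len = first_len;

    while (pos.offset >= seg_len) {
        pos.offset -= seg_len;
        pos.index++;
        seg_len = SK_PAGE_SIZE;
    }
    return pos;
}
```

Both functions return the same position, but the linear walk does work proportional to the offset, which is where the slow tail-page timings in the report come from. Using the real length of the first segment rather than a full page is the detail that the later fix 99c79f66 corrects.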
  11. 19 Jul, 2019 1 commit
    • io_uring: add a memory barrier before atomic_read · c0e48f9d
      Zhengyuan Liu committed
      There is a hang when using fio to run a basic test. The issue
      can be easily reproduced using the script below:
      
              while true
              do
                      fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
                           -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
              done
      
      After several minutes (or more), fio blocks in
      io_uring_enter->io_cqring_wait, waiting for previously
      committed SQEs to complete, and cannot return to userspace until
      we send a SIGTERM to fio. After receiving SIGTERM, fio hangs in
      io_ring_ctx_wait_and_kill with a backtrace like this:
      
              [54133.243816] Call Trace:
              [54133.243842]  __schedule+0x3a0/0x790
              [54133.243868]  schedule+0x38/0xa0
              [54133.243880]  schedule_timeout+0x218/0x3b0
              [54133.243891]  ? sched_clock+0x9/0x10
              [54133.243903]  ? wait_for_completion+0xa3/0x130
              [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
              [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
              [54133.243951]  wait_for_completion+0xab/0x130
              [54133.243962]  ? wake_up_q+0x70/0x70
              [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
              [54133.243998]  io_uring_release+0x20/0x30
              [54133.244008]  __fput+0xcf/0x270
              [54133.244029]  ____fput+0xe/0x10
              [54133.244040]  task_work_run+0x7f/0xa0
              [54133.244056]  do_exit+0x305/0xc40
              [54133.244067]  ? get_signal+0x13b/0xbd0
              [54133.244088]  do_group_exit+0x50/0xd0
              [54133.244103]  get_signal+0x18d/0xbd0
              [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
              [54133.244142]  do_signal+0x34/0x720
              [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
              [54133.244190]  exit_to_usermode_loop+0xc0/0x130
              [54133.244209]  do_syscall_64+0x16b/0x1d0
              [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that we had added a req to ctx->pending_async at the very
      end, but it didn't get a chance to be processed. How could this happen?
      
              fio#cpu0                                        wq#cpu1
      
              io_add_to_prev_work                    io_sq_wq_submit_work
      
                atomic_read() <<< 1
      
                                                        atomic_dec_return() << 1->0
                                                        list_empty();    <<< true;
      
                list_add_tail()
                atomic_read() << 0 or 1?
      
      As atomic_ops.rst states, atomic_read does not guarantee that the
      runtime modification by any other thread is visible yet, so we must take
      care of that with a proper implicit or explicit memory barrier.
      
      This issue was detected with the help of Jackie <liuyun01@kylinos.cn>.
      
      Fixes: 31b51510 ("io_uring: allow workqueue item to handle multiple buffered requests")
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 17 Jul, 2019 1 commit
  13. 16 Jul, 2019 2 commits
  14. 10 Jul, 2019 3 commits
    • io_uring: fix io_sq_thread_stop running in front of io_sq_thread · a4c0b3de
      Jackie Liu committed
      INFO: task syz-executor.5:8634 blocked for more than 143 seconds.
             Not tainted 5.2.0-rc5+ #3
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      syz-executor.5  D25632  8634   8224 0x00004004
      Call Trace:
        context_switch kernel/sched/core.c:2818 [inline]
        __schedule+0x658/0x9e0 kernel/sched/core.c:3445
        schedule+0x131/0x1d0 kernel/sched/core.c:3509
        schedule_timeout+0x9a/0x2b0 kernel/time/timer.c:1783
        do_wait_for_common+0x35e/0x5a0 kernel/sched/completion.c:83
        __wait_for_common kernel/sched/completion.c:104 [inline]
        wait_for_common kernel/sched/completion.c:115 [inline]
        wait_for_completion+0x47/0x60 kernel/sched/completion.c:136
        kthread_stop+0xb4/0x150 kernel/kthread.c:559
        io_sq_thread_stop fs/io_uring.c:2252 [inline]
        io_finish_async fs/io_uring.c:2259 [inline]
        io_ring_ctx_free fs/io_uring.c:2770 [inline]
        io_ring_ctx_wait_and_kill+0x268/0x880 fs/io_uring.c:2834
        io_uring_release+0x5d/0x70 fs/io_uring.c:2842
        __fput+0x2e4/0x740 fs/file_table.c:280
        ____fput+0x15/0x20 fs/file_table.c:313
        task_work_run+0x17e/0x1b0 kernel/task_work.c:113
        tracehook_notify_resume include/linux/tracehook.h:185 [inline]
        exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
        prepare_exit_to_usermode+0x402/0x4f0 arch/x86/entry/common.c:199
        syscall_return_slowpath+0x110/0x440 arch/x86/entry/common.c:279
        do_syscall_64+0x126/0x140 arch/x86/entry/common.c:304
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x412fb1
      Code: 80 3b 7c 0f 84 c7 02 00 00 c7 85 d0 00 00 00 00 00 00 00 48 8b 05 cf
      a6 24 00 49 8b 14 24 41 b9 cb 2a 44 00 48 89 ee 48 89 df <48> 85 c0 4c 0f
      45 c8 45 31 c0 31 c9 e8 0e 5b 00 00 85 c0 41 89 c7
      RSP: 002b:00007ffe7ee6a180 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
      RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000412fb1
      RDX: 0000001b2d920000 RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 0000000000000001 R08: 00000000f3a3e1f8 R09: 00000000f3a3e1fc
      R10: 00007ffe7ee6a260 R11: 0000000000000293 R12: 000000000075c9a0
      R13: 000000000075c9a0 R14: 0000000000024c00 R15: 000000000075bf2c
      
      =============================================
      
      The logic is wrong when kthread_park() runs
      before io_sq_thread().
      
      CPU#0					CPU#1
      
      io_sq_thread_stop:			int kthread(void *_create):
      
      kthread_park()
      					__kthread_parkme(self);	 <<< Wrong
      kthread_stop()
          << wait for self->exited
          << clear_bit KTHREAD_SHOULD_PARK
      
      					ret = threadfn(data);
      					   |
      					   |- io_sq_thread
      					       |- kthread_should_park()	<< false
      					       |- schedule() <<< nobody wake up
      
      stuck CPU#0				stuck CPU#1
      
      So, use a new variable sqo_thread_started to ensure that io_sq_thread
      runs first, and only then io_sq_thread_stop.
      
      Reported-by: syzbot+94324416c485d422fe15@syzkaller.appspotmail.com
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for recvmsg() · aa1fa28f
      Jens Axboe committed
      This is done through IORING_OP_RECVMSG. This opcode uses the same
      sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
      msghdr struct in the sqe->addr field as well.
      
      We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for sendmsg() · 0fa03c62
      Jens Axboe committed
      This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
      for the flags argument, and the msghdr struct is passed in the
      sqe->addr field.
      
      We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 29 Jun, 2019 2 commits
  16. 24 Jun, 2019 1 commit
    • io_uring: add support for sqe links · 9e645e11
      Jens Axboe committed
      With SQE links, we can create chains of dependent SQEs. One example
      would be queueing an SQE that's a read from one file descriptor, with
      the linked SQE being a write to another with the same set of buffers.
      
      An SQE link will not stall the pipeline, it'll just ensure that
      dependent SQEs aren't issued before the previous link has completed.
      
      Any error at submission or completion time will break the chain of SQEs.
      For completions, this also includes short reads or writes, as the next
      SQE could depend on the previous one being fully completed.
      
      Any SQE in a chain that gets canceled due to any of the above errors
      will get a CQE filled with -ECANCELED as the error value.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
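The chain-breaking rule can be modelled with a few lines of C. In this sketch each request carries the result it would produce if it ran; `struct sk_link_req` and `sk_complete_chain()` are hypothetical names for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical linked request: what was asked for and what came back. */
struct sk_link_req {
    long expected;  /* bytes requested */
    long result;    /* bytes transferred, or a negative errno */
};

/*
 * Complete a chain in order. As soon as one request fails or comes up
 * short, every later request is completed with -ECANCELED without
 * being run. Returns the number of requests that actually ran.
 */
static size_t sk_complete_chain(struct sk_link_req *chain, size_t n)
{
    size_t ran = 0;
    int broken = 0;

    assert(chain != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (broken) {
            chain[i].result = -ECANCELED;
            continue;
        }
        ran++;
        if (chain[i].result != chain[i].expected)
            broken = 1;  /* error or short transfer breaks the chain */
    }
    return ran;
}
```

A short read in the middle of a three-request chain therefore leaves the first two results intact and cancels the third.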
  17. 22 Jun, 2019 1 commit
    • io_uring: ensure req->file is cleared on allocation · 60c112b0
      Jens Axboe committed
      Stephen reports:
      
      I hit the following General Protection Fault when testing io_uring via
      the io_uring engine in fio. This was on a VM running 5.2-rc5 and the
      latest version of fio. The issue occurs for both null_blk and fake NVMe
      drives. I have not tested bare metal or real NVMe SSDs. The fio script
      used is given below.
      
      [io_uring]
      time_based=1
      runtime=60
      filename=/dev/nvme2n1 (note /dev/nullb0 also fails)
      ioengine=io_uring
      bs=4k
      rw=readwrite
      direct=1
      fixedbufs=1
      sqthread_poll=1
      sqthread_poll_cpu=0
      
      general protection fault: 0000 [#1] SMP PTI
      CPU: 0 PID: 872 Comm: io_uring-sq Not tainted 5.2.0-rc5-cpacket-io-uring #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      RIP: 0010:fput_many+0x7/0x90
      Code: 01 48 85 ff 74 17 55 48 89 e5 53 48 8b 1f e8 a0 f9 ff ff 48 85 db 48 89 df 75 f0 5b 5d f3 c3 0f 1f 40 00 0f 1f 44 00 00 89 f6 <f0> 48 29 77 38 74 01 c3 55 48 89 e5 53 48 89 fb 65 48 \
      
      RSP: 0018:ffffadeb817ebc50 EFLAGS: 00010246
      RAX: 0000000000000004 RBX: ffff8f46ad477480 RCX: 0000000000001805
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: f18b51b9a39552b5
      RBP: ffffadeb817ebc58 R08: ffff8f46b7a318c0 R09: 000000000000015d
      R10: ffffadeb817ebce8 R11: 0000000000000020 R12: ffff8f46ad4cd000
      R13: 00000000fffffff7 R14: ffffadeb817ebe30 R15: 0000000000000004
      FS:  0000000000000000(0000) GS:ffff8f46b7a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055828f0bbbf0 CR3: 0000000232176004 CR4: 00000000003606f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       ? fput+0x13/0x20
       io_free_req+0x20/0x40
       io_put_req+0x1b/0x20
       io_submit_sqe+0x40a/0x680
       ? __switch_to_asm+0x34/0x70
       ? __switch_to_asm+0x40/0x70
       io_submit_sqes+0xb9/0x160
       ? io_submit_sqes+0xb9/0x160
       ? __switch_to_asm+0x40/0x70
       ? __switch_to_asm+0x34/0x70
       ? __schedule+0x3f2/0x6a0
       ? __switch_to_asm+0x34/0x70
       io_sq_thread+0x1af/0x470
       ? __switch_to_asm+0x34/0x70
       ? wait_woken+0x80/0x80
       ? __switch_to+0x85/0x410
       ? __switch_to_asm+0x40/0x70
       ? __switch_to_asm+0x34/0x70
       ? __schedule+0x3f2/0x6a0
       kthread+0x105/0x140
       ? io_submit_sqes+0x160/0x160
       ? kthread+0x105/0x140
       ? io_submit_sqes+0x160/0x160
       ? kthread_destroy_worker+0x50/0x50
       ret_from_fork+0x35/0x40
      
      which occurs because using a kernel side submission thread isn't valid
      without using fixed files (registered through io_uring_register()). This
      causes io_uring to put the request after logging an error, but before
      the file field is set in the request. If it happens to be non-zero, we
      attempt to fput() garbage.
      
      Fix this by ensuring that req->file is initialized when the request is
      allocated.
      
      Cc: stable@vger.kernel.org # 5.1+
      Reported-by: Stephen Bates <sbates@raithlin.com>
      Tested-by: Stephen Bates <sbates@raithlin.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
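The class of bug fixed here (an error path releasing a field that was never initialized) can be sketched in user space. `struct sk_req`, `sk_alloc_req()`, `sk_free_req()` and the `sk_fput()` counter are hypothetical stand-ins for the kernel structures and for fput().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical request with an optional file reference. */
struct sk_req {
    void *file;
    int   error;
};

static int sk_puts;  /* counts calls to the fput() stand-in */

static void sk_fput(void *file)
{
    (void)file;
    sk_puts++;
}

/* malloc() returns indeterminate memory, so every field is set explicitly. */
static struct sk_req *sk_alloc_req(void)
{
    struct sk_req *req = malloc(sizeof(*req));

    if (!req)
        return NULL;
    req->file = NULL;  /* the fix: never leave this field uninitialized */
    req->error = 0;
    return req;
}

/* The free path drops the file reference only if one was ever taken. */
static void sk_free_req(struct sk_req *req)
{
    assert(req != NULL);
    if (req->file)
        sk_fput(req->file);
    free(req);
}
```

If a request is freed on an early error path before a file was assigned, the free routine now sees NULL and skips the put instead of acting on whatever happened to be in memory.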
  18. 13 Jun, 2019 1 commit
    • io_uring: fix memory leak of UNIX domain socket inode · 355e8d26
      Eric Biggers committed
      Opening and closing an io_uring instance leaks a UNIX domain socket
      inode.  This is because the ->file of the io_uring instance's internal
      UNIX domain socket is set to point to the io_uring file, but then
      sock_release() sees the non-NULL ->file and assumes the inode reference
      is held by the file so doesn't call iput().  That's not the case here,
      since the reference is still meant to be held by the socket; the actual
      inode of the io_uring file is different.
      
      Fix this leak by NULL-ing out ->file before releasing the socket.
      
      Reported-by: syzbot+111cb28d9f583693aefa@syzkaller.appspotmail.com
      Fixes: 2b188cc1 ("Add io_uring IO interface")
      Cc: <stable@vger.kernel.org> # v5.1+
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. 01 Jun, 2019 2 commits
    • io_uring: punt short reads to async context · 9d93a3f5
      Jens Axboe committed
      We can encounter a short read when we're doing buffered reads and the
      data is partially cached. Right now we just return the short read, but
      that forces the application to read that CQE, then issue another SQE
      to finish the read. That read will not be cached, and hence will result
      in an async punt.
      
      It's more efficient to do that async punt from within the kernel, as
      that then avoids two more round trips to the kernel.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • uio: make import_iovec()/compat_import_iovec() return bytes on success · 87e5e6da
      Jens Axboe committed
      Currently these functions return < 0 on error, and 0 for success.
      Change that so that we return < 0 on error, but number of bytes
      for success.
      
      Some callers already treat the return value that way, others need a
      slight tweak.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  20. 26 May, 2019 1 commit
  21. 16 May, 2019 4 commits
    • io_uring: use wait_event_interruptible for cq_wait conditional wait · fdb288a6
      Jackie Liu committed
      The previous patch ensured that io_cqring_events contains the
      smp_rmb memory barrier. Now we can use wait_event_interruptible
      to keep the code simple.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: adjust smp_rmb inside io_cqring_events · dc6ce4bc
      Jackie Liu committed
      Every caller of io_cqring_events needs an smp_rmb, so
      move the smp_rmb inside io_cqring_events.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix infinite wait in khread_park() on io_finish_async() · 2bbcd6d3
      Roman Penyaev committed
      This fixes a couple of races which lead to an infinite wait for park
      completion, with the following backtraces:
      
        [20801.303319] Call Trace:
        [20801.303321]  ? __schedule+0x284/0x650
        [20801.303323]  schedule+0x33/0xc0
        [20801.303324]  schedule_timeout+0x1bc/0x210
        [20801.303326]  ? schedule+0x3d/0xc0
        [20801.303327]  ? schedule_timeout+0x1bc/0x210
        [20801.303329]  ? preempt_count_add+0x79/0xb0
        [20801.303330]  wait_for_completion+0xa5/0x120
        [20801.303331]  ? wake_up_q+0x70/0x70
        [20801.303333]  kthread_park+0x48/0x80
        [20801.303335]  io_finish_async+0x2c/0x70
        [20801.303336]  io_ring_ctx_wait_and_kill+0x95/0x180
        [20801.303338]  io_uring_release+0x1c/0x20
        [20801.303339]  __fput+0xad/0x210
        [20801.303341]  task_work_run+0x8f/0xb0
        [20801.303342]  exit_to_usermode_loop+0xa0/0xb0
        [20801.303343]  do_syscall_64+0xe0/0x100
        [20801.303349]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [20801.303380] Call Trace:
        [20801.303383]  ? __schedule+0x284/0x650
        [20801.303384]  schedule+0x33/0xc0
        [20801.303386]  io_sq_thread+0x38a/0x410
        [20801.303388]  ? __switch_to_asm+0x40/0x70
        [20801.303390]  ? wait_woken+0x80/0x80
        [20801.303392]  ? _raw_spin_lock_irqsave+0x17/0x40
        [20801.303394]  ? io_submit_sqes+0x120/0x120
        [20801.303395]  kthread+0x112/0x130
        [20801.303396]  ? kthread_create_on_node+0x60/0x60
        [20801.303398]  ret_from_fork+0x35/0x40
      
       o kthread_park() waits for park completion, so the io_sq_thread() loop
         should check kthread_should_park() along with kthread_should_stop();
         otherwise, if kthread_park() is called before prepare_to_wait(),
         the following schedule() never returns:
      
         CPU#0                    CPU#1
      
         io_sq_thread_stop():     io_sq_thread():
      
                                     while(!kthread_should_stop() && !ctx->sqo_stop) {

            ctx->sqo_stop = 1;
            kthread_park()

                                         prepare_to_wait();
                                         if (kthread_should_stop()) {
                                         }
                                         schedule();   <<< nobody checks park flag,
                                                       <<< so schedule and never return
      
       o if the flag ctx->sqo_stop is observed by the io_sq_thread() loop,
         it is quite possible that the kthread_should_park() check fails and
         the following kthread_parkme() is never called, because kthread_park()
         has not been called yet. A few moments later it is called and
         waits for park completion, which never happens, because the
         kthread has already exited:
      
         CPU#0                    CPU#1
      
         io_sq_thread_stop():     io_sq_thread():
      
            ctx->sqo_stop = 1;
                                     while(!kthread_should_stop() && !ctx->sqo_stop) {
                                         <<< observe sqo_stop and exit the loop
                                     }

                                     if (kthread_should_park())
                                         kthread_parkme();  <<< never called, since was
                                                            <<< never parked

            kthread_park()           <<< waits forever for park completion
      
      In the current patch we quit the loop only on the kthread_should_park()
      check (kthread_park() is synchronous, so kthread_should_stop() is
      never observed), and we abandon the ->sqo_stop flag, since it is racy.
      At the end of io_sq_thread() we unconditionally call kthread_parkme(),
      since we've exited the loop by the park flag.
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2bbcd6d3
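The second interleaving from the commit above can be replayed deterministically in a small single-threaded model. Everything here is hypothetical (`struct model_kthread`, the `*_worker_run()` and `*_stop_sequence()` helpers); it is not the kernel kthread API, only a sketch of why exiting on a private stop flag loses the park handshake while exiting on the park flag does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the park handshake state. */
struct model_kthread {
	bool should_park; /* models the flag set by kthread_park() */
	bool parked;      /* models the completion kthread_park() waits on */
	bool exited;      /* the thread function has returned */
};

/* Old worker: leaves its loop on sqo_stop, parks only if already asked. */
void old_worker_run(struct model_kthread *t, bool sqo_stop)
{
	assert(t != NULL);
	if (t->exited || !sqo_stop)
		return;
	if (t->should_park)
		t->parked = true;
	t->exited = true;
}

/* Fixed worker: the park flag is the only exit condition, and the
 * park step after the loop is unconditional. */
void fixed_worker_run(struct model_kthread *t)
{
	assert(t != NULL);
	if (t->exited || !t->should_park)
		return; /* no park request yet: keep looping */
	t->parked = true;
	t->exited = true;
}

/* Returns whether the parker's wait for completion would ever finish. */
bool old_stop_sequence(void)
{
	struct model_kthread t = {0};
	bool sqo_stop = true;         /* CPU#0: ctx->sqo_stop = 1 */

	old_worker_run(&t, sqo_stop); /* CPU#1: sees sqo_stop, exits unparked */
	t.should_park = true;         /* CPU#0: kthread_park() */
	old_worker_run(&t, sqo_stop); /* CPU#1: already gone, nothing happens */
	return t.parked;
}

bool fixed_stop_sequence(void)
{
	struct model_kthread t = {0};

	fixed_worker_run(&t);         /* CPU#1: no request yet, keeps looping */
	t.should_park = true;         /* CPU#0: kthread_park() */
	fixed_worker_run(&t);         /* CPU#1: sees the flag, exits, parks */
	return t.parked;
}
```

In this model the old sequence never signals the completion, which corresponds to the infinite wait in the first backtrace.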
    • J
      io_uring: remove 'ev_flags' argument · c71ffb67
      Jens Axboe committed
      We always pass in 0 for the cqe flags argument, since support for
      the "this read hit page cache" hint was dropped.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c71ffb67
  22. 15 May, 2019 2 commits
    • J
      io_uring: fix failure to verify SQ_AFF cpu · 44a9bd18
      Jens Axboe committed
      The test case we have is rightfully failing with the current kernel:
      
      io_uring_setup(1, 0x7ffe2cafebe0), flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 4
      expected -1, got 3
      
      This is in a VM, and CPU3 is the last valid one, hence asking for 4
      should fail the setup with -EINVAL, not succeed. The problem is that
      we're using array_index_nospec() with nr_cpu_ids as the bound, hence
      the out-of-range index is clamped and we end up using CPU0 instead
      of CPU4. This makes the setup succeed where it should be failing.
      
      We don't need to use array_index_nospec() as we're not indexing any
      array with this. Instead just compare with nr_cpu_ids directly. This
      is fine as we're checking with cpu_online() afterwards.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      44a9bd18
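A userspace sketch of the bug and the fix described above. `model_index_nospec()` imitates only the architectural result of array_index_nospec() (in-bounds indexes pass through, out-of-bounds ones become 0) and omits the speculation barrier; `model_cpu_online()` and both validators are hypothetical, and the model assumes every CPU below `nr_cpu_ids` is online.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Architectural effect of array_index_nospec() only: clamp to 0 when
 * the index is out of bounds. No speculation barrier in this model. */
unsigned model_index_nospec(unsigned index, unsigned size)
{
	return index < size ? index : 0;
}

/* Assumption for the sketch: all CPUs below nr_online are online. */
bool model_cpu_online(unsigned cpu, unsigned nr_online)
{
	return cpu < nr_online;
}

/* Before: an out-of-range CPU is silently turned into CPU0, which is
 * online, so validation passes for a CPU that does not exist. */
int validate_cpu_old(unsigned requested, unsigned nr_cpu_ids)
{
	unsigned cpu;

	assert(nr_cpu_ids > 0);
	cpu = model_index_nospec(requested, nr_cpu_ids);
	if (!model_cpu_online(cpu, nr_cpu_ids))
		return -EINVAL;
	return (int)cpu;
}

/* After: compare against nr_cpu_ids directly, then check online state. */
int validate_cpu_fixed(unsigned requested, unsigned nr_cpu_ids)
{
	assert(nr_cpu_ids > 0);
	if (requested >= nr_cpu_ids)
		return -EINVAL;
	if (!model_cpu_online(requested, nr_cpu_ids))
		return -EINVAL;
	return (int)requested;
}
```

With four CPUs, asking for CPU 4 is accepted as CPU0 by the old check and rejected with -EINVAL by the fixed one.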
    • I
      mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
      Ira Weiny committed
      Patch series "Add FOLL_LONGTERM to GUP fast and use it".
      
      HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
      get_user_pages_fast().  Some callers would like to do a longterm (user
      controlled) pin of pages with the fast variant of GUP for performance
      purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
      
      This patch does not change any functionality.  In the short term
      "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
      As a side effect some of the interfaces are cleaned up but this is not the
      primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
      Second, it depends on how often you are registering memory.  I have spoken
      with some RDMA users who consider MR in the performance path...  For the
      overall application performance.  I don't have the numbers as the tests
      for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they rework the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      932f4a63
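The interface change in the commit above, a flag on the common entry point instead of a separate longterm function, can be sketched as below. This is a hypothetical model: `MODEL_FOLL_*` values, `struct model_vma` and `model_get_user_pages()` are invented for illustration and are not the kernel definitions. The -EOPNOTSUPP result for FS DAX mappings is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, not the kernel's. */
#define MODEL_FOLL_WRITE    0x1u
#define MODEL_FOLL_LONGTERM 0x2u

/* Hypothetical per-page mapping info; the real check needs the vmas. */
struct model_vma {
	bool is_fsdax;
};

/* One entry point for every caller. Long-term behaviour is selected by
 * a flag rather than by calling a separate *_longterm() function.
 * Returns the number of pages "pinned" or a negative errno. */
long model_get_user_pages(const struct model_vma *vmas, size_t nr_pages,
			  unsigned flags)
{
	assert(vmas != NULL || nr_pages == 0);
	if (flags & MODEL_FOLL_LONGTERM) {
		/* Long-term pins of FS DAX pages are refused. */
		for (size_t i = 0; i < nr_pages; i++)
			if (vmas[i].is_fsdax)
				return -EOPNOTSUPP;
	}
	return (long)nr_pages;
}
```

A short-term caller passes no extra flag and is unaffected, while a long-term caller opts in to the DAX check with a single bit.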
  23. 13 May, 2019 1 commit
    • S
      io_uring: fix race condition reading SQE data · e2033e33
      Stefan Bühler committed
      When punting to workers the SQE gets copied after the initial try.
      There is a race condition between reading SQE data for the initial try
      and copying it for punting it to the workers.
      
      For example, io_rw_done() calls kiocb->ki_complete even if the request
      was prepared for IORING_OP_FSYNC (in which case it would be NULL).
      
      The easiest solution for now is to always prepare again in the worker.
      
      req->file is safe to prepare though as long as it is checked before use.
      Signed-off-by: Stefan Bühler <source@stbuehler.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e2033e33
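The race described above is a double read of memory that userspace can still modify. The following is a minimal single-threaded sketch with hypothetical types (`struct model_sqe`, `struct model_req`, `model_prep()`, `model_worker_consistent()`), not the io_uring structures: the request is prepared from the shared SQE, the SQE changes, and the worker then runs with a private copy that no longer matches the prepared state unless it prepares again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_op { MODEL_OP_READV, MODEL_OP_FSYNC };

/* Hypothetical SQE living in memory shared with userspace. */
struct model_sqe {
	enum model_op opcode;
};

struct model_req {
	struct model_sqe sqe_copy;  /* private snapshot used by the worker */
	enum model_op prepared_for; /* opcode the state below was built for */
	int (*complete)(void);      /* set for read/write, NULL for fsync */
};

int model_rw_complete(void)
{
	return 0;
}

/* Build per-request state from whatever the given SQE says right now. */
void model_prep(struct model_req *req, const struct model_sqe *sqe)
{
	assert(req != NULL && sqe != NULL);
	req->prepared_for = sqe->opcode;
	req->complete = (sqe->opcode == MODEL_OP_READV) ? model_rw_complete : NULL;
}

/* Worker side: with reprepare set, state is rebuilt from the private
 * copy, so the prepared state and the executed opcode always agree. */
bool model_worker_consistent(struct model_req *req, bool reprepare)
{
	assert(req != NULL);
	if (reprepare)
		model_prep(req, &req->sqe_copy);
	return req->prepared_for == req->sqe_copy.opcode;
}
```

Without the second prepare, the worker in this model would execute a read with state built for an fsync, including a NULL completion callback.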