1. 25 July 2020, 1 commit
  2. 27 June 2020, 2 commits
  3. 15 June 2020, 3 commits
  4. 11 June 2020, 1 commit
    • io_uring: avoid whole io_wq_work copy for requests completed inline · 7cdaf587
      Xiaoguang Wang authored
      If requests can be submitted and completed inline, we don't need to
      initialize the whole io_wq_work in io_init_req(), which is an
      expensive operation. Add a new 'REQ_F_WORK_INITIALIZED' flag to track
      whether io_wq_work has been initialized, and add a helper
      io_req_init_async(); callers must invoke io_req_init_async() before
      first touching any members of io_wq_work.
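      
      For illustration, the lazy-init helper described above could look like
      this minimal sketch (assuming a 'work' member and a 'flags' word on
      io_kiocb; the real struct has many more fields):
      
        static inline void io_req_init_async(struct io_kiocb *req)
        {
                if (req->flags & REQ_F_WORK_INITIALIZED)
                        return;
                /* first touch of io_wq_work: zero it exactly once */
                memset(&req->work, 0, sizeof(req->work));
                req->flags |= REQ_F_WORK_INITIALIZED;
        }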
      
      I used /dev/nullb0 to evaluate the performance improvement on my
      physical machine:
        modprobe null_blk nr_devices=1 completion_nsec=0
        sudo taskset -c 60 fio  -name=fiotest -filename=/dev/nullb0 -iodepth=128
        -thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1
        -time_based -runtime=120
      
      Before this patch:
      Run status group 0 (all jobs):
         READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s),
         io=84.8GiB (91.1GB), run=120001-120001msec
      
      With this patch:
      Run status group 0 (all jobs):
         READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s),
         io=89.2GiB (95.8GB), run=120001-120001msec
      
      About 5% improvement.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 09 June 2020, 1 commit
  6. 04 April 2020, 1 commit
  7. 24 March 2020, 1 commit
    • io-wq: handle hashed writes in chains · 86f3cd1b
      Pavel Begunkov authored
      We always punt async buffered writes to an io-wq helper, as the core
      kernel does not have IOCB_NOWAIT support for that. Most buffered async
      writes complete very quickly, as it's just a copy operation. This means
      that doing multiple locking roundtrips on the shared wqe lock for each
      buffered write is wasteful. Additionally, buffered writes are hashed
      work items, which means that any buffered write to a given file is
      serialized.
      
      Keep identically hashed work items contiguous in @wqe->work_list, and
      track a tail for each hash bucket. On dequeue of a hashed item, splice
      all of the same hash in one go using the tracked tail. Until the batch
      is done, the caller doesn't have to synchronize with the wqe or worker
      locks again.
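      
      A hedged sketch of the insert side (field and helper names here are
      assumptions based on the description, not necessarily the exact code):
      
        static void io_wqe_insert_work(struct io_wqe *wqe, struct io_wq_work *work)
        {
                unsigned int hash = io_get_work_hash(work);
                struct io_wq_work *tail = wqe->hash_tail[hash];
        
                /* remember the newest item for this hash bucket */
                wqe->hash_tail[hash] = work;
                if (!tail) {
                        /* nothing pending for this hash, append normally */
                        wq_list_add_tail(&work->list, &wqe->work_list);
                        return;
                }
                /* chain directly behind the last item of the same hash */
                wq_list_add_after(&work->list, &tail->list, &wqe->work_list);
        }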
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 23 March 2020, 1 commit
  9. 15 March 2020, 1 commit
  10. 05 March 2020, 1 commit
    • io_uring/io-wq: forward submission ref to async · e9fd9396
      Pavel Begunkov authored
      First, it changes the io-wq interfaces: it replaces {get,put}_work()
      with free_work(), which is guaranteed to be called exactly once, and it
      requires the free_work() callback to be non-NULL.
      
      io_uring follows the changes and, instead of putting a submission
      reference in io_put_req_async_completion(), does so in io_free_work().
      As this removes io_get_work() and its corresponding refcount_inc(), the
      ref balance is maintained.
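      
      Sketched shape of the new interface (the exact signature is an
      assumption consistent with the description; creation can then reject a
      NULL callback up front):
      
        typedef void (free_work_fn)(struct io_wq_work *);
        
        struct io_wq_data {
                /* called exactly once for every work item; must be non-NULL */
                free_work_fn *free_work;
        };
        
        struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data);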
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 03 March 2020, 3 commits
  12. 26 February 2020, 1 commit
    • io-wq: ensure work->task_pid is cleared on init · 2d141dd2
      Jens Axboe authored
      We use ->task_pid for exit cancellation, but we need to ensure it's
      cleared to zero for io_req_work_grab_env() to do the right thing. Take
      a suggestion from Bart and clear the whole thing, setting only the
      function passed in. This makes it more future-proof as well.
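      
      Sketch of the suggested init (zero everything, then set only the work
      function; the macro form here is an assumption):
      
        /* the compound literal zeroes every other member, task_pid included */
        #define INIT_IO_WORK(work, _func)                                \
                do {                                                     \
                        *(work) = (struct io_wq_work){ .func = _func };  \
                } while (0)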
      
      Fixes: 36282881 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 10 February 2020, 1 commit
  14. 09 February 2020, 1 commit
  15. 30 January 2020, 1 commit
    • io_uring: fix linked command file table usage · f86cd20c
      Jens Axboe authored
      We're not consistent in how the file table is grabbed and assigned if
      we have a linked command that requires the use of it.
      
      Add ->file_table to the io_op_defs[] array, and use that to determine
      when to grab the table instead of having the handlers set it if they
      need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
      flag. We always initialize work->files, so io-wq can just check for
      that.
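      
      A sketch of the table-driven approach (an illustrative subset; which
      opcodes set the flag, and the surrounding fields, are assumptions):
      
        struct io_op_def {
                /* ... other per-opcode flags ... */
                unsigned file_table : 1;   /* grab the file table up front */
        };
        
        static const struct io_op_def io_op_defs[] = {
                /* ... */
                [IORING_OP_ACCEPT] = {
                        .file_table = 1,
                },
        };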
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 29 January 2020, 2 commits
  17. 21 January 2020, 2 commits
    • io-wq: support concurrent non-blocking work · 895e2ca0
      Jens Axboe authored
      io-wq assumes that work will complete fast (and not block), so it
      doesn't create a new worker when work is enqueued, if we already have
      at least one worker running. This is done on the assumption that if work
      is running, then it will complete fast.
      
      Add an option to force io-wq to fork a new worker for work queued. This
      is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that
      case, io-wq will create a new worker, even though workers are already
      running.
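      
      On the enqueue side this could look roughly like the following sketch
      (helper and field names are assumptions):
      
        static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
        {
                struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
        
                /* ... append work to wqe->work_list under the lock ... */
        
                /* fork a new worker if asked to, or if none is running */
                if ((work->flags & IO_WQ_WORK_CONCURRENT) ||
                    !atomic_read(&acct->nr_running))
                        io_wqe_wake_worker(wqe, acct);
        }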
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: add support for uncancellable work · 0c9d5ccd
      Jens Axboe authored
      Not all work can be cancelled; for some of it we may need to guarantee
      that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
      on work that must not be cancelled. Note that the caller work function
      must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
      IO_WQ_WORK_CANCEL.
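      
      From the work function's side, the contract could be honoured like
      this (simplified; io_fail_req() is a hypothetical helper):
      
        static void io_wq_submit_work(struct io_wq_work *work)
        {
                if (work->flags & IO_WQ_WORK_CANCEL) {
                        if (!(work->flags & IO_WQ_WORK_NO_CANCEL)) {
                                /* cancellable: fail the request and bail */
                                io_fail_req(work, -ECANCELED);
                                return;
                        }
                        /* uncancellable: fall through and run anyway */
                }
                /* ... execute the request to completion ... */
        }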
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. 18 December 2019, 1 commit
  19. 11 December 2019, 1 commit
  20. 05 December 2019, 1 commit
    • io-wq: clear node->next on list deletion · 08bdcc35
      Jens Axboe authored
      If someone removes a node from a list, and then later adds it back to
      a list, we can have invalid data in ->next. This can cause all sorts
      of issues. One such use case is the IORING_OP_POLL_ADD command, which
      will do just that if we race and get woken twice without any pending
      events. This is a pretty rare case, but can happen under extreme loads.
      Dan reports that he saw the following crash:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000000
      PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0
      Oops: 0002 [#1] SMP
      CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116
      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
      RIP: 0010:io_wqe_enqueue+0x3e/0xd0
      Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 <48> 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00
      RSP: 0000:ffffc90006858a08 EFLAGS: 00010082
      RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000
      RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0
      RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8
      R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100
      FS:  00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <IRQ>
       io_poll_wake+0x12f/0x2a0
       __wake_up_common+0x86/0x120
       __wake_up_common_lock+0x7a/0xc0
       sock_def_readable+0x3c/0x70
       tcp_rcv_established+0x557/0x630
       tcp_v6_do_rcv+0x118/0x3c0
       tcp_v6_rcv+0x97e/0x9d0
       ip6_protocol_deliver_rcu+0xe3/0x440
       ip6_input+0x3d/0xc0
       ? ip6_protocol_deliver_rcu+0x440/0x440
       ipv6_rcv+0x56/0xd0
       ? ip6_rcv_finish_core.isra.18+0x80/0x80
       __netif_receive_skb_one_core+0x50/0x70
       netif_receive_skb_internal+0x2f/0xa0
       napi_gro_receive+0x125/0x150
       mlx5e_handle_rx_cqe+0x1d9/0x5a0
       ? mlx5e_poll_tx_cq+0x305/0x560
       mlx5e_poll_rx_cq+0x49f/0x9c5
       mlx5e_napi_poll+0xee/0x640
       ? smp_reschedule_interrupt+0x16/0xd0
       ? reschedule_interrupt+0xf/0x20
       net_rx_action+0x286/0x3d0
       __do_softirq+0xca/0x297
       irq_exit+0x96/0xa0
       do_IRQ+0x54/0xe0
       common_interrupt+0xf/0xf
       </IRQ>
      RIP: 0033:0x7fdc627a2e3a
      Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 <41> 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10
      
      when running a networked workload with about 5000 sockets being polled
      for. Fix this by clearing node->next when the node is being removed from
      the list.
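      
      Sketched against the singly linked list from commit 6206f0e1 (the
      helper's exact shape is an assumption), the fix is the final line:
      
        static inline void wq_node_del(struct io_wq_work_list *list,
                                       struct io_wq_work_node *node,
                                       struct io_wq_work_node *prev)
        {
                if (node == list->first)
                        list->first = node->next;
                if (node == list->last)
                        list->last = prev;
                if (prev)
                        prev->next = node->next;
                node->next = NULL;   /* no stale pointer if re-added later */
        }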
      
      Fixes: 6206f0e1 ("io-wq: shrink io_wq_work a bit")
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  21. 03 December 2019, 1 commit
  22. 02 December 2019, 1 commit
    • io_uring: use current task creds instead of allocating a new one · 0b8c0ec7
      Jens Axboe authored
      syzbot reports:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
        kthread+0x361/0x430 kernel/kthread.c:255
        ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      Modules linked in:
      ---[ end trace f2e1a4307fbe2245 ]---
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      which is caused by slab fault injection triggering a failure in
      prepare_creds(). We don't actually need to create a copy of the creds,
      as we're not modifying them; we just need a reference on the current
      task's creds. This avoids the failure case as well, and propagates the
      const throughout the stack.
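      
      The change boils down to taking a reference instead of allocating a
      copy (a sketch, not the exact diff):
      
        /* before: allocates, and can fail under fault injection */
        struct cred *creds = prepare_creds();
        if (!creds)
                return -ENOMEM;
        
        /* after: reference the current task's creds; cannot fail */
        const struct cred *creds = get_current_cred();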
      
      Fixes: 181e448d ("io_uring: async workers should inherit the user creds")
      Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  23. 27 November 2019, 1 commit
    • io-wq: shrink io_wq_work a bit · 6206f0e1
      Jens Axboe authored
      Currently we're using 40 bytes for the io_wq_work structure, and 16 of
      those are the doubly linked list node. We don't need doubly linked
      lists: we always add to the tail to keep things ordered, and any other
      use case is list traversal with deletion. For the deletion case, we can
      easily support any node deletion by keeping track of the previous
      entry.
      
      This shrinks io_wq_work to 32 bytes, and subsequently io_kiocb in
      io_uring from 216 to 208 bytes.
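      
      The slimmer list types this implies (a sketch; an 8-byte singly linked
      node replaces the 16-byte doubly linked list_head):
      
        struct io_wq_work_node {
                struct io_wq_work_node *next;
        };
        
        struct io_wq_work_list {
                struct io_wq_work_node *first;
                struct io_wq_work_node *last;
        };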
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  24. 26 November 2019, 3 commits
  25. 14 November 2019, 1 commit
    • io_wq: add get/put_work handlers to io_wq_create() · 7d723065
      Jens Axboe authored
      For cancellation, we need to ensure that the work item stays valid for
      as long as ->cur_work is valid. Right now we can't safely dereference
      the work item even under the wqe->lock, because while the ->cur_work
      pointer will remain valid, the work could be completing and be freed
      in parallel.
      
      Only invoke ->get/put_work() on items we know the caller queued
      themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed
      when we're queueing a flush item, for instance.
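      
      The creation interface after this change might look like the following
      sketch (the exact parameter list is an assumption):
      
        typedef void (get_work_fn)(struct io_wq_work *);
        typedef void (put_work_fn)(struct io_wq_work *);
        
        struct io_wq *io_wq_create(unsigned bounded, struct mm_struct *mm,
                                   struct user_struct *user,
                                   get_work_fn *get_work, put_work_fn *put_work);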
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  26. 12 November 2019, 1 commit
  27. 08 November 2019, 1 commit
    • io-wq: add support for bounded vs unbounded work · c5def4ab
      Jens Axboe authored
      io_uring supports request types that basically have two different
      lifetimes:
      
      1) Bounded completion time. These are requests like disk reads or writes,
         which we know will finish in a finite amount of time.
      2) Unbounded completion time. These are generally networked IO, where we
         have no idea how long they will take to complete. Another example is
         POLL commands.
      
      This patch provides support for io-wq to handle these differently, so we
      don't starve bounded requests by tying up workers for too long. By default
      all work is bounded, unless otherwise specified in the work item.
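      
      A sketch of the split accounting this implies (field and flag names
      are assumptions based on the description):
      
        enum {
                IO_WQ_ACCT_BOUND,
                IO_WQ_ACCT_UNBOUND,
        };
        
        struct io_wqe_acct {
                unsigned nr_workers;
                unsigned max_workers;   /* capped separately per pool */
                atomic_t nr_running;
        };
        
        /* callers opt out of the bounded default on the item itself,
         * e.g. work->flags |= IO_WQ_WORK_UNBOUND; */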
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  28. 01 November 2019, 1 commit
    • io_uring: support for generic async request cancel · 62755e35
      Jens Axboe authored
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
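      
      From userspace, filling the SQE by hand per these rules might look
      like this sketch (assumes liburing, an initialized 'ring', and the
      target's 'target_user_data' in scope; error handling elided):
      
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_ASYNC_CANCEL;
        sqe->addr = target_user_data;   /* user_data of the victim request */
        sqe->user_data = 0x1234;        /* identifies this cancel request */
        io_uring_submit(&ring);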
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  29. 30 October 2019, 2 commits
    • io_uring: add support for async work inheriting files · fcb323cc
      Jens Axboe authored
      This is in preparation for adding opcodes that need to add new files
      to a process file table, for system calls like open(2) or accept4(2).
      
      If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the
      work item. If work that needs to get punted to async context has this
      set, the async worker will assume the original task's file table before
      executing the work.
      
      Note that opcodes that need access to the current files of an
      application cannot be done through IORING_SETUP_SQPOLL.
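      
      Worker-side sketch of adopting the submitting task's file table
      (simplified; the restore on completion is symmetric):
      
        if (work->flags & IO_WQ_WORK_NEEDS_FILES) {
                struct files_struct *old_files = current->files;
        
                task_lock(current);
                current->files = work->files;   /* borrow the original table */
                task_unlock(current);
                /* ... run the work, then restore old_files the same way ... */
        }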
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: small threadpool implementation for io_uring · 771b53d0
      Jens Axboe authored
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the lifetime of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
      
      This list is just off the top of my head. Performance should be the
      same or better; at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
      io-wq hooks into the scheduler's schedule in/out events, just like
      workqueue does, and uses that to drive the need for more or fewer
      workers; a rough sketch follows.
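      
      A rough sketch of one of those hooks (names of the worker-tracking
      fields, flags, and helpers are assumptions):
      
        /* called by the scheduler when an io-wq worker blocks */
        void io_wq_worker_sleeping(struct task_struct *tsk)
        {
                struct io_worker *worker = kthread_data(tsk);
                struct io_wqe *wqe = worker->wqe;
        
                if (!(worker->flags & IO_WORKER_F_RUNNING))
                        return;
                worker->flags &= ~IO_WORKER_F_RUNNING;
        
                spin_lock_irq(&wqe->lock);
                /* one less runnable worker; maybe wake or create another */
                io_wqe_dec_running(wqe, worker);
                spin_unlock_irq(&wqe->lock);
        }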
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>