1. 03 Mar, 2020 4 commits
  2. 02 Mar, 2020 1 commit
  3. 25 Feb, 2020 1 commit
    • io-wq: remove spin-for-work optimization · 3030fd4c
      Jens Axboe authored
      Andres reports that buffered IO seems to suck up more cycles than we
      would like, and he narrowed it down to the fact that the io-wq workers
      will briefly spin for more work on completion of a work item. This was
      a win on the networking side, but apparently some other cases take a
      hit because of it. Remove the optimization to avoid burning more CPU
      than we have to for disk IO.
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3030fd4c
  4. 13 Feb, 2020 1 commit
    • io-wq: don't call kXalloc_node() with non-online node · 7563439a
      Jens Axboe authored
      Glauber reports a crash on init on a box he has:
      
       RIP: 0010:__alloc_pages_nodemask+0x132/0x340
       Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
       RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
       RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
       RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002
       R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000
       R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0
       FS:  00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        alloc_slab_page+0x46/0x320
        new_slab+0x9d/0x4e0
        ___slab_alloc+0x507/0x6a0
        ? io_wq_create+0xb4/0x2a0
        __slab_alloc+0x1c/0x30
        kmem_cache_alloc_node_trace+0xa6/0x260
        io_wq_create+0xb4/0x2a0
        io_uring_setup+0x97f/0xaa0
        ? io_remove_personalities+0x30/0x30
        ? io_poll_trigger_evfd+0x30/0x30
        do_syscall_64+0x5b/0x1c0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f4d116cb1ed
      
      which is due to the 'wqe' and 'worker' allocation being node affine.
      But it isn't valid to call the node affine allocation if the node isn't
      online.
      
      Setup structures for even offline nodes, as usual, but skip them in
      terms of thread setup to not waste resources. If the node isn't online,
      just alloc memory with NUMA_NO_NODE.
      Reported-by: Glauber Costa <glauber@scylladb.com>
      Tested-by: Glauber Costa <glauber@scylladb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7563439a
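The fallback described in this commit can be sketched in userspace C. This is only an illustration under stated assumptions: `sketch_node_online()` and `sketch_alloc_node()` are hypothetical stand-ins (the kernel uses its own `node_online()` and node-affine allocators), and the online state is modelled as a plain bitmask.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)  /* the kernel also uses -1 for "no node preference" */
#define SKETCH_MAX_NODES 8 /* hypothetical upper bound for this sketch */

/* Hypothetical stand-in for node_online(): bit n set means node n is online. */
static bool sketch_node_online(unsigned long online_mask, int node)
{
    assert(node >= 0 && node < SKETCH_MAX_NODES);
    return (online_mask >> node) & 1UL;
}

/* Choose the node ID to hand to a node-affine allocator:
 * the node itself if it is online, otherwise NUMA_NO_NODE. */
static int sketch_alloc_node(unsigned long online_mask, int node)
{
    return sketch_node_online(online_mask, node) ? node : NUMA_NO_NODE;
}
```

With an online mask of `0x5` (nodes 0 and 2 online), a request for node 1 falls back to `NUMA_NO_NODE`, while nodes 0 and 2 keep their affinity.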
  5. 10 Feb, 2020 2 commits
  6. 09 Feb, 2020 1 commit
  7. 30 Jan, 2020 1 commit
    • io_uring: fix linked command file table usage · f86cd20c
      Jens Axboe authored
      We're not consistent in how the file table is grabbed and assigned if we
      have a command linked that requires the use of it.
      
      Add ->file_table to the io_op_defs[] array, and use that to determine
      when to grab the table instead of having the handlers set it if they
      need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
      flag. We always initialize work->files, so io-wq can just check for
      that.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f86cd20c
  8. 29 Jan, 2020 2 commits
  9. 28 Jan, 2020 1 commit
  10. 21 Jan, 2020 2 commits
    • io-wq: support concurrent non-blocking work · 895e2ca0
      Jens Axboe authored
      io-wq assumes that work will complete fast (and not block), so it
      doesn't create a new worker when work is enqueued, if we already have
      at least one worker running. This is done on the assumption that if work
      is running, then it will complete fast.
      
      Add an option to force io-wq to fork a new worker for work queued. This
      is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that
      case, io-wq will create a new worker, even though workers are already
      running.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      895e2ca0
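The worker-creation decision described in this commit can be sketched as a small predicate. This is a simplified model, not the io-wq code: the flag value and the function name `sketch_need_new_worker()` are assumptions made for illustration, and only the flag's role is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit value; only the flag's meaning comes from the commit text. */
#define SKETCH_WORK_CONCURRENT (1u << 0)

/* Decide whether enqueueing a work item should create a new worker.
 * Default policy: only when no worker is running, on the assumption that
 * running work completes fast. Work flagged as concurrent always gets one. */
static bool sketch_need_new_worker(unsigned int running_workers,
                                   unsigned int work_flags)
{
    if (work_flags & SKETCH_WORK_CONCURRENT)
        return true;
    return running_workers == 0;
}
```

For example, with three workers already running, unflagged work reuses them, while work carrying the concurrent flag still requests a new worker.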
    • io-wq: add support for uncancellable work · 0c9d5ccd
      Jens Axboe authored
      Not all work can be cancelled; some of it we may need to guarantee
      runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
      on work that must not be cancelled. Note that the caller work function
      must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
      IO_WQ_WORK_CANCEL.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0c9d5ccd
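The check that a caller's work function is expected to make can be sketched as follows. The bit values and the helper name `sketch_should_cancel()` are hypothetical; only the relationship between the two flags comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values standing in for the two io-wq work flags. */
#define SKETCH_WORK_CANCEL    (1u << 0)
#define SKETCH_WORK_NO_CANCEL (1u << 1)

/* A work function should honour a cancel request only when the item
 * has not also been marked as uncancellable. */
static bool sketch_should_cancel(unsigned int work_flags)
{
    return (work_flags & SKETCH_WORK_CANCEL) &&
           !(work_flags & SKETCH_WORK_NO_CANCEL);
}
```

An item carrying both flags therefore runs to completion, which is the guarantee the commit is after.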
  11. 15 Jan, 2020 1 commit
  12. 25 Dec, 2019 1 commit
  13. 23 Dec, 2019 1 commit
  14. 16 Dec, 2019 1 commit
  15. 11 Dec, 2019 2 commits
  16. 02 Dec, 2019 1 commit
    • io_uring: use current task creds instead of allocating a new one · 0b8c0ec7
      Jens Axboe authored
      syzbot reports:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
        kthread+0x361/0x430 kernel/kthread.c:255
        ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      Modules linked in:
      ---[ end trace f2e1a4307fbe2245 ]---
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      which is caused by slab fault injection triggering a failure in
      prepare_creds(). We don't actually need to create a copy of the creds
      as we're not modifying it, we just need a reference on the current task
      creds. This avoids the failure case as well, and propagates the const
      throughout the stack.
      
      Fixes: 181e448d ("io_uring: async workers should inherit the user creds")
      Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0b8c0ec7
  17. 27 Nov, 2019 3 commits
    • io-wq: shrink io_wq_work a bit · 6206f0e1
      Jens Axboe authored
      Currently we're using 40 bytes for the io_wq_work structure, and 16 of
      those are the doubly linked list node. We don't need doubly linked lists:
      we always add to the tail to keep things ordered, and every other use case
      is list traversal with deletion. For the deletion case, we can easily
      support deleting any node by keeping track of the previous entry.
      
      This shrinks io_wq_work to 32 bytes, and subsequently io_kiocb in
      io_uring from 216 to 208 bytes.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6206f0e1
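The list shape this commit describes, tail insertion plus deletion that relies on a tracked previous entry, can be sketched in plain C. All names here (`sketch_node`, `sketch_list`, and the helpers) are hypothetical and are not the io-wq definitions.

```c
#include <assert.h>
#include <stddef.h>

/* One pointer per node instead of two: 8 bytes rather than 16 on 64-bit. */
struct sketch_node {
    struct sketch_node *next;
};

struct sketch_list {
    struct sketch_node *first;
    struct sketch_node *last;
};

/* Append at the tail so queue order is preserved. */
static void sketch_add_tail(struct sketch_list *list, struct sketch_node *node)
{
    node->next = NULL;
    if (list->last)
        list->last->next = node;
    else
        list->first = node;
    list->last = node;
}

/* Unlink node; prev is the entry before it, or NULL if node is the head. */
static void sketch_del(struct sketch_list *list, struct sketch_node *node,
                       struct sketch_node *prev)
{
    if (prev)
        prev->next = node->next;
    else
        list->first = node->next;
    if (list->last == node)
        list->last = prev;
    node->next = NULL;
}

/* Walk the list, remembering the previous entry, and remove target.
 * Returns 1 if the node was found and removed, 0 otherwise. */
static int sketch_remove(struct sketch_list *list, struct sketch_node *target)
{
    struct sketch_node *prev = NULL;
    struct sketch_node *cur;

    for (cur = list->first; cur; prev = cur, cur = cur->next) {
        if (cur == target) {
            sketch_del(list, cur, prev);
            return 1;
        }
    }
    return 0;
}
```

Because every deletion happens during a traversal, the previous entry is always at hand, so the back pointer of a doubly linked list buys nothing here.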
    • io-wq: fix handling of NUMA node IDs · 3fc50ab5
      Jann Horn authored
      There are several things that can go wrong in the current code on NUMA
      systems, especially if not all nodes are online all the time:
      
       - If the identifiers of the online nodes do not form a single contiguous
         block starting at zero, wq->wqes will be too small, and OOB memory
         accesses will occur e.g. in the loop in io_wq_create().
       - If a node comes online between the call to num_online_nodes() and the
         for_each_node() loop in io_wq_create(), an OOB write will occur.
       - If a node comes online between io_wq_create() and io_wq_enqueue(), a
         lookup is performed for an element that doesn't exist, and an OOB read
         will probably occur.
      
      Fix it by:
      
       - using nr_node_ids instead of num_online_nodes() for the allocation size;
         nr_node_ids is calculated by setup_nr_node_ids() to be bigger than the
         highest node ID that could possibly come online at some point, even if
         those nodes' identifiers are not a contiguous block
       - creating workers for all possible CPUs, not just all online ones
      
      This is basically what the normal workqueue code also does, as far as I can
      tell.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3fc50ab5
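The sizing problem fixed by this commit can be illustrated with a bitmask model of node IDs. The helpers below are hypothetical stand-ins written for this sketch; they only mimic the roles of `num_online_nodes()` and `nr_node_ids` and are not kernel code.

```c
#include <assert.h>

/* Number of set bits: plays the role of num_online_nodes(). */
static unsigned int sketch_num_online(unsigned long online_mask)
{
    unsigned int n = 0;

    for (; online_mask; online_mask &= online_mask - 1)
        n++;
    return n;
}

/* Highest set bit plus one: plays the role of nr_node_ids, i.e. an
 * array of this size can be indexed by any node ID present in the mask. */
static unsigned int sketch_nr_node_ids(unsigned long possible_mask)
{
    unsigned int n = 0;

    while (possible_mask) {
        n++;
        possible_mask >>= 1;
    }
    return n;
}
```

With nodes 0 and 2 present (mask `0x5`), the online count is 2 but the highest ID is 2, so an array sized by the count would be indexed out of bounds at `[2]`; sizing by the second helper gives 3 slots.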
    • io_uring: use kzalloc instead of kcalloc for single-element allocations · ad6e005c
      Jann Horn authored
      These allocations are single-element allocations, so don't use the array
      allocation wrapper for them.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ad6e005c
  18. 26 Nov, 2019 5 commits
  19. 14 Nov, 2019 4 commits
    • io-wq: remove now redundant struct io_wq_nulls_list · 021d1cdd
      Jens Axboe authored
      Since we don't iterate these lists anymore after commit:
      
      e61df66c ("io-wq: ensure free/busy list browsing see all items")
      
      we don't need to retain the nulls value we use for them. That means it's
      pretty pointless to wrap the hlist_nulls_head in a structure, so get rid
      of it.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      021d1cdd
    • io-wq: ensure free/busy list browsing see all items · e61df66c
      Jens Axboe authored
      We have two lists for workers in io-wq, a busy and a free list. For
      certain operations we want to browse all workers, and we currently do
      that by browsing the two separate lists. But since these lists are RCU
      protected, we can potentially miss workers if they move between the two
      lists while we're browsing them.
      
      Add a third list, all_list, that simply holds all workers. A worker is
      added to that list when it starts, and removed when it exits. This makes
      the worker iteration cleaner, too.
      Reported-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e61df66c
    • io-wq: ensure we have a stable view of ->cur_work for cancellations · 36c2f922
      Jens Axboe authored
      worker->cur_work is currently protected by the lock of the wqe that the
      worker belongs to. When we send a signal to a worker, we need a stable
      view of ->cur_work, so we need to hold that lock. But this doesn't work
      so well, since we have the opposite order potentially on queueing work.
      If POLL_ADD is used with a signalfd, then io_poll_wake() is called with
      the signal lock, and that sometimes needs to insert work items.
      
      Add a specific worker lock that protects the current work item. Then we
      can guarantee that the task we're sending a signal is currently
      processing the exact work we think it is.
      Reported-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      36c2f922
    • io_wq: add get/put_work handlers to io_wq_create() · 7d723065
      Jens Axboe authored
      For cancellation, we need to ensure that the work item stays valid for
      as long as ->cur_work is valid. Right now we can't safely dereference
      the work item even under the wqe->lock, because while the ->cur_work
      pointer will remain valid, the work could be completing and be freed
      in parallel.
      
      Only invoke ->get/put_work() on items we know that the caller queued
      themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed
      when we're queueing a flush item, for instance.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7d723065
  20. 08 Nov, 2019 1 commit
    • io-wq: add support for bounded vs unbunded work · c5def4ab
      Jens Axboe authored
      io_uring supports request types that basically have two different
      lifetimes:
      
      1) Bounded completion time. These are requests like disk reads or writes,
         which we know will finish in a finite amount of time.
      2) Unbounded completion time. These are generally networked IO, where we
         have no idea how long they will take to complete. Another example is
         POLL commands.
      
      This patch provides support for io-wq to handle these differently, so we
      don't starve bounded requests by tying up workers for too long. By default
      all work is bounded, unless otherwise specified in the work item.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c5def4ab
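One way to picture the split described in this commit is separate worker accounting per class, so that long-running unbounded work cannot use up the bounded budget. The structure and names below (`sketch_acct`, `sketch_wq`, `SKETCH_WORK_UNBOUND`) are assumptions made for this sketch and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag: work is treated as bounded unless this bit is set. */
#define SKETCH_WORK_UNBOUND (1u << 0)

enum { SKETCH_ACCT_BOUND, SKETCH_ACCT_UNBOUND, SKETCH_ACCT_NR };

/* Per-class worker budget. */
struct sketch_acct {
    unsigned int nr_workers;
    unsigned int max_workers;
};

struct sketch_wq {
    struct sketch_acct acct[SKETCH_ACCT_NR];
};

/* Select the accounting class for a work item from its flags. */
static struct sketch_acct *sketch_work_acct(struct sketch_wq *wq,
                                            unsigned int work_flags)
{
    int idx = (work_flags & SKETCH_WORK_UNBOUND) ? SKETCH_ACCT_UNBOUND
                                                 : SKETCH_ACCT_BOUND;
    return &wq->acct[idx];
}

/* Try to account one more worker for this item's class.
 * Fails only when that class has reached its own limit. */
static bool sketch_try_add_worker(struct sketch_wq *wq, unsigned int work_flags)
{
    struct sketch_acct *acct = sketch_work_acct(wq, work_flags);

    if (acct->nr_workers >= acct->max_workers)
        return false;
    acct->nr_workers++;
    return true;
}
```

In this model, exhausting the unbounded limit leaves the bounded class untouched, so a disk read can still get a worker while slow network requests are pending.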
  21. 10 Nov, 2019 1 commit
  22. 06 Nov, 2019 1 commit
  23. 02 Nov, 2019 1 commit
  24. 01 Nov, 2019 1 commit
    • io_uring: support for generic async request cancel · 62755e35
      Jens Axboe authored
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      62755e35
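The three completion results the commit message lists for the cancel request itself can be decoded with a small helper. This is a sketch for illustration: the enum and `sketch_cancel_outcome()` are hypothetical, and only the result values (0, -EALREADY, -ENOENT) come from the commit message.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification of a cancel request's completion result. */
enum sketch_cancel_outcome {
    SKETCH_CANCEL_DONE,      /* target cancelled; it completes with -ECANCELED */
    SKETCH_CANCEL_RUNNING,   /* target was already running; it may or may not fail */
    SKETCH_CANCEL_NOT_FOUND, /* no request with the given user_data was found */
    SKETCH_CANCEL_OTHER      /* any result not described in the commit message */
};

/* Map the result of the cancel request's completion to an outcome. */
static enum sketch_cancel_outcome sketch_cancel_outcome(int res)
{
    switch (res) {
    case 0:
        return SKETCH_CANCEL_DONE;
    case -EALREADY:
        return SKETCH_CANCEL_RUNNING;
    case -ENOENT:
        return SKETCH_CANCEL_NOT_FOUND;
    default:
        return SKETCH_CANCEL_OTHER;
    }
}
```

A caller would apply this to the result of the cancel request's completion, and separately check whether the original request finished with -ECANCELED.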