1. 29 6月, 2020 2 次提交
  2. 04 6月, 2020 9 次提交
    • J
      io_uring: use io-wq manager as backup task if task is exiting · 38954954
      Jens Axboe 提交于
      to #28170604
      
      commit aa96bf8a9ee33457b7e3ea43e97dfa1e3a15ab20 upstream
      
      If the original task is (or has) exited, then the task work will not get
      queued properly. Allow for using the io-wq manager task to queue this
      work for execution, and ensure that the io-wq manager notices and runs
      this work if woken up (or exiting).
      Reported-by: NDan Melnic <dmm@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      38954954
    • P
      io-wq: handle hashed writes in chains · 56986bc4
      Pavel Begunkov 提交于
      to #28170604
      
      commit 86f3cd1b589a10dbdca98c52cc0cd0f56523c9b3 upstream
      
      We always punt async buffered writes to an io-wq helper, as the core
      kernel does not have IOCB_NOWAIT support for that. Most buffered async
      writes complete very quickly, as it's just a copy operation. This means
      that doing multiple locking roundtrips on the shared wqe lock for each
      buffered write is wasteful. Additionally, buffered writes are hashed
      work items, which means that any buffered write to a given file is
      serialized.
      
      Keep identicaly hashed work items contiguously in @wqe->work_list, and
      track a tail for each hash bucket. On dequeue of a hashed item, splice
      all of the same hash in one go using the tracked tail. Until the batch
      is done, the caller doesn't have to synchronize with the wqe or worker
      locks again.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      56986bc4
    • P
      io_uring: Fix ->data corruption on re-enqueue · fd7ce40a
      Pavel Begunkov 提交于
      to #28170604
      
      commit 18a542ff19ad149fac9e5a36a4012e3cac7b3b3b upstream
      
      work->data and work->list are shared in union. io_wq_assign_next() sets
      ->data if a req having a linked_timeout, but then io-wq may want to use
      work->list, e.g. to do re-enqueue of a request, so corrupting ->data.
      
      ->data is not necessary, just remove it and extract linked_timeout
      through @Link_list.
      
      Fixes: 60cf46ae6054 ("io-wq: hash dependent work")
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      fd7ce40a
    • P
      io-wq: split hashing and enqueueing · 9fc5877e
      Pavel Begunkov 提交于
      to #28170604
      
      commit 8766dd516c535abf04491dca674d0ef6c95d814f upstream
      
      It's a preparation patch removing io_wq_enqueue_hashed(), which
      now should be done by io_wq_hash_work() + io_wq_enqueue().
      
      Also, set hash value for dependant works, and do it as late as possible,
      because req->file can be unavailable before. This hash will be ignored
      by io-wq.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      9fc5877e
    • P
      io_uring/io-wq: forward submission ref to async · ab800df8
      Pavel Begunkov 提交于
      to #28170604
      
      commit e9fd939654f17651ff65e7e55aa6934d29eb4335 upstream
      
      First it changes io-wq interfaces. It replaces {get,put}_work() with
      free_work(), which guaranteed to be called exactly once. It also enforces
      free_work() callback to be non-NULL.
      
      io_uring follows the changes and instead of putting a submission reference
      in io_put_req_async_completion(), it will be done in io_free_work(). As
      removes io_get_work() with corresponding refcount_inc(), the ref balance
      is maintained.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      ab800df8
    • P
      io_uring: remove IO_WQ_WORK_CB · e7c00972
      Pavel Begunkov 提交于
      to #28170604
      
      commit 5eae8619907a1389dbd1b4a1049caf52782c0916 upstream
      
      IO_WQ_WORK_CB is used only for linked timeouts, which will be armed
      before the work setup (i.e. mm, override creds, etc). The setup
      shouldn't take long, so it's ok to arm it a bit later and get rid
      of IO_WQ_WORK_CB.
      
      Make io-wq call work->func() only once, callbacks will handle the rest.
      i.e. the linked timeout handler will do the actual issue. And as a
      bonus, it removes an extra indirect call.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e7c00972
    • P
      io-wq: remove unused IO_WQ_WORK_HAS_MM · f7919d00
      Pavel Begunkov 提交于
      to #28170604
      
      commit e85530ddda4f08d4f9ed6506d4a1f42e086e3b21 upstream
      
      IO_WQ_WORK_HAS_MM is set but never used, remove it.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f7919d00
    • J
      io-wq: ensure work->task_pid is cleared on init · 7c718725
      Jens Axboe 提交于
      to #28170604
      
      commit 2d141dd2caa78fbaf87b57c27769bdc14975ab3d upstream
      
      We use ->task_pid for exit cancellation, but we need to ensure it's
      cleared to zero for io_req_work_grab_env() to do the right thing. Take
      a suggestion from Bart and clear the whole thing, just setting the
      function passed in. This makes it more future proof as well.
      
      Fixes: 36282881a795 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      7c718725
    • J
      io-wq: clear node->next on list deletion · 5a60c6cc
      Jens Axboe 提交于
      to #28170604
      
      commit 08bdcc35f00c91b325195723cceaba0937a89ddf upstream
      
      If someone removes a node from a list, and then later adds it back to
      a list, we can have invalid data in ->next. This can cause all sorts
      of issues. One such use case is the IORING_OP_POLL_ADD command, which
      will do just that if we race and get woken twice without any pending
      events. This is a pretty rare case, but can happen under extreme loads.
      Dan reports that he saw the following crash:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000000
      PGD d283ce067 P4D d283ce067 PUD e5ca04067 PMD 0
      Oops: 0002 [#1] SMP
      CPU: 17 PID: 10726 Comm: tao:fast-fiber Kdump: loaded Not tainted 5.2.9-02851-gac7bc042d2d1 #116
      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
      RIP: 0010:io_wqe_enqueue+0x3e/0xd0
      Code: 34 24 74 55 8b 47 58 48 8d 6f 50 85 c0 74 50 48 89 df e8 35 7c 75 00 48 83 7b 08 00 48 8b 14 24 0f 84 84 00 00 00 48 8b 4b 10 <48> 89 11 48 89 53 10 83 63 20 fe 48 89 c6 48 89 df e8 0c 7a 75 00
      RSP: 0000:ffffc90006858a08 EFLAGS: 00010082
      RAX: 0000000000000002 RBX: ffff889037492fc0 RCX: 0000000000000000
      RDX: ffff888e40cc11a8 RSI: ffff888e40cc11a8 RDI: ffff889037492fc0
      RBP: ffff889037493010 R08: 00000000000000c3 R09: ffffc90006858ab8
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff888e40cc11a8
      R13: 0000000000000000 R14: 00000000000000c3 R15: ffff888e40cc1100
      FS:  00007fcddc9db700(0000) GS:ffff88903fa40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000e479f5003 CR4: 00000000007606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <IRQ>
       io_poll_wake+0x12f/0x2a0
       __wake_up_common+0x86/0x120
       __wake_up_common_lock+0x7a/0xc0
       sock_def_readable+0x3c/0x70
       tcp_rcv_established+0x557/0x630
       tcp_v6_do_rcv+0x118/0x3c0
       tcp_v6_rcv+0x97e/0x9d0
       ip6_protocol_deliver_rcu+0xe3/0x440
       ip6_input+0x3d/0xc0
       ? ip6_protocol_deliver_rcu+0x440/0x440
       ipv6_rcv+0x56/0xd0
       ? ip6_rcv_finish_core.isra.18+0x80/0x80
       __netif_receive_skb_one_core+0x50/0x70
       netif_receive_skb_internal+0x2f/0xa0
       napi_gro_receive+0x125/0x150
       mlx5e_handle_rx_cqe+0x1d9/0x5a0
       ? mlx5e_poll_tx_cq+0x305/0x560
       mlx5e_poll_rx_cq+0x49f/0x9c5
       mlx5e_napi_poll+0xee/0x640
       ? smp_reschedule_interrupt+0x16/0xd0
       ? reschedule_interrupt+0xf/0x20
       net_rx_action+0x286/0x3d0
       __do_softirq+0xca/0x297
       irq_exit+0x96/0xa0
       do_IRQ+0x54/0xe0
       common_interrupt+0xf/0xf
       </IRQ>
      RIP: 0033:0x7fdc627a2e3a
      Code: 31 c0 85 d2 0f 88 f6 00 00 00 55 48 89 e5 41 57 41 56 4c 63 f2 41 55 41 54 53 48 83 ec 18 48 85 ff 0f 84 c7 00 00 00 48 8b 07 <41> 89 d4 49 89 f5 48 89 fb 48 85 c0 0f 84 64 01 00 00 48 83 78 10
      
      when running a networked workload with about 5000 sockets being polled
      for. Fix this by clearing node->next when the node is being removed from
      the list.
      
      Fixes: 6206f0e180d4 ("io-wq: shrink io_wq_work a bit")
      Reported-by: NDan Melnic <dmm@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5a60c6cc
  3. 28 5月, 2020 8 次提交
  4. 27 5月, 2020 3 次提交
  5. 02 4月, 2020 1 次提交
    • J
      io_uring: use current task creds instead of allocating a new one · 311b786d
      Jens Axboe 提交于
      fix #26374723
      
      commit 0b8c0ec7eedcd8f9f1a1f238d87f9b512b09e71a upstream.
      
      syzbot reports:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 9217 Comm: io_uring-sq Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        io_sq_thread+0x1c7/0xa20 fs/io_uring.c:3274
        kthread+0x361/0x430 kernel/kthread.c:255
        ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      Modules linked in:
      ---[ end trace f2e1a4307fbe2245 ]---
      RIP: 0010:creds_are_invalid kernel/cred.c:792 [inline]
      RIP: 0010:__validate_creds include/linux/cred.h:187 [inline]
      RIP: 0010:override_creds+0x9f/0x170 kernel/cred.c:550
      Code: ac 25 00 81 fb 64 65 73 43 0f 85 a3 37 00 00 e8 17 ab 25 00 49 8d 7c
      24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e 96 00 00 00 41 8b 5c 24 10 bf
      RSP: 0018:ffff88809c45fda0 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000043736564 RCX: ffffffff814f3318
      RDX: 0000000000000002 RSI: ffffffff814f3329 RDI: 0000000000000010
      RBP: ffff88809c45fdb8 R08: ffff8880a3aac240 R09: ffffed1014755849
      R10: ffffed1014755848 R11: ffff8880a3aac247 R12: 0000000000000000
      R13: ffff888098ab1600 R14: 0000000000000000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffd51c40664 CR3: 0000000092641000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      which is caused by slab fault injection triggering a failure in
      prepare_creds(). We don't actually need to create a copy of the creds
      as we're not modifying it, we just need a reference on the current task
      creds. This avoids the failure case as well, and propagates the const
      throughout the stack.
      
      Fixes: 181e448d8709 ("io_uring: async workers should inherit the user creds")
      Reported-by: syzbot+5320383e16029ba057ff@syzkaller.appspotmail.com
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      311b786d
  6. 19 3月, 2020 3 次提交
  7. 18 3月, 2020 5 次提交
    • J
      io_uring: use correct "is IO worker" helper · cf8c3e33
      Jens Axboe 提交于
      commit 960e432dfa5927892a9b170d14de874597b84849 upstream.
      
      Since we switched to io-wq, the dependent link optimization for when to
      pass back work inline has been broken. Fix this by providing a suitable
      io-wq helper for io_uring to use to detect when to do this.
      
      Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      cf8c3e33
    • J
      io-wq: add support for bounded vs unbunded work · 43cfd4cb
      Jens Axboe 提交于
      commit c5def4ab849494d3c97f6c9fc84b2ddb868fe78c upstream.
      
      io_uring supports request types that basically have two different
      lifetimes:
      
      1) Bounded completion time. These are requests like disk reads or writes,
         which we know will finish in a finite amount of time.
      2) Unbounded completion time. These are generally networked IO, where we
         have no idea how long they will take to complete. Another example is
         POLL commands.
      
      This patch provides support for io-wq to handle these differently, so we
      don't starve bounded requests by tying up workers for too long. By default
      all work is bounded, unless otherwise specified in the work item.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      43cfd4cb
    • J
      io_uring: support for generic async request cancel · e025ee0c
      Jens Axboe 提交于
      commit 62755e35dfb2b113c52b81cd96d01c20971c8e02 upstream.
      
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e025ee0c
    • J
      io_uring: io_uring: add support for async work inheriting files · 44b96780
      Jens Axboe 提交于
      commit fcb323cc53e29d9cc696d606bb42736b32dd9825 upstream.
      
      This is in preparation for adding opcodes that need to add new files
      in a process file table, system calls like open(2) or accept4(2).
      
      If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
      item. If work that needs to get punted to async context have this
      set, the async worker will assume the original task file table before
      executing the work.
      
      Note that opcodes that need access to the current files of an
      application cannot be done through IORING_SETUP_SQPOLL.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      44b96780
    • J
      io-wq: small threadpool implementation for io_uring · 8a308e54
      Jens Axboe 提交于
      commit 771b53d033e8663abdf59704806aa856b236dcdb upstream.
      
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the life time of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
      
      This list is just off the top of my head. Performance should be the
      same, or better, at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
      io-wq hooks up to the scheduler schedule in/out just like workqueue
      and uses that to drive the need for more/less workers.
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      [Joseph: Cherry-pick allow_kernel_signal() from upstream commit 33da8e7c814f]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      8a308e54