1. 03 Sep, 2020 1 commit
  2. 02 Sep, 2020 1 commit
  3. 01 Sep, 2020 1 commit
  4. 28 Aug, 2020 2 commits
  5. 27 Aug, 2020 2 commits
  6. 26 Aug, 2020 3 commits
  7. 25 Aug, 2020 1 commit
  8. 24 Aug, 2020 1 commit
    • io_uring: don't recurse on tsk->sighand->siglock with signalfd · fd7d6de2
      Jens Axboe committed
      If an application is doing reads on signalfd, and we arm the poll handler
      because there's no data available, then the wakeup can recurse on the
      task's sighand->siglock, as the signal delivery from task_work_add() will
      use TWA_SIGNAL and that attempts to lock it again.
      
      We can detect the signalfd case pretty easily by comparing the poll->head
      wait_queue_head_t with the target task signalfd wait queue. Just use
      normal task wakeup for this case.
      
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fd7d6de2
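The detection described above boils down to a pointer comparison. The following is a minimal userspace sketch of that idea, not the kernel code: the struct layouts, the `notify_mode` enum and `pick_notify_mode()` are hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields used here. */
struct wait_queue_head { int unused; };
struct sighand { struct wait_queue_head signalfd_wqh; };
struct task { struct sighand *sighand; };
struct poll_entry { struct wait_queue_head *head; };

enum notify_mode { NOTIFY_PLAIN_WAKEUP, NOTIFY_SIGNAL };

/*
 * Pick the wakeup mode for a poll completion. If the poll entry is queued
 * on the target task's own signalfd wait queue, signal-style notification
 * would try to take siglock again, so use a plain wakeup instead.
 */
enum notify_mode pick_notify_mode(const struct poll_entry *poll,
                                  const struct task *tsk)
{
    assert(poll != NULL && tsk != NULL && tsk->sighand != NULL);

    bool is_signalfd = (poll->head == &tsk->sighand->signalfd_wqh);

    return is_signalfd ? NOTIFY_PLAIN_WAKEUP : NOTIFY_SIGNAL;
}
```

Comparing addresses is cheap and needs no extra state on the request, which is presumably why the commit calls the case easy to detect.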
  9. 20 Aug, 2020 4 commits
    • io_uring: kill extra iovec=NULL in import_iovec() · 867a23ea
      Pavel Begunkov committed
      If io_import_iovec() returns an error, the returned iovec is undefined
      and must not be used, so don't set it to NULL when failing.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      867a23ea
    • io_uring: comment on kfree(iovec) checks · f261c168
      Pavel Begunkov committed
      kfree() handles NULL pointers well, but io_{read,write}() checks for
      NULL anyway for performance reasons. Leave a comment there for those who
      are tempted to patch it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f261c168
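The pattern being commented on can be shown in userspace with `free()`, which, like `kfree()`, accepts NULL. This is only a sketch: `release_buffer()`, `finish_io()` and the call counter are hypothetical names used to make the skipped call visible.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how often the out-of-line free path is taken (illustration only). */
unsigned long slow_free_calls;

/* Out-of-line release; only reached with a real allocation. */
void release_buffer(void *p)
{
    assert(p != NULL);  /* guaranteed by the caller's check below */
    slow_free_calls++;
    free(p);
}

/*
 * Hot-path cleanup: the NULL test is redundant for correctness, because
 * free(NULL) is a no-op, but it skips the function call in the common
 * case where no buffer was allocated.
 */
void finish_io(void *iovec)
{
    if (iovec)
        release_buffer(iovec);
}
```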
    • io_uring: fix racy req->flags modification · bb175342
      Pavel Begunkov committed
      Setting and clearing REQ_F_OVERFLOW in io_uring_cancel_files() and
      io_cqring_overflow_flush() are racy, because they might be called
      asynchronously.
      
      The REQ_F_OVERFLOW flag is only needed for files cancellation, so if it
      can be guaranteed that requests _currently_ marked inflight can't be
      overflowed, the problem is solved by removing the flag altogether.

      That's how the patch works: it removes the inflight status of a request
      in io_cqring_fill_event() whenever it should be thrown onto the
      CQ-overflow list. That's OK to do, because no opcode-specific handling
      can be done after io_cqring_fill_event(), the same assumption as with
      the "struct io_completion" patches.
      There is already a good place for such cleanups, which is
      io_clean_op(). A nice side effect of this is removing the inflight
      check from the hot path.
      
      Note on synchronisation: __io_cqring_fill_event() may now take two
      spinlocks simultaneously, completion_lock and inflight_lock. That's fine,
      because we never take them in the reverse order, and CQ-overflow of
      inflight requests shouldn't happen often.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bb175342
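The synchronisation note relies on a fixed lock order: the inner lock is only ever taken while the outer one is held, never the other way round. A rough userspace model using pthread mutexes in place of kernel spinlocks follows; `struct ring_ctx` and `overflow_one()` are hypothetical and reduce the request lists to counters.

```c
#include <assert.h>
#include <pthread.h>

struct ring_ctx {
    pthread_mutex_t completion_lock;  /* outer lock */
    pthread_mutex_t inflight_lock;    /* inner lock, taken under the outer */
    int inflight;                     /* requests marked inflight */
    int overflowed;                   /* requests on the overflow list */
};

/* Move one request from the inflight set to the overflow list. */
void overflow_one(struct ring_ctx *ctx)
{
    pthread_mutex_lock(&ctx->completion_lock);

    /* Always outer then inner; no path takes these in reverse order. */
    pthread_mutex_lock(&ctx->inflight_lock);
    assert(ctx->inflight > 0);
    ctx->inflight--;
    pthread_mutex_unlock(&ctx->inflight_lock);

    ctx->overflowed++;
    pthread_mutex_unlock(&ctx->completion_lock);
}
```

As long as every code path respects the same order, two threads cannot each hold one lock while waiting for the other.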
    • io_uring: use system_unbound_wq for ring exit work · fc666777
      Jens Axboe committed
      We currently use system_wq, which is unbounded in terms of number of
      workers. This means that if we're exiting tons of rings at the same
      time, then we'll briefly spawn tons of event kworkers just for a very
      short blocking time as the rings exit.
      
      Use system_unbound_wq instead, which has a sane cap on the concurrency
      level.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fc666777
  10. 19 Aug, 2020 1 commit
    • io_uring: cleanup io_import_iovec() of pre-mapped request · 8452fd0c
      Jens Axboe committed
      io_rw_prep_async() goes through a dance of clearing req->io, calling
      the iovec import, then re-setting req->io. Provide an internal helper
      that does the right thing without needing state tweaked to get there.
      
      This enables further cleanups in io_read, io_write, and
      io_resubmit_prep(), but that's left for another time.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8452fd0c
  11. 17 Aug, 2020 2 commits
    • io_uring: get rid of kiocb_wait_page_queue_init() · 3b2a4439
      Jens Axboe committed
      The 5.9 merge moved this function into io_uring, which means that we
      don't need to retain the generic nature of it. Clean up this part by
      removing redundant checks, and just inlining the small remainder in
      io_rw_should_retry().
      
      No functional changes in this patch.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3b2a4439
    • io_uring: find and cancel head link async work on files exit · b711d4ea
      Jens Axboe committed
      Commit f254ac04 ("io_uring: enable lookup of links holding inflight files")
      only handled two of the three head link cases we have. We also need to
      look up and cancel work that is blocked in io-wq if that work has a link
      that's holding a reference to the files structure.

      Put the "cancel head links that hold this request pending" logic into
      io_attempt_cancel(), which will go through the motions of finding and
      canceling head links that hold the current inflight files stable request
      pending.
      
      Cc: stable@vger.kernel.org
      Reported-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b711d4ea
  12. 16 Aug, 2020 2 commits
    • io_uring: short circuit -EAGAIN for blocking read attempt · f91daf56
      Jens Axboe committed
      One case was missed in the short IO retry handling, and that's hitting
      -EAGAIN on a blocking read attempt (e.g. from io-wq context). This is a
      problem on sockets that are marked as non-blocking when created: they
      don't carry any REQ_F_NOWAIT information to help us terminate them
      instead of perpetually retrying.
      
      Fixes: 227c0c96 ("io_uring: internally retry short reads")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f91daf56
    • io_uring: sanitize double poll handling · d4e7cd36
      Jens Axboe committed
      There's a bit of confusion on the matching pairs of poll vs double poll,
      depending on if the request is a pure poll (IORING_OP_POLL_ADD) or
      poll driven retry.
      
      Add io_poll_get_double() that returns the double poll waitqueue, if any,
      and io_poll_get_single() that returns the original poll waitqueue. With
      that, remove the argument to io_poll_remove_double().
      
      Finally ensure that wait->private is cleared once the double poll handler
      has run, so that remove knows it's already been seen.
      
      Cc: stable@vger.kernel.org # v5.8
      Reported-by: syzbot+7f617d4a9369028b8a2c@syzkaller.appspotmail.com
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d4e7cd36
  13. 14 Aug, 2020 2 commits
    • io_uring: internally retry short reads · 227c0c96
      Jens Axboe committed
      We've had a few application cases of not handling short reads properly,
      and it is understandable as short reads aren't really expected if the
      application isn't doing non-blocking IO.
      
      Now that we retain the iov_iter over retries, we can implement internal
      retry pretty trivially. This ensures that we don't return a short read,
      even for buffered reads on page cache conflicts.
      
      Clean up the deep nesting and hard-to-read nature of io_read() as well;
      it's much more straightforward to read and understand now. Added a
      few comments explaining the logic as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      227c0c96
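For comparison, this is the retry loop an application would otherwise need in userspace to cope with short reads. It is a sketch using plain `read(2)`, not io_uring, and `read_full()` is a hypothetical helper name. It keeps its position across retries, in the spirit of retaining the iov_iter state, and stops on EOF or on any error, including EAGAIN, rather than retrying forever.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to `len` bytes, retrying after short reads. Returns the number
 * of bytes read (less than `len` only at EOF or if an error hit after
 * some data arrived), or a negative errno if nothing could be read.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);

    while (done < len) {
        ssize_t ret = read(fd, p + done, len - done);

        if (ret > 0) {
            done += (size_t)ret;    /* short read: keep going */
            continue;
        }
        if (ret == 0)
            break;                  /* EOF */
        if (errno == EINTR)
            continue;               /* interrupted, safe to retry */
        /* EAGAIN or a hard error: terminate instead of looping. */
        return done ? (ssize_t)done : -errno;
    }
    return (ssize_t)done;
}
```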
    • io_uring: retain iov_iter state over io_read/io_write calls · ff6165b2
      Jens Axboe committed
      Instead of maintaining (and setting/remembering) iov_iter size and
      segment counts, just put the iov_iter in the async part of the IO
      structure.
      
      This is mostly a preparation patch for doing appropriate internal retries
      for short reads, but it also cleans up the state handling nicely and
      simplifies it quite a bit.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ff6165b2
  14. 13 Aug, 2020 1 commit
  15. 12 Aug, 2020 1 commit
    • io_uring: fail poll arm on queue proc failure · a36da65c
      Jens Axboe committed
      Check the ipt.error value, it must have been either cleared to zero or
      set to another error than the default -EINVAL if we don't go through the
      waitqueue proc addition. Just give up on poll at that point and return
      failure, which will fall back to async work.
      
      io_poll_add() doesn't suffer from this failure case, as it returns the
      error value directly.
      
      Cc: stable@vger.kernel.org # v5.7+
      Reported-by: syzbot+a730016dc0bdce4f6ff5@syzkaller.appspotmail.com
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a36da65c
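The check relies on a sentinel: the error field starts out as -EINVAL and is only overwritten if the queue callback actually runs. A simplified model of that idea follows. All names are hypothetical, and the -ENOMEM failure value is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct poll_table_model {
    int error;  /* starts at -EINVAL; the queue callback overwrites it */
};

/* Hypothetical queue callback: records success or an allocation failure. */
void queue_proc(struct poll_table_model *pt, bool alloc_ok)
{
    assert(pt != NULL);
    pt->error = alloc_ok ? 0 : -ENOMEM;
}

/*
 * Try to arm poll. `calls_queue_proc` models whether the file's poll
 * implementation registered a wait queue at all. Returns true only if the
 * sentinel was cleared; on false the caller gives up on poll and falls
 * back to async work.
 */
bool arm_poll(bool calls_queue_proc, bool alloc_ok)
{
    struct poll_table_model pt = { .error = -EINVAL };

    if (calls_queue_proc)
        queue_proc(&pt, alloc_ok);
    return pt.error == 0;
}
```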
  16. 11 Aug, 2020 5 commits
    • io_uring: hold 'ctx' reference around task_work queue + execute · 6d816e08
      Jens Axboe committed
      We're holding the request reference, but we need to go one higher
      to ensure that the ctx remains valid after the request has finished.
      If the ring is closed with pending task_work inflight, and the
      given io_kiocb finishes sync during issue, then we need a reference
      to the ring itself around the task_work execution cycle.
      
      Cc: stable@vger.kernel.org # v5.7+
      Reported-by: syzbot+9b260fc33297966f5a8e@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6d816e08
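The fix amounts to pinning the ring for the lifetime of the deferred work, not just the request. Below is a toy, single-threaded model with a plain integer refcount. Every name in it is hypothetical, and the real code uses the kernel's own reference counting primitives.

```c
#include <assert.h>
#include <stdbool.h>

struct ring_ctx { int refs; bool freed; };
struct request { struct ring_ctx *ctx; };

void ctx_get(struct ring_ctx *ctx)
{
    assert(!ctx->freed);
    ctx->refs++;
}

void ctx_put(struct ring_ctx *ctx)
{
    assert(ctx->refs > 0);
    if (--ctx->refs == 0)
        ctx->freed = true;  /* stands in for freeing the ring */
}

/* Queue side: pin the ring before handing the request to deferred work. */
void queue_task_work(struct request *req)
{
    ctx_get(req->ctx);
}

/* Execute side: copy ctx first; the request may be gone once it completes. */
void run_task_work(struct request *req)
{
    struct ring_ctx *ctx = req->ctx;

    /* ... complete the request; req must not be touched afterwards ... */
    assert(!ctx->freed);    /* still valid thanks to our reference */
    ctx_put(ctx);
}
```

In this model, closing the ring while work is pending only drops the count to one. The ring is released when the deferred work drops the last reference.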
    • io_uring: defer file table grabbing request cleanup for locked requests · 51a4cc11
      Jens Axboe committed
      If we're in the error path failing links and we have a link that has
      grabbed a reference to the fs_struct, then we cannot safely drop our
      reference to the table if we already hold the completion lock. This
      adds a hardirq dependency to the fs_struct->lock, which it currently
      doesn't have.
      
      Defer the final cleanup and free of such requests to avoid adding this
      dependency.
      
      Reported-by: syzbot+ef4b654b49ed7ff049bf@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      51a4cc11
    • io_uring: add missing REQ_F_COMP_LOCKED for nested requests · 9b7adba9
      Jens Axboe committed
      When we traverse into failing links or timeouts, we need to propagate
      the REQ_F_COMP_LOCKED flag so that we correctly signal to the completion
      side that we already hold the completion lock.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9b7adba9
    • io_uring: fix recursive completion locking on oveflow flush · 7271ef3a
      Jens Axboe committed
      syzbot reports a scenario where we recurse on the completion lock
      when flushing an overflow:
      
      1 lock held by syz-executor287/6816:
       #0: ffff888093cdb4d8 (&ctx->completion_lock){....}-{2:2}, at: io_cqring_overflow_flush+0xc6/0xab0 fs/io_uring.c:1333
      
      stack backtrace:
      CPU: 1 PID: 6816 Comm: syz-executor287 Not tainted 5.8.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1f0/0x31e lib/dump_stack.c:118
       print_deadlock_bug kernel/locking/lockdep.c:2391 [inline]
       check_deadlock kernel/locking/lockdep.c:2432 [inline]
       validate_chain+0x69a4/0x88a0 kernel/locking/lockdep.c:3202
       __lock_acquire+0x1161/0x2ab0 kernel/locking/lockdep.c:4426
       lock_acquire+0x160/0x730 kernel/locking/lockdep.c:5005
       __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
       _raw_spin_lock_irq+0x67/0x80 kernel/locking/spinlock.c:167
       spin_lock_irq include/linux/spinlock.h:379 [inline]
       io_queue_linked_timeout fs/io_uring.c:5928 [inline]
       __io_queue_async_work fs/io_uring.c:1192 [inline]
       __io_queue_deferred+0x36a/0x790 fs/io_uring.c:1237
       io_cqring_overflow_flush+0x774/0xab0 fs/io_uring.c:1359
       io_ring_ctx_wait_and_kill+0x2a1/0x570 fs/io_uring.c:7808
       io_uring_release+0x59/0x70 fs/io_uring.c:7829
       __fput+0x34f/0x7b0 fs/file_table.c:281
       task_work_run+0x137/0x1c0 kernel/task_work.c:135
       exit_task_work include/linux/task_work.h:25 [inline]
       do_exit+0x5f3/0x1f20 kernel/exit.c:806
       do_group_exit+0x161/0x2d0 kernel/exit.c:903
       __do_sys_exit_group+0x13/0x20 kernel/exit.c:914
       __se_sys_exit_group+0x10/0x10 kernel/exit.c:912
       __x64_sys_exit_group+0x37/0x40 kernel/exit.c:912
       do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by passing back the link from __io_queue_async_work(), and
      then let the caller handle the queueing of the link. Take care to also
      punt the submission reference put to the caller, as we're holding the
      completion lock for the __io_queue_defer() case. Hence we need to mark
      the io_kiocb appropriately for that case.
      
      Reported-by: syzbot+996f91b6ec3812c48042@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7271ef3a
    • io_uring: use TWA_SIGNAL for task_work uncondtionally · 0ba9c9ed
      Jens Axboe committed
      An earlier commit:
      
      b7db41c9 ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")
      
      ensured that we didn't get stuck waiting for eventfd reads when it's
      registered with the io_uring ring for event notification, but we still
      have cases where the task can be waiting on other events in the kernel and
      need a bigger nudge to make forward progress. Or the task could be in the
      kernel and running, but on its way to blocking.
      
      This means that TWA_RESUME cannot reliably be used to ensure we make
      progress. Use TWA_SIGNAL unconditionally.
      
      Cc: stable@vger.kernel.org # v5.7+
      Reported-by: Josef <josef.grieb@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0ba9c9ed
  17. 06 Aug, 2020 2 commits
  18. 05 Aug, 2020 1 commit
    • io_uring: Fix NULL pointer dereference in loop_rw_iter() · 2dd2111d
      Guoyu Huang committed
      loop_rw_iter() does not check whether the file has a read or
      write function. This can lead to a NULL pointer dereference
      when the user passes in a file descriptor that does not have a
      read or write function.
      
      The crash log looks like this:
      
      [   99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [   99.835364] #PF: supervisor instruction fetch in kernel mode
      [   99.836522] #PF: error_code(0x0010) - not-present page
      [   99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0
      [   99.839649] Oops: 0010 [#2] SMP PTI
      [   99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G      D           5.8.0 #2
      [   99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
      [   99.845140] RIP: 0010:0x0
      [   99.845840] Code: Bad RIP value.
      [   99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
      [   99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208
      [   99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300
      [   99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
      [   99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
      [   99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68
      [   99.856914] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
      [   99.858651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
      [   99.861979] Call Trace:
      [   99.862617]  loop_rw_iter.part.0+0xad/0x110
      [   99.863838]  io_write+0x2ae/0x380
      [   99.864644]  ? kvm_sched_clock_read+0x11/0x20
      [   99.865595]  ? sched_clock+0x9/0x10
      [   99.866453]  ? sched_clock_cpu+0x11/0xb0
      [   99.867326]  ? newidle_balance+0x1d4/0x3c0
      [   99.868283]  io_issue_sqe+0xd8f/0x1340
      [   99.869216]  ? __switch_to+0x7f/0x450
      [   99.870280]  ? __switch_to_asm+0x42/0x70
      [   99.871254]  ? __switch_to_asm+0x36/0x70
      [   99.872133]  ? lock_timer_base+0x72/0xa0
      [   99.873155]  ? switch_mm_irqs_off+0x1bf/0x420
      [   99.874152]  io_wq_submit_work+0x64/0x180
      [   99.875192]  ? kthread_use_mm+0x71/0x100
      [   99.876132]  io_worker_handle_work+0x267/0x440
      [   99.877233]  io_wqe_worker+0x297/0x350
      [   99.878145]  kthread+0x112/0x150
      [   99.878849]  ? __io_worker_unuse+0x100/0x100
      [   99.879935]  ? kthread_park+0x90/0x90
      [   99.880874]  ret_from_fork+0x22/0x30
      [   99.881679] Modules linked in:
      [   99.882493] CR2: 0000000000000000
      [   99.883324] ---[ end trace 4453745f4673190b ]---
      [   99.884289] RIP: 0010:0x0
      [   99.884837] Code: Bad RIP value.
      [   99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
      [   99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608
      [   99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00
      [   99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
      [   99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
      [   99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68
      [   99.896079] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
      [   99.898017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
      
      Fixes: 32960613 ("io_uring: correctly handle non ->{read,write}_iter() file_operations")
      Cc: stable@vger.kernel.org
      Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2dd2111d
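The class of bug is calling through a function pointer that a file type never set. A small userspace sketch of the guard follows. `struct file_ops_model`, `do_write()` and `sink_write()` are hypothetical, and the -EINVAL return value is chosen for illustration rather than taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, cut-down analogue of a file operations table. */
struct file_ops_model {
    ssize_t (*read)(char *buf, size_t len);
    ssize_t (*write)(const char *buf, size_t len);
};

/* Call the plain write handler, or fail cleanly if the file has none. */
ssize_t do_write(const struct file_ops_model *ops, const char *buf, size_t len)
{
    assert(ops != NULL);
    if (!ops->write)
        return -EINVAL;     /* error code chosen for illustration */
    return ops->write(buf, len);
}

/* Example handler that pretends to consume everything it is given. */
ssize_t sink_write(const char *buf, size_t len)
{
    (void)buf;
    return (ssize_t)len;
}
```

Without the check, the call would jump to address 0, which matches the `RIP: 0010:0x0` line in the crash log above.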
  19. 04 Aug, 2020 2 commits
  20. 02 Aug, 2020 1 commit
  21. 31 Jul, 2020 4 commits
    • io_uring: don't touch 'ctx' after installing file descriptor · d1719f70
      Jens Axboe committed
      As soon as we install the file descriptor, we have to assume that it
      can get arbitrarily closed. We currently account memory (and note that
      we did) after installing the ring fd, which means that it could be a
      potential use-after-free condition if the fd is closed right after
      being installed, but before we fiddle with the ctx.
      
      In fact, syzbot reported this exact scenario:
      
      BUG: KASAN: use-after-free in io_account_mem fs/io_uring.c:7397 [inline]
      BUG: KASAN: use-after-free in io_uring_create fs/io_uring.c:8369 [inline]
      BUG: KASAN: use-after-free in io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
      Read of size 1 at addr ffff888087a41044 by task syz-executor.5/18145
      
      CPU: 0 PID: 18145 Comm: syz-executor.5 Not tainted 5.8.0-rc7-next-20200729-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x18f/0x20d lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
       __kasan_report mm/kasan/report.c:513 [inline]
       kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
       io_account_mem fs/io_uring.c:7397 [inline]
       io_uring_create fs/io_uring.c:8369 [inline]
       io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45c429
      Code: 8d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 5b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f8f121d0c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 0000000000008540 RCX: 000000000045c429
      RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000196
      RBP: 000000000078bf38 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000078bf0c
      R13: 00007fff86698cff R14: 00007f8f121d19c0 R15: 000000000078bf0c
      
      Move the accounting of the ring used locked memory before we get and
      install the ring file descriptor.
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+9d46305e76057f30c74e@syzkaller.appspotmail.com
      Fixes: 30975825 ("io_uring: report pinned memory usage")
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d1719f70
    • io_uring: get rid of atomic FAA for cq_timeouts · 01cec8c1
      Pavel Begunkov committed
      If ->cq_timeouts modifications are done under ->completion_lock, we
      don't really need any fetch-and-add or other complex atomics. Replace it
      with a non-atomic FAA, which saves an implicit full memory barrier.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      01cec8c1
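The trade-off can be illustrated in userspace with C11 atomics and a pthread mutex standing in for the kernel primitives. This is a sketch with hypothetical names, not the patch itself: once every writer already holds the lock, the atomic read-modify-write adds cost without adding safety for those writers.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct counters {
    pthread_mutex_t lock;
    atomic_uint atomic_timeouts;  /* safe without the lock, but costlier */
    unsigned int plain_timeouts;  /* only modified with `lock` held */
};

/* Before: sequentially consistent atomic read-modify-write. */
void note_timeout_atomic(struct counters *c)
{
    assert(c != NULL);
    atomic_fetch_add(&c->atomic_timeouts, 1);
}

/* After: the caller already holds c->lock, so a plain increment suffices. */
void note_timeout_locked(struct counters *c)
{
    c->plain_timeouts++;
}

/* Completion path that takes the lock and bumps the plain counter. */
void complete_timeout(struct counters *c)
{
    assert(c != NULL);
    pthread_mutex_lock(&c->lock);
    note_timeout_locked(c);
    pthread_mutex_unlock(&c->lock);
}
```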
    • io_uring: consolidate *_check_overflow accounting · 46930143
      Pavel Begunkov committed
      Add a helper to mark ctx->{cq,sq}_check_overflow to get rid of
      duplicates; it's clearer to check cq_overflow_list directly anyway.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      46930143
    • io_uring: fix stalled deferred requests · dd9dfcdf
      Pavel Begunkov committed
      Always do io_commit_cqring() after completing a request, even if it was
      accounted as overflowed on the CQ side. Failing to do that may lead to
      deferred requests not being pushed when needed, stalling the whole
      ring.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dd9dfcdf