1. 11 Aug, 2020 3 commits
    • io_uring: add missing REQ_F_COMP_LOCKED for nested requests · 9b7adba9
      Committed by Jens Axboe
      When we traverse into failing links or timeouts, we need to ensure we
      propagate the REQ_F_COMP_LOCKED flag to ensure that we correctly signal
      to the completion side that we already hold the completion lock.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix recursive completion locking on overflow flush · 7271ef3a
      Committed by Jens Axboe
      syzbot reports a scenario where we recurse on the completion lock
      when flushing an overflow:
      
      1 lock held by syz-executor287/6816:
       #0: ffff888093cdb4d8 (&ctx->completion_lock){....}-{2:2}, at: io_cqring_overflow_flush+0xc6/0xab0 fs/io_uring.c:1333
      
      stack backtrace:
      CPU: 1 PID: 6816 Comm: syz-executor287 Not tainted 5.8.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1f0/0x31e lib/dump_stack.c:118
       print_deadlock_bug kernel/locking/lockdep.c:2391 [inline]
       check_deadlock kernel/locking/lockdep.c:2432 [inline]
       validate_chain+0x69a4/0x88a0 kernel/locking/lockdep.c:3202
       __lock_acquire+0x1161/0x2ab0 kernel/locking/lockdep.c:4426
       lock_acquire+0x160/0x730 kernel/locking/lockdep.c:5005
       __raw_spin_lock_irq include/linux/spinlock_api_smp.h:128 [inline]
       _raw_spin_lock_irq+0x67/0x80 kernel/locking/spinlock.c:167
       spin_lock_irq include/linux/spinlock.h:379 [inline]
       io_queue_linked_timeout fs/io_uring.c:5928 [inline]
       __io_queue_async_work fs/io_uring.c:1192 [inline]
       __io_queue_deferred+0x36a/0x790 fs/io_uring.c:1237
       io_cqring_overflow_flush+0x774/0xab0 fs/io_uring.c:1359
       io_ring_ctx_wait_and_kill+0x2a1/0x570 fs/io_uring.c:7808
       io_uring_release+0x59/0x70 fs/io_uring.c:7829
       __fput+0x34f/0x7b0 fs/file_table.c:281
       task_work_run+0x137/0x1c0 kernel/task_work.c:135
       exit_task_work include/linux/task_work.h:25 [inline]
       do_exit+0x5f3/0x1f20 kernel/exit.c:806
       do_group_exit+0x161/0x2d0 kernel/exit.c:903
       __do_sys_exit_group+0x13/0x20 kernel/exit.c:914
       __se_sys_exit_group+0x10/0x10 kernel/exit.c:912
       __x64_sys_exit_group+0x37/0x40 kernel/exit.c:912
       do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by passing back the link from __io_queue_async_work(), and
      then let the caller handle the queueing of the link. Take care to also
      punt the submission reference put to the caller, as we're holding the
      completion lock for the __io_queue_deferred() case. Hence we need to mark
      the io_kiocb appropriately for that case.
      
      Reported-by: syzbot+996f91b6ec3812c48042@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
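The idea of handing the link back to the caller can be sketched in userspace C. This is only a toy model under stated assumptions: `struct ctx`, `struct req`, `lock()`, `flush_deferred_buggy()` and `flush_deferred_fixed()` are hypothetical names, and the "lock" is a flag that counts recursive acquisitions instead of deadlocking. It is not the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's types. */
struct ctx {
    bool completion_locked;  /* models a non-recursive spinlock */
    int  recursion_errors;   /* attempts to take the lock while it is held */
    int  links_queued;
};

struct req {
    struct req *link;        /* linked timeout, if any */
};

static void lock(struct ctx *c)
{
    if (c->completion_locked)
        c->recursion_errors++;   /* a real spinlock would deadlock here */
    c->completion_locked = true;
}

static void unlock(struct ctx *c)
{
    c->completion_locked = false;
}

/* Variant for callers that already hold the completion lock. */
static void queue_linked_timeout_locked(struct ctx *c, struct req *link)
{
    (void)link;
    c->links_queued++;
}

/* Variant for callers that do not hold the completion lock. */
static void queue_linked_timeout(struct ctx *c, struct req *link)
{
    lock(c);
    queue_linked_timeout_locked(c, link);
    unlock(c);
}

/* Buggy shape: queues the link itself, unaware of the caller's lock state. */
static void queue_async_work_buggy(struct ctx *c, struct req *r)
{
    if (r->link)
        queue_linked_timeout(c, r->link);
}

/* Fixed shape: hand the link back and let the caller decide how to queue it. */
static struct req *queue_async_work(struct ctx *c, struct req *r)
{
    (void)c;
    return r->link;
}

/* Deferred flush that runs with the completion lock held (buggy). */
static void flush_deferred_buggy(struct ctx *c, struct req *r)
{
    lock(c);
    queue_async_work_buggy(c, r);
    unlock(c);
}

/* Deferred flush that runs with the completion lock held (fixed). */
static void flush_deferred_fixed(struct ctx *c, struct req *r)
{
    struct req *link;

    lock(c);
    link = queue_async_work(c, r);
    if (link)
        queue_linked_timeout_locked(c, link);  /* caller knows the lock is held */
    unlock(c);
}
```

In this model the buggy flush records one recursive acquisition, while the fixed flush queues the link without touching the lock a second time.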
    • io_uring: use TWA_SIGNAL for task_work unconditionally · 0ba9c9ed
      Committed by Jens Axboe
      An earlier commit:
      
      b7db41c9 ("io_uring: fix regression with always ignoring signals in io_cqring_wait()")
      
      ensured that we didn't get stuck waiting for eventfd reads when it's
      registered with the io_uring ring for event notification, but we still
      have cases where the task can be waiting on other events in the kernel and
      need a bigger nudge to make forward progress. Or the task could be in the
      kernel and running, but on its way to blocking.
      
      This means that TWA_RESUME cannot reliably be used to ensure we make
      progress. Use TWA_SIGNAL unconditionally.
      
      Cc: stable@vger.kernel.org # v5.7+
      Reported-by: Josef <josef.grieb@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 06 Aug, 2020 2 commits
  3. 05 Aug, 2020 1 commit
    • io_uring: Fix NULL pointer dereference in loop_rw_iter() · 2dd2111d
      Committed by Guoyu Huang
      loop_rw_iter() does not check whether the file has a read or
      write function. This can lead to a NULL pointer dereference
      when the user passes in a file descriptor that has no
      read or write function.
      
      The crash log looks like this:
      
      [   99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [   99.835364] #PF: supervisor instruction fetch in kernel mode
      [   99.836522] #PF: error_code(0x0010) - not-present page
      [   99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0
      [   99.839649] Oops: 0010 [#2] SMP PTI
      [   99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G      D           5.8.0 #2
      [   99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
      [   99.845140] RIP: 0010:0x0
      [   99.845840] Code: Bad RIP value.
      [   99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
      [   99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208
      [   99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300
      [   99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
      [   99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
      [   99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68
      [   99.856914] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
      [   99.858651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
      [   99.861979] Call Trace:
      [   99.862617]  loop_rw_iter.part.0+0xad/0x110
      [   99.863838]  io_write+0x2ae/0x380
      [   99.864644]  ? kvm_sched_clock_read+0x11/0x20
      [   99.865595]  ? sched_clock+0x9/0x10
      [   99.866453]  ? sched_clock_cpu+0x11/0xb0
      [   99.867326]  ? newidle_balance+0x1d4/0x3c0
      [   99.868283]  io_issue_sqe+0xd8f/0x1340
      [   99.869216]  ? __switch_to+0x7f/0x450
      [   99.870280]  ? __switch_to_asm+0x42/0x70
      [   99.871254]  ? __switch_to_asm+0x36/0x70
      [   99.872133]  ? lock_timer_base+0x72/0xa0
      [   99.873155]  ? switch_mm_irqs_off+0x1bf/0x420
      [   99.874152]  io_wq_submit_work+0x64/0x180
      [   99.875192]  ? kthread_use_mm+0x71/0x100
      [   99.876132]  io_worker_handle_work+0x267/0x440
      [   99.877233]  io_wqe_worker+0x297/0x350
      [   99.878145]  kthread+0x112/0x150
      [   99.878849]  ? __io_worker_unuse+0x100/0x100
      [   99.879935]  ? kthread_park+0x90/0x90
      [   99.880874]  ret_from_fork+0x22/0x30
      [   99.881679] Modules linked in:
      [   99.882493] CR2: 0000000000000000
      [   99.883324] ---[ end trace 4453745f4673190b ]---
      [   99.884289] RIP: 0010:0x0
      [   99.884837] Code: Bad RIP value.
      [   99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
      [   99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608
      [   99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00
      [   99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
      [   99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
      [   99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68
      [   99.896079] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
      [   99.898017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
      
      Fixes: 32960613 ("io_uring: correctly handle non ->{read,write}_iter() file_operations")
      Cc: stable@vger.kernel.org
      Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
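The missing check can be illustrated with a small userspace sketch. The `struct file_ops` table, `loop_write()` and `discard_write()` below are hypothetical names, and returning `-EINVAL` for a missing handler is an assumption made for the example rather than something the commit message states.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical operations table; either handler may be NULL. */
struct file_ops {
    ssize_t (*read)(char *buf, size_t len);
    ssize_t (*write)(const char *buf, size_t len);
};

/* Example handler that pretends to consume the whole buffer. */
static ssize_t discard_write(const char *buf, size_t len)
{
    (void)buf;
    return (ssize_t)len;
}

/* Check the handler before calling it, instead of jumping to address 0. */
static ssize_t loop_write(const struct file_ops *ops, const char *buf, size_t len)
{
    if (!ops->write)
        return -EINVAL;  /* assumed error code for "no write handler" */
    return ops->write(buf, len);
}
```

Calling `loop_write()` on a table with no `write` handler returns an error, where an unchecked call would fetch an instruction from a NULL address as in the oops above.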
  4. 04 Aug, 2020 2 commits
  5. 02 Aug, 2020 1 commit
  6. 31 Jul, 2020 7 commits
    • io_uring: don't touch 'ctx' after installing file descriptor · d1719f70
      Committed by Jens Axboe
      As soon as we install the file descriptor, we have to assume that it
      can get arbitrarily closed. We currently account memory (and note that
      we did) after installing the ring fd, which means that it could be a
      potential use-after-free condition if the fd is closed right after
      being installed, but before we fiddle with the ctx.
      
      In fact, syzbot reported this exact scenario:
      
      BUG: KASAN: use-after-free in io_account_mem fs/io_uring.c:7397 [inline]
      BUG: KASAN: use-after-free in io_uring_create fs/io_uring.c:8369 [inline]
      BUG: KASAN: use-after-free in io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
      Read of size 1 at addr ffff888087a41044 by task syz-executor.5/18145
      
      CPU: 0 PID: 18145 Comm: syz-executor.5 Not tainted 5.8.0-rc7-next-20200729-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x18f/0x20d lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
       __kasan_report mm/kasan/report.c:513 [inline]
       kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
       io_account_mem fs/io_uring.c:7397 [inline]
       io_uring_create fs/io_uring.c:8369 [inline]
       io_uring_setup+0x2797/0x2910 fs/io_uring.c:8400
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45c429
      Code: 8d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 5b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f8f121d0c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 0000000000008540 RCX: 000000000045c429
      RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000196
      RBP: 000000000078bf38 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000078bf0c
      R13: 00007fff86698cff R14: 00007f8f121d19c0 R15: 000000000078bf0c
      
      Move the accounting of the ring used locked memory before we get and
      install the ring file descriptor.
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+9d46305e76057f30c74e@syzkaller.appspotmail.com
      Fixes: 30975825 ("io_uring: report pinned memory usage")
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
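The ordering rule (finish every access to the object before publishing it) can be sketched as follows. `struct ring`, `ring_create()` and `install_then_close()` are hypothetical names. The install callback frees the object immediately, standing in for another thread closing the fd right after it becomes visible.

```c
#include <stdlib.h>

/* Hypothetical ring object; 'accounted' stands in for memory accounting state. */
struct ring {
    int accounted;
};

typedef void (*install_fn)(struct ring *r);

/* Records what the installer observed, for inspection after the fact. */
static int seen_accounted = -1;

/* Models a racing close(): the ring is freed as soon as it is published. */
static void install_then_close(struct ring *r)
{
    seen_accounted = r->accounted;
    free(r);
}

static int ring_create(install_fn install)
{
    struct ring *r = calloc(1, sizeof(*r));

    if (!r)
        return -1;

    r->accounted = 1;  /* all accounting happens before publication */
    install(r);        /* publication is the last step */
    return 0;          /* r must not be dereferenced from here on */
}
```

With accounting moved ahead of the install step, the creator never needs to touch the object once another party is able to free it.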
    • io_uring: get rid of atomic FAA for cq_timeouts · 01cec8c1
      Committed by Pavel Begunkov
      If ->cq_timeouts modifications are done under ->completion_lock, we
      don't really need any fetch-and-add or other complex atomics. Replace it
      with a non-atomic FAA, which saves an implicit full memory barrier.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
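A userspace analogue of this change, using C11 atomics and a pthread mutex in place of the kernel primitives, might look like the sketch below. The struct and function names are hypothetical. The point is that writers serialized by the lock can use a relaxed load followed by a relaxed store instead of an atomic read-modify-write, while lock-free readers still see untorn values.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical context; all writers to cq_timeouts hold completion_lock. */
struct ctx {
    pthread_mutex_t completion_lock;
    atomic_uint     cq_timeouts;
};

static void ctx_init(struct ctx *c)
{
    pthread_mutex_init(&c->completion_lock, NULL);
    atomic_init(&c->cq_timeouts, 0);
}

/* Increment under the lock with a plain load + store, not a fetch-and-add. */
static void timeout_fired(struct ctx *c)
{
    unsigned v;

    pthread_mutex_lock(&c->completion_lock);
    v = atomic_load_explicit(&c->cq_timeouts, memory_order_relaxed);
    atomic_store_explicit(&c->cq_timeouts, v + 1, memory_order_relaxed);
    pthread_mutex_unlock(&c->completion_lock);
}

/* Readers may sample the counter without taking the lock. */
static unsigned timeouts_seen(struct ctx *c)
{
    return atomic_load_explicit(&c->cq_timeouts, memory_order_relaxed);
}
```

This is only safe because every writer holds the same lock. A writer that bypassed the lock could lose updates.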
    • io_uring: consolidate *_check_overflow accounting · 46930143
      Committed by Pavel Begunkov
      Add a helper to mark ctx->{cq,sq}_check_overflow to get rid of
      duplicates, and it's clearer to check cq_overflow_list directly anyway.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix stalled deferred requests · dd9dfcdf
      Committed by Pavel Begunkov
      Always do io_commit_cqring() after completing a request, even if it was
      accounted as overflowed on the CQ side. Failing to do that may lead to
      deferred requests not being pushed when needed, and so stall the whole
      ring.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
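The control-flow point (commit on every completion path, including overflow) can be modelled with a toy ring. Everything here is a hypothetical simplification: `struct ring`, `commit_cqring()` and `complete_request()` are illustrative names, and the assumption that committing is what pushes deferred requests is taken from the commit message, not from the kernel source.

```c
/* Hypothetical, heavily simplified completion ring. */
struct ring {
    unsigned cq_used;     /* completion entries currently posted */
    unsigned cq_size;     /* capacity of the completion queue */
    unsigned overflowed;  /* completions that did not fit */
    unsigned deferred;    /* requests waiting to be pushed */
    unsigned pushed;      /* deferred requests that have been pushed */
};

/* Committing is assumed to be the point where deferred requests get pushed. */
static void commit_cqring(struct ring *r)
{
    r->pushed += r->deferred;
    r->deferred = 0;
}

static void complete_request(struct ring *r)
{
    if (r->cq_used < r->cq_size)
        r->cq_used++;     /* normal path: post a completion entry */
    else
        r->overflowed++;  /* overflow path: only account for it */

    commit_cqring(r);     /* runs on both paths, so deferred work cannot stall */
}
```

If the commit call sat only on the normal path, a full completion queue would leave `deferred` stuck above zero, which is the stall described above.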
    • io_uring: fix racy overflow count reporting · b2bd1cf9
      Committed by Pavel Begunkov
      All ->cq_overflow modifications should be under completion_lock,
      otherwise it can report a wrong number to the userspace. Fix it in
      io_uring_cancel_files().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: deduplicate __io_complete_rw() · 81b68a5c
      Committed by Pavel Begunkov
      Call __io_complete_rw() in io_iopoll_queue() instead of hand coding it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: de-unionise io_kiocb · 010e8e6b
      Committed by Pavel Begunkov
      As io_kiocb has enough space, move ->work out of a union. It's safer
      this way and removes ->work memcpy bouncing.
      Also make the tabulation in struct io_kiocb consistent.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 25 Jul, 2020 24 commits