1. 18 Oct, 2021 (1 commit)
  2. 14 Oct, 2021 (1 commit)
  3. 02 Oct, 2021 (1 commit)
  4. 25 Sep, 2021 (8 commits)
  5. 15 Sep, 2021 (3 commits)
  6. 14 Sep, 2021 (3 commits)
  7. 13 Sep, 2021 (1 commit)
  8. 10 Sep, 2021 (1 commit)
  9. 09 Sep, 2021 (3 commits)
    • io_uring: fail links of cancelled timeouts · 2ae2eb9d
      Committed by Pavel Begunkov
      When we cancel a timeout we should mark it with REQ_F_FAIL, so
      linked requests are cancelled as well, but not queued for further
      execution.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/fff625b44eeced3a5cae79f60e6acf3fbdf8f990.1631192135.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: drop ctx->uring_lock before acquiring sqd->lock · 009ad9f0
      Committed by Jens Axboe
      The SQPOLL thread dictates the lock order, and we hold the ctx->uring_lock
      for all the registration opcodes. We also hold a ref to the ctx, and we
      do drop the lock for other reasons to quiesce, so it's fine to drop the
      ctx lock temporarily to grab the sqd->lock. This fixes the following
      lockdep splat:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-syzkaller #0 Not tainted
      ------------------------------------------------------
      syz-executor.5/25433 is trying to acquire lock:
      ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: io_register_iowq_max_workers fs/io_uring.c:10551 [inline]
      ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __io_uring_register fs/io_uring.c:10757 [inline]
      ffff888023426870 (&sqd->lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792
      
      but task is already holding lock:
      ffff8880885b40a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_register+0x2e1/0x2e70 fs/io_uring.c:10791
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
             __mutex_lock_common kernel/locking/mutex.c:596 [inline]
             __mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729
             __io_sq_thread fs/io_uring.c:7291 [inline]
             io_sq_thread+0x65a/0x1370 fs/io_uring.c:7368
             ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
      
      -> #0 (&sqd->lock){+.+.}-{3:3}:
             check_prev_add kernel/locking/lockdep.c:3051 [inline]
             check_prevs_add kernel/locking/lockdep.c:3174 [inline]
             validate_chain kernel/locking/lockdep.c:3789 [inline]
             __lock_acquire+0x2a07/0x54a0 kernel/locking/lockdep.c:5015
             lock_acquire kernel/locking/lockdep.c:5625 [inline]
             lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
             __mutex_lock_common kernel/locking/mutex.c:596 [inline]
             __mutex_lock+0x131/0x12f0 kernel/locking/mutex.c:729
             io_register_iowq_max_workers fs/io_uring.c:10551 [inline]
             __io_uring_register fs/io_uring.c:10757 [inline]
             __do_sys_io_uring_register+0x10aa/0x2e70 fs/io_uring.c:10792
             do_syscall_x64 arch/x86/entry/common.c:50 [inline]
             do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&ctx->uring_lock);
                                     lock(&sqd->lock);
                                     lock(&ctx->uring_lock);
        lock(&sqd->lock);
      
       *** DEADLOCK ***
      
      Fixes: 2e480058 ("io-wq: provide a way to limit max number of workers")
      Reported-by: syzbot+97fa56483f69d677969f@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix missing mb() before waitqueue_active · c57a91fb
      Committed by Pavel Begunkov
      In case of !SQPOLL, io_cqring_ev_posted_iopoll() doesn't provide a
      memory barrier required by waitqueue_active(&ctx->poll_wait). There is
      a wq_has_sleeper(), which does smp_mb() inside, but it's called only for
      SQPOLL.
      
      Fixes: 5fd46178 ("io_uring: be smarter about waking multiple CQ ring waiters")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/2982e53bcea2274006ed435ee2a77197107d8a29.1631130542.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 04 Sep, 2021 (1 commit)
    • io_uring: reexpand under-reexpanded iters · 89c2b3b7
      Committed by Pavel Begunkov
      [   74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900
      [   74.212778] Read of size 8 at addr ffff888025dc78b8 by task
      syz-executor.0/828
      [   74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted
      5.14.0-rc3-next-20210730 #1
      [   74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [   74.219033] Call Trace:
      [   74.219683]  dump_stack_lvl+0x8b/0xb3
      [   74.220706]  print_address_description.constprop.0+0x1f/0x140
      [   74.224226]  kasan_report.cold+0x7f/0x11b
      [   74.226085]  iov_iter_revert+0x809/0x900
      [   74.227960]  io_write+0x57d/0xe40
      [   74.232647]  io_issue_sqe+0x4da/0x6a80
      [   74.242578]  __io_queue_sqe+0x1ac/0xe60
      [   74.245358]  io_submit_sqes+0x3f6e/0x76a0
      [   74.248207]  __do_sys_io_uring_enter+0x90c/0x1a20
      [   74.257167]  do_syscall_64+0x3b/0x90
      [   74.257984]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      old_size = iov_iter_count();
      ...
      iov_iter_revert(old_size - iov_iter_count());
      
      If iov_iter_revert() is done based on the initial size as above, and the
      iter is truncated and not reexpanded in the middle, it miscalculates
      borders causing problems. This trace is due to no one reexpanding after
      generic_write_checks().
      
      Now iters store how many bytes have been truncated, so reexpand them to
      the initial state right before reverting.
      
      Cc: stable@vger.kernel.org
      Reported-by: Palash Oswal <oswalpalash@gmail.com>
      Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Reported-and-tested-by: syzbot+9671693590ef5aad8953@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  11. 03 Sep, 2021 (4 commits)
  12. 01 Sep, 2021 (4 commits)
    • io_uring: don't submit half-prepared drain request · b8ce1b9d
      Committed by Pavel Begunkov
      [ 3784.910888] BUG: kernel NULL pointer dereference, address: 0000000000000020
      [ 3784.910904] RIP: 0010:__io_file_supports_nowait+0x5/0xc0
      [ 3784.910926] Call Trace:
      [ 3784.910928]  ? io_read+0x17c/0x480
      [ 3784.910945]  io_issue_sqe+0xcb/0x1840
      [ 3784.910953]  __io_queue_sqe+0x44/0x300
      [ 3784.910959]  io_req_task_submit+0x27/0x70
      [ 3784.910962]  tctx_task_work+0xeb/0x1d0
      [ 3784.910966]  task_work_run+0x61/0xa0
      [ 3784.910968]  io_run_task_work_sig+0x53/0xa0
      [ 3784.910975]  __x64_sys_io_uring_enter+0x22/0x30
      [ 3784.910977]  do_syscall_64+0x3d/0x90
      [ 3784.910981]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      io_drain_req() goes before the checks for REQ_F_FAIL, which protect us
      from submitting an under-prepared request (e.g. one that failed in
      io_init_req()). Fail such drained requests as well.
      
      Fixes: a8295b98 ("io_uring: fix failed linkchain code logic")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/e411eb9924d47a131b1e200b26b675df0c2b7627.1630415423.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix queueing half-created requests · c6d3d9cb
      Committed by Pavel Begunkov
      [   27.259845] general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] SMP KASAN PTI
      [   27.261043] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
      [   27.263730] RIP: 0010:sock_from_file+0x20/0x90
      [   27.272444] Call Trace:
      [   27.272736]  io_sendmsg+0x98/0x600
      [   27.279216]  io_issue_sqe+0x498/0x68d0
      [   27.281142]  __io_queue_sqe+0xab/0xb50
      [   27.285830]  io_req_task_submit+0xbf/0x1b0
      [   27.286306]  tctx_task_work+0x178/0xad0
      [   27.288211]  task_work_run+0xe2/0x190
      [   27.288571]  exit_to_user_mode_prepare+0x1a1/0x1b0
      [   27.289041]  syscall_exit_to_user_mode+0x19/0x50
      [   27.289521]  do_syscall_64+0x48/0x90
      [   27.289871]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      io_req_complete_failed() -> io_req_complete_post() ->
      io_req_task_queue() still would try to enqueue a hard-linked request,
      which can be half prepared (e.g. failed init), so we can't allow
      that to happen.
      
      Fixes: a8295b98 ("io_uring: fix failed linkchain code logic")
      Reported-by: syzbot+f9704d1878e290eddf73@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/70b513848c1000f88bd75965504649c6bb1415c0.1630415423.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: retry in case of short read on block device · 7db30437
      Committed by Ming Lei
      In case of buffered reading from block device, when short read happens,
      we should retry to read more, otherwise the IO will be completed
      partially, for example, the following fio expects to read 2MB, but it
      can only read 1M or less bytes:
      
          fio --name=onessd --filename=/dev/nvme0n1 --filesize=2M \
      	--rw=randread --bs=2M --direct=0 --overwrite=0 --numjobs=1 \
      	--iodepth=1 --time_based=0 --runtime=2 --ioengine=io_uring \
      	--registerfiles --fixedbufs --gtod_reduce=1 --group_reporting
      
      Fix the issue by allowing short-read retry for block devices, which do
      set FMODE_BUF_RASYNC.
      
      Fixes: 9a173346 ("io_uring: fix short read retries for non-reg files")
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/20210821150751.1290434-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: IORING_OP_WRITE needs hash_reg_file set · 7b3188e7
      Committed by Jens Axboe
      During some testing, it became evident that using IORING_OP_WRITE doesn't
      hash buffered writes like the other write commands do. That's simply
      an oversight, and can cause performance regressions when doing buffered
      writes with this command.
      
      Correct that and add the flag, so that buffered writes are correctly
      hashed when using the non-iovec based write command.
      
      Cc: stable@vger.kernel.org
      Fixes: 3a6820f2 ("io_uring: add non-vectored read/write commands")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 30 Aug, 2021 (2 commits)
  14. 29 Aug, 2021 (2 commits)
    • io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts · 50c1df2b
      Committed by Jens Axboe
      Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather
      than the default CLOCK_MONOTONIC.
      
      Add an IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flag that
      allows timeouts and linked timeouts to use the selected clock source.
      
      Only one clock source may be selected, and we -EINVAL the request if more
      than one is given. If neither BOOTTIME nor REALTIME are selected, the
      previous default of MONOTONIC is used.
      
      Link: https://github.com/axboe/liburing/issues/369
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: provide a way to limit max number of workers · 2e480058
      Committed by Jens Axboe
      io-wq divides work into two categories:
      
      1) Work that completes in a bounded time, like reading from a regular file
         or a block device. This type of work is limited based on the size of
         the SQ ring.
      
      2) Work that may never complete, we call this unbounded work. The amount
         of workers here is just limited by RLIMIT_NPROC.
      
      For various use cases, it's handy to have the kernel limit the maximum
      number of pending workers for both categories. Provide a way to do so
      with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.
      
      IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
      the max worker count to what is being passed in for each category. The
      old values are returned into that same array. If 0 is being passed in for
      either category, it simply returns the current value.
      
      The value is capped at RLIMIT_NPROC. This actually isn't that important
      as it's more of a hint, if we're exceeding the value then our attempt
      to fork a new worker will fail. This happens naturally already if more
      than one node is in the system, as these values are per-node internally
      for io-wq.
      Reported-by: Johannes Lundberg <johalun0@gmail.com>
      Link: https://github.com/axboe/liburing/issues/420
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 27 Aug, 2021 (5 commits)