1. 08 April 2022, 6 commits
    • io_uring: nospec index for tags on files update · 34bb7718
      Committed by Pavel Begunkov
      Don't forget to array_index_nospec() the index before updating rsrc
      tags in __io_sqe_files_update(); just use the already safe and
      precalculated index @i.
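
      A hedged sketch of the pattern (helper names taken from the
      surrounding kernel code of this era, not a verbatim diff): clamp the
      user-controlled offset once with array_index_nospec(), then reuse
      that clamped index for every table the update touches.

      /* @i cannot be speculated out of bounds */
      i = array_index_nospec(up->offset + done, ctx->nr_user_files);
      /* use @i for the tag table too, not the raw offset */
      *io_get_tag_slot(data, i) = tag;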
      
      Fixes: c3bdad02 ("io_uring: add generic rsrc update with tags")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: implement compat handling for IORING_REGISTER_IOWQ_AFF · 0f5e4b83
      Committed by Eugene Syromiatnikov
      Similarly to the way it is done in the mbind syscall.
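
      A hedged sketch of the mbind-style dual path (compat_get_bitmap() and
      in_compat_syscall() are stock kernel helpers; the surrounding variable
      names are assumed):

      if (in_compat_syscall())
              ret = compat_get_bitmap(cpumask_bits(new_mask),
                                      (const compat_ulong_t __user *)arg,
                                      len * BITS_PER_BYTE);
      else
              ret = copy_from_user(new_mask, arg, len);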
      
      Cc: stable@vger.kernel.org # 5.14
      Fixes: fe76421d ("io_uring: allow user configurable IO thread CPU affinity")
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Revert "io_uring: Add support for napi_busy_poll" · cb318216
      Committed by Jens Axboe
      This reverts commit adc8682e.
      
      There's some discussion on the API not being as good as it can be.
      Rather than ship something and be stuck with it forever, let's revert
      the NAPI support for now and work on getting something sorted out
      for the next kernel release instead.
      
      Link: https://lore.kernel.org/io-uring/b7bbc124-8502-0ee9-d4c8-7c41b4487264@kernel.dk/
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: drop the old style inflight file tracking · d5361233
      Committed by Jens Axboe
      io_uring tracks requests that are referencing an io_uring descriptor to
      be able to cancel without worrying about loops in the references. Since
      we now assign the file at execution time, the easier approach is to drop
      a potentially problematic reference before we punt the request. This
      eliminates the need to special case these types of files beyond just
      marking them as such, and simplifies cancelation quite a bit.
      
      This also fixes a recent issue where an async punted tee operation
      with the io_uring descriptor as the output file would crash when
      attempting to get a reference to the file from the io-wq worker. We
      could have worked around that, but this is the much cleaner fix.
      
      Fixes: 6bf9c47a ("io_uring: defer file assignment")
      Reported-by: syzbot+c4b9303500a21750b250@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: defer file assignment · 6bf9c47a
      Committed by Jens Axboe
      If an application uses direct open or accept, it knows in advance what
      direct descriptor value it will get as it picks it itself. This allows
      combined requests such as:
      
      sqe = io_uring_get_sqe(ring);
      io_uring_prep_openat_direct(sqe, ..., file_slot);
      sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;
      
      sqe = io_uring_get_sqe(ring);
      io_uring_prep_read(sqe, file_slot, buf, buf_size, 0);
      sqe->flags |= IOSQE_FIXED_FILE;
      
      io_uring_submit(ring);
      
      where we prepare both a file open and read, and only get a completion
      event for the read when both have completed successfully.
      
      Currently links are fully prepared before the head is issued, but that
      fails if the dependent link needs a file assigned that isn't valid until
      the head has completed.
      
      Conversely, if the same chain is performed but the fixed file slot is
      already valid, then we would be unexpectedly returning data from the
      old file slot rather than the newly opened one. Make sure we're
      consistent here.
      
      Allow deferral of file setup, which makes this documented case work.
      
      Cc: stable@vger.kernel.org # v5.15+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: propagate issue_flags state down to file assignment · 5106dd6e
      Committed by Jens Axboe
      We'll need this in a future patch, when we could be assigning the file
      after the prep stage. While at it, get rid of the io_file_get() helper,
      it just makes the code harder to read.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 05 April 2022, 2 commits
  3. 04 April 2022, 1 commit
  4. 30 March 2022, 2 commits
  5. 26 March 2022, 1 commit
  6. 25 March 2022, 6 commits
  7. 24 March 2022, 3 commits
    • io_uring: remove IORING_CQE_F_MSG · 7ef66d18
      Committed by Jens Axboe
      This was introduced with the message ring opcode, but isn't strictly
      required for the request itself. The sender can encode what is needed
      in user_data, which is passed to the receiver. It's unclear if having
      a separate flag that essentially says "This CQE did not originate from
      an SQE on this ring" provides any real utility to applications. While
      we can always re-introduce a flag to provide this information, we cannot
      take it away at a later point in time.
      
      Remove the flag while we still can, before it's in a released kernel.
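
      For instance, an application that wants to distinguish ring-to-ring
      messages can reserve a bit in user_data itself. A hedged
      liburing-style sketch (APP_MSG_TAG and target_ring_fd are made-up
      application names, not kernel or liburing symbols):

      #define APP_MSG_TAG (1ULL << 63)  /* app-chosen marker, not a kernel flag */

      sqe = io_uring_get_sqe(ring);
      /* the CQE posted to target_ring_fd carries our tagged user_data */
      io_uring_prep_msg_ring(sqe, target_ring_fd, 0, APP_MSG_TAG | value, 0);
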
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add flag for disabling provided buffer recycling · 8a3e8ee5
      Committed by Jens Axboe
      If we need to continue doing this IO, then we don't want a potentially
      selected buffer recycled. Add a flag for that.
      
      Set this for recv/recvmsg if they do partial IO.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly · 7ba89d2a
      Committed by Jens Axboe
      We currently don't attempt to get the full asked-for length, even if
      MSG_WAITALL is set, if we get a partial receive. If we do see a partial
      receive, then just note how many bytes we did receive and return
      -EAGAIN to get it retried.
      
      The iov is advanced appropriately for the vector based case, and we
      manually bump the buffer and remainder for the non-vector case.
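
      As a userspace analogue of the semantics being enforced (a hedged
      sketch, not the kernel change itself), MSG_WAITALL means "retry
      partial receives, advancing the buffer by what already arrived":

      #include <sys/socket.h>

      static ssize_t recv_all(int fd, void *buf, size_t len)
      {
              size_t done = 0;

              while (done < len) {
                      ssize_t ret = recv(fd, (char *)buf + done, len - done, 0);
                      if (ret < 0)
                              return ret;   /* error */
                      if (ret == 0)
                              break;        /* peer closed early */
                      done += ret;
              }
              return done;
      }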
      
      Cc: stable@vger.kernel.org
      Reported-by: Constantine Gavrilov <constantine.gavrilov@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 23 March 2022, 3 commits
    • io_uring: don't recycle provided buffer if punted to async worker · 4d55f238
      Committed by Jens Axboe
      We only really need to recycle the buffer when going async for a file
      type that has an indefinite response time (e.g. non-file/bdev). And
      for files that need to arm poll, the async worker will arm poll anyway
      and the buffer will get recycled there.
      
      In that latter case, we're not holding ctx->uring_lock. Ensure we take
      the issue_flags into account and acquire it if we need to.
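
      A hedged sketch of the conditional locking (IO_URING_F_UNLOCKED is the
      issue_flags bit from this era of io_uring; the recycle helper is named
      hypothetically):

      if (issue_flags & IO_URING_F_UNLOCKED)
              mutex_lock(&ctx->uring_lock);
      io_recycle_buffer(req);  /* hypothetical name for the recycle step */
      if (issue_flags & IO_URING_F_UNLOCKED)
              mutex_unlock(&ctx->uring_lock);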
      
      Fixes: b1c62645 ("io_uring: recycle provided buffers if request goes async")
      Reported-by: Stefan Roesch <shr@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix assuming triggered poll waitqueue is the single poll · d89a4fac
      Committed by Jens Axboe
      syzbot reports a recent regression:
      
      BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650 kernel/sched/wait.c:101
      Read of size 8 at addr ffff888011e8a130 by task syz-executor413/3618
      
      CPU: 0 PID: 3618 Comm: syz-executor413 Tainted: G        W         5.17.0-syzkaller-01402-g8565d644 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x303 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       __wake_up_common+0x637/0x650 kernel/sched/wait.c:101
       __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:138
       tty_release+0x657/0x1200 drivers/tty/tty_io.c:1781
       __fput+0x286/0x9f0 fs/file_table.c:317
       task_work_run+0xdd/0x1a0 kernel/task_work.c:164
       exit_task_work include/linux/task_work.h:32 [inline]
       do_exit+0xaff/0x29d0 kernel/exit.c:806
       do_group_exit+0xd2/0x2f0 kernel/exit.c:936
       __do_sys_exit_group kernel/exit.c:947 [inline]
       __se_sys_exit_group kernel/exit.c:945 [inline]
       __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:945
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f439a1fac69
      
      which is due to leaving the request on the waitqueue mistakenly. The
      reproducer is using a tty device, which means we end up arming the same
      poll queue twice (it uses the same poll waitqueue for both), but in
      io_poll_wake() we always just clear REQ_F_SINGLE_POLL regardless of which
      entry triggered. This leaves one waitqueue potentially armed after we're
      done, which then blows up in tty when removal of the waitqueue is
      attempted.
      
      We have no room to store this information, so simply encode it in the
      wait_queue_entry->private where we store the io_kiocb request pointer.
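
      Since the io_kiocb pointer stored there is always at least 2-byte
      aligned, its low bit is free to record which poll entry triggered. A
      hedged sketch of the encoding (helper names illustrative):

      #define IO_WQE_F_DOUBLE 1UL  /* low pointer bit marks the second entry */

      static struct io_kiocb *wqe_to_req(struct wait_queue_entry *wqe)
      {
              unsigned long priv = (unsigned long)wqe->private;

              return (struct io_kiocb *)(priv & ~IO_WQE_F_DOUBLE);
      }

      static bool wqe_is_double(struct wait_queue_entry *wqe)
      {
              return (unsigned long)wqe->private & IO_WQE_F_DOUBLE;
      }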
      
      Fixes: 91eac1c6 ("io_uring: cache poll/double-poll state with a request flag")
      Reported-by: syzbot+09ad4050dd3a120bfccd@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: bump poll refs to full 31-bits · e2c0cb7c
      Committed by Jens Axboe
      The previous commit:
      
      61bc84c4 ("io_uring: remove poll entry from list when canceling all")
      
      removed a potential overflow condition for the poll references. They
      are currently limited to 20-bits, even if we have 31-bits available. The
      upper bit is used to mark for cancelation.
      
      Bump the poll ref space to 31-bits, making that kind of situation much
      harder to trigger in general. We'll separately add overflow checking
      and handling.
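
      A hedged sketch of the widened layout (the values follow the
      description above; the macro names are assumed):

      /* bit 31 flags cancelation; bits 0-30 count poll references */
      #define IO_POLL_CANCEL_FLAG     BIT(31)
      #define IO_POLL_REF_MASK        GENMASK(30, 0)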
      
      Fixes: aa43477b ("io_uring: poll rework")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 22 March 2022, 1 commit
    • io_uring: remove poll entry from list when canceling all · 61bc84c4
      Committed by Jens Axboe
      When the ring is exiting, as part of the shutdown, poll requests are
      removed. But io_poll_remove_all() does not remove entries when finding
      them, and since completions are done out-of-band, we can find and remove
      the same entry multiple times.
      
      We do guard the poll execution by poll ownership, but that does not
      exclude us from reissuing a new one once the previous removal ownership
      goes away.
      
      This can race with poll execution as well, where we then end up seeing
      req->apoll be NULL because a previous task_work requeue finished the
      request.
      
      Remove the poll entry when we find it and get ownership of it. This
      prevents multiple invocations from finding it.
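
      A hedged fragment of the idea (hash_del() is the stock kernel helper;
      the surrounding calls are paraphrased from the poll rework and may
      not match the diff exactly):

      if (io_poll_get_ownership(req)) {
              /* unhash now so a later cancelation pass cannot
               * find and reissue this same entry */
              hash_del(&req->hash_node);
              io_poll_cancel_req(req);
      }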
      
      Fixes: aa43477b ("io_uring: poll rework")
      Reported-by: Dylan Yudaken <dylany@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 21 March 2022, 2 commits
  11. 20 March 2022, 1 commit
    • io_uring: recycle provided before arming poll · abdad709
      Committed by Jens Axboe
      We currently have a race where we recycle the selected buffer if poll
      returns IO_APOLL_OK. But that's too late, as the poll could already be
      triggering or have triggered. If that race happens, then we're putting a
      buffer that's already being used.
      
      Fix this by recycling before we arm poll. This does mean that we'll
      sometimes almost instantly re-select the buffer, but it's rare enough in
      testing that it should not pose a performance issue.
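
      The resulting ordering in the async punt path, sketched with a
      hypothetical name for the recycle helper:

      io_recycle_buffer(req);          /* hypothetical: put back any selected buffer */
      ret = io_arm_poll_handler(req);  /* only then expose the request to poll */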
      
      Fixes: b1c62645 ("io_uring: recycle provided buffers if request goes async")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 19 March 2022, 2 commits
  13. 18 March 2022, 1 commit
    • io_uring: manage provided buffers strictly ordered · dbc7d452
      Committed by Jens Axboe
      Workloads using provided buffers benefit from using and returning
      buffers in the right order, and so do TLBs, for that matter. Manage the
      internal buffer list as a straight list, rather than using the head
      buffer as the insertion node. Use a hashed list for the buffer group
      IDs instead of an xarray; the overhead is much lower this way. xarray
      provides internal locking and other trickery that is handy for some
      use cases, but io_uring already locks internally for the buffer
      manipulation and needs none of that.
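
      A hedged sketch of the reorganized bookkeeping (field names
      illustrative, not quoted from the diff):

      /* one bucket per buffer group, found via a small hash on bgid */
      struct io_buffer_list {
              struct list_head buf_list;  /* provided buffers, kept in order */
              struct hlist_node list;     /* chained in the bgid hash */
              __u16 bgid;                 /* buffer group ID */
      };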
      
      This is good for about a 2% reduction in overhead, a combination of
      the improved management and the fact that the workload has an easier
      time bundling back provided buffers.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 17 March 2022, 9 commits