1. 30 Dec 2020, 1 commit
    • io_uring: don't assume mm is constant across submits · 77788775
      Jens Axboe authored
      If we COW the identity, we assume that ->mm never changes. But this
      isn't true if multiple processes end up sharing the ring. Hence, treat
      id->mm like any other per-process component when it comes to the
      identity mapping. This is pretty trivial: just move the existing grab
      into io_grab_identity() and include a check for the match.
      
      Cc: stable@vger.kernel.org # 5.10
      Fixes: 1e6fa521 ("io_uring: COW io_identity on mismatch")
      Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
      Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
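      A rough sketch of the shape of the fix (simplified from the 5.10-era
      fs/io_uring.c, not the verbatim diff): io_grab_identity() already
      compares the other identity fields against current, and ->mm joins
      that set, with the mm grab moved out of the per-submit path.

        static bool io_grab_identity(struct io_kiocb *req)
        {
            struct io_identity *id = req->work.identity;

            /* ... existing checks for creds, fs, files, etc. ... */

            if (!(req->work.flags & IO_WQ_WORK_MM)) {
                if (id->mm != current->mm)
                    return false;   /* mismatch: forces a COW of the identity */
                mmgrab(id->mm);     /* grab moved here from the submit path */
                req->work.flags |= IO_WQ_WORK_MM;
            }
            return true;
        }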
  2. 23 Dec 2020, 2 commits
  3. 22 Dec 2020, 1 commit
  4. 21 Dec 2020, 3 commits
  5. 19 Dec 2020, 1 commit
  6. 18 Dec 2020, 1 commit
    • io_uring: close a small race gap for files cancel · dfea9fce
      Pavel Begunkov authored
      The purpose of io_uring_cancel_files() is to wait for all requests
      matching ->files to go away or be cancelled. We should first drop the
      files of a request in io_req_drop_files() and only then make it
      undiscoverable to io_uring_cancel_files().
      
      First drop, then delete from the list. It's ok to leave req->id->files
      dangling, because it's not dereferenced by the cancellation code, only
      compared against. A canceller may go to sleep in the meantime, but it
      will be woken up by the wake_up() that follows in io_req_drop_files().
      
      Fixes: 0f212204 ("io_uring: don't rely on weak ->files references")
      Cc: <stable@vger.kernel.org> # 5.5+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
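      The resulting ordering inside io_req_drop_files(), as a simplified
      sketch (locking details and exact field paths abbreviated, not the
      verbatim diff):

        put_files_struct(req->work.identity->files);  /* 1. drop files first */
        req->work.flags &= ~IO_WQ_WORK_FILES;

        spin_lock_irqsave(&ctx->inflight_lock, flags);
        list_del(&req->inflight_entry);               /* 2. then hide the req */
        spin_unlock_irqrestore(&ctx->inflight_lock, flags);

        wake_up(&ctx->inflight_wait);                 /* 3. wake any canceller */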
  7. 17 Dec 2020, 7 commits
  8. 13 Dec 2020, 2 commits
  9. 11 Dec 2020, 1 commit
  10. 10 Dec 2020, 21 commits
    • io_uring: fix io_cqring_events()'s noflush · 59850d22
      Pavel Begunkov authored
      Checking !list_empty(&ctx->cq_overflow_list) around noflush in
      io_cqring_events() is racy: if the check fails but a request overflows
      just after it, io_cqring_overflow_flush() will still be called.
      
      Remove the second check; it shouldn't be a problem for performance,
      because there is a cq_check_overflow bit check just above.
      
      Cc: <stable@vger.kernel.org> # 5.5+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
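      A sketch of the resulting check (condensed from the 5.10-era code;
      only the cq_check_overflow bit gates the flush now):

        static unsigned io_cqring_events(struct io_ring_ctx *ctx, bool noflush)
        {
            if (test_bit(0, &ctx->cq_check_overflow)) {
                if (noflush)
                    return -1U;   /* tell the caller to flush and retry */
                io_cqring_overflow_flush(ctx, false, NULL, NULL);
            }
            smp_rmb();
            return ctx->cached_cq_tail - READ_ONCE(ctx->rings->cq.head);
        }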
    • io_uring: fix racy IOPOLL flush overflow · 634578f8
      Pavel Begunkov authored
      It's not safe to call io_cqring_overflow_flush() for IOPOLL mode
      without holding uring_lock, because it does synchronisation
      differently. Make sure we have it.
      
      As for io_ring_exit_work(), we don't even need it there, because
      io_ring_ctx_wait_and_kill() already sets the force flag, making all
      overflowed requests be dropped.
      
      Cc: <stable@vger.kernel.org> # 5.5+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
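      The rule the patch enforces, sketched: IOPOLL completions synchronise
      on uring_lock rather than completion_lock, so a flush on an IOPOLL
      ring must hold that mutex.

        if (ctx->flags & IORING_SETUP_IOPOLL)
            mutex_lock(&ctx->uring_lock);
        io_cqring_overflow_flush(ctx, false, NULL, NULL);
        if (ctx->flags & IORING_SETUP_IOPOLL)
            mutex_unlock(&ctx->uring_lock);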
    • io_uring: fix racy IOPOLL completions · 31bff9a5
      Pavel Begunkov authored
      IOPOLL allows buffer remove/provide requests, but they don't
      synchronise by the rules of IOPOLL; namely, they have to hold
      uring_lock.
      
      Cc: <stable@vger.kernel.org> # 5.7+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
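      Sketch of the fix (helper names as in the 5.10-era fs/io_uring.c;
      simplified): on an IOPOLL ring, post the buffer request's completion
      while uring_lock is still held.

        if (ctx->flags & IORING_SETUP_IOPOLL) {
            /* complete under uring_lock, as IOPOLL reaping expects */
            __io_req_complete(req, ret, 0, cs);
            io_ring_submit_unlock(ctx, !force_nonblock);
        } else {
            io_ring_submit_unlock(ctx, !force_nonblock);
            __io_req_complete(req, ret, 0, cs);
        }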
    • io_uring: always let io_iopoll_complete() complete polled io · dad1b124
      Xiaoguang Wang authored
      Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring():
      [   95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0
      [   95.505921]
      [   95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G B W 5.10.0-rc5+ #1
      [   95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [   95.508248] Call Trace:
      [   95.508683]  dump_stack+0x107/0x163
      [   95.509323]  ? io_commit_cqring+0x3ec/0x8e0
      [   95.509982]  print_address_description.constprop.0+0x3e/0x60
      [   95.510814]  ? vprintk_func+0x98/0x140
      [   95.511399]  ? io_commit_cqring+0x3ec/0x8e0
      [   95.512036]  ? io_commit_cqring+0x3ec/0x8e0
      [   95.512733]  kasan_report_invalid_free+0x51/0x80
      [   95.513431]  ? io_commit_cqring+0x3ec/0x8e0
      [   95.514047]  __kasan_slab_free+0x141/0x160
      [   95.514699]  kfree+0xd1/0x390
      [   95.515182]  io_commit_cqring+0x3ec/0x8e0
      [   95.515799]  __io_req_complete.part.0+0x64/0x90
      [   95.516483]  io_wq_submit_work+0x1fa/0x260
      [   95.517117]  io_worker_handle_work+0xeac/0x1c00
      [   95.517828]  io_wqe_worker+0xc94/0x11a0
      [   95.518438]  ? io_worker_handle_work+0x1c00/0x1c00
      [   95.519151]  ? __kthread_parkme+0x11d/0x1d0
      [   95.519806]  ? io_worker_handle_work+0x1c00/0x1c00
      [   95.520512]  ? io_worker_handle_work+0x1c00/0x1c00
      [   95.521211]  kthread+0x396/0x470
      [   95.521727]  ? _raw_spin_unlock_irq+0x24/0x30
      [   95.522380]  ? kthread_mod_delayed_work+0x180/0x180
      [   95.523108]  ret_from_fork+0x22/0x30
      [   95.523684]
      [   95.523985] Allocated by task 4035:
      [   95.524543]  kasan_save_stack+0x1b/0x40
      [   95.525136]  __kasan_kmalloc.constprop.0+0xc2/0xd0
      [   95.525882]  kmem_cache_alloc_trace+0x17b/0x310
      [   95.533930]  io_queue_sqe+0x225/0xcb0
      [   95.534505]  io_submit_sqes+0x1768/0x25f0
      [   95.535164]  __x64_sys_io_uring_enter+0x89e/0xd10
      [   95.535900]  do_syscall_64+0x33/0x40
      [   95.536465]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   95.537199]
      [   95.537505] Freed by task 4035:
      [   95.538003]  kasan_save_stack+0x1b/0x40
      [   95.538599]  kasan_set_track+0x1c/0x30
      [   95.539177]  kasan_set_free_info+0x1b/0x30
      [   95.539798]  __kasan_slab_free+0x112/0x160
      [   95.540427]  kfree+0xd1/0x390
      [   95.540910]  io_commit_cqring+0x3ec/0x8e0
      [   95.541516]  io_iopoll_complete+0x914/0x1390
      [   95.542150]  io_do_iopoll+0x580/0x700
      [   95.542724]  io_iopoll_try_reap_events.part.0+0x108/0x200
      [   95.543512]  io_ring_ctx_wait_and_kill+0x118/0x340
      [   95.544206]  io_uring_release+0x43/0x50
      [   95.544791]  __fput+0x28d/0x940
      [   95.545291]  task_work_run+0xea/0x1b0
      [   95.545873]  do_exit+0xb6a/0x2c60
      [   95.546400]  do_group_exit+0x12a/0x320
      [   95.546967]  __x64_sys_exit_group+0x3f/0x50
      [   95.547605]  do_syscall_64+0x33/0x40
      [   95.548155]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is that once we get a non-EAGAIN error in io_wq_submit_work(),
      we complete the request by calling io_req_complete(), which holds
      completion_lock while calling io_commit_cqring(). But for polled io,
      io_iopoll_complete() does not hold completion_lock while calling
      io_commit_cqring(), so there may be concurrent access to
      ctx->defer_list and a double free may happen.
      
      To fix this bug, we always let io_iopoll_complete() complete polled io.
      
      Cc: <stable@vger.kernel.org> # 5.5+
      Reported-by: Abaci Fuzz <abaci@linux.alibaba.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
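      The shape of the fix in the io_wq_submit_work() error path (a sketch;
      helper names as in the 5.10-era code, not the verbatim diff):

        if (ret) {
            /*
             * io_iopoll_complete() does not hold completion_lock to
             * complete polled io, so for polled io just mark it done here
             * and still let io_iopoll_complete() complete it; completing
             * it directly would race on ctx->defer_list.
             */
            if (req->ctx->flags & IORING_SETUP_IOPOLL) {
                struct kiocb *kiocb = &req->rw.kiocb;

                kiocb_done(kiocb, ret, NULL);
            } else {
                req_set_fail_links(req);
                io_req_complete(req, ret);
            }
        }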
    • io_uring: add timeout update · 9c8e11b3
      Pavel Begunkov authored
      Support timeout updates through IORING_OP_TIMEOUT_REMOVE with the
      IORING_TIMEOUT_UPDATE flag passed in. Updates don't support the offset
      timeout mode; the original timeout.off will be ignored as well.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: remove now unused 'ret' variable]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
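      From userspace the feature is reached via IORING_OP_TIMEOUT_REMOVE
      with IORING_TIMEOUT_UPDATE set. A usage sketch, assuming a liburing
      version that provides the io_uring_prep_timeout_update() helper
      (update_timeout is an illustrative wrapper, not a library function):

        #include <liburing.h>

        /* Re-arm an existing timeout, identified by its user_data, with a
         * new 1-second relative value. */
        static void update_timeout(struct io_uring *ring, __u64 timeout_user_data)
        {
            struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            io_uring_prep_timeout_update(sqe, &ts, timeout_user_data, 0);
            io_uring_submit(ring);
        }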
    • io_uring: restructure io_timeout_cancel() · fbd15848
      Pavel Begunkov authored
      Add io_timeout_extract() helper, which searches and disarms timeouts,
      but doesn't complete them. No functional changes.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
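      The extracted helper, roughly (a sketch following the 5.10-era data
      structures; io_timeout_cancel() then just completes whatever was
      extracted):

        static struct io_kiocb *io_timeout_extract(struct io_ring_ctx *ctx,
                                                   __u64 user_data)
        {
            struct io_timeout_data *io;
            struct io_kiocb *req;
            int ret = -ENOENT;

            /* search ... */
            list_for_each_entry(req, &ctx->timeout_list, timeout.list) {
                if (user_data == req->user_data) {
                    ret = 0;
                    break;
                }
            }
            if (ret == -ENOENT)
                return ERR_PTR(ret);

            /* ... and disarm, but don't complete */
            io = req->async_data;
            if (hrtimer_try_to_cancel(&io->timer) == -1)
                return ERR_PTR(-EALREADY);
            list_del_init(&req->timeout.list);
            return req;
        }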
    • io_uring: fix files cancellation · bee749b1
      Pavel Begunkov authored
      io_uring_cancel_files()'s task check condition mistakenly got flipped.
      
      1. There can't be a request in the inflight list without
      IO_WQ_WORK_FILES, kill this check to keep the whole condition simpler.
      2. Also, don't call the function for files==NULL, so such a check
      isn't needed there; all that stuff is already handled well by its
      counterpart, __io_uring_cancel_task_requests().
      
      With that just flip the task check.
      
      Also, since it io-wq-cancels all requests of the current task there,
      don't forget to set the right ->files in struct io_task_cancel.
      
      Fixes: c1973b38bf639 ("io_uring: cancel only requests of current task")
      Reported-by: syzbot+c0d52d0b3c0c3ffb9525@syzkaller.appspotmail.com
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use bottom half safe lock for fixed file data · ac0648a5
      Jens Axboe authored
      io_file_data_ref_zero() can be invoked from soft-irq context by the
      RCU core, hence we need to ensure that the file_data lock is
      bottom-half safe. Use the _bh() variants when grabbing this lock.
      
      Reported-by: syzbot+1f4ba1e5520762c523c6@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
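      Sketch of the change (surrounding code simplified): the _bh variants
      disable bottom halves, so the soft-irq percpu_ref release can't
      deadlock against a task already holding the same lock.

        /* before: spin_lock(&file_data->lock); */
        spin_lock_bh(&file_data->lock);
        list_add(&ref_node->node, &file_data->ref_list);
        spin_unlock_bh(&file_data->lock);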
    • io_uring: fix miscounting ios_left · bd5bbda7
      Pavel Begunkov authored
      io_req_init() doesn't decrement state->ios_left if a request doesn't
      need ->file; it just returns earlier, from the if (!needs_file) check.
      That's not really a problem, but it may cause the overhead of an
      additional fput(). Also inline and kill io_req_set_file() as it's of
      no use anymore.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: change submit file state invariant · 6e1271e6
      Pavel Begunkov authored
      Keep the submit-state invariant of whether there are file refs left
      based on state->nr_refs instead of (state->file == NULL), and always
      check against the first one. It's easier to track and allows removing
      one if. It also automatically leaves struct submit_state in a
      consistent state after io_submit_state_end(); that's not used yet,
      but nice.
      
      Also, rename has_refs to file_refs for more clarity.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: check kthread stopped flag when sq thread is unparked · 65b2b213
      Xiaoguang Wang authored
      syzbot reports the following issue:
      INFO: task syz-executor.2:12399 can't die for more than 143 seconds.
      task:syz-executor.2  state:D stack:28744 pid:12399 ppid:  8504 flags:0x00004004
      Call Trace:
       context_switch kernel/sched/core.c:3773 [inline]
       __schedule+0x893/0x2170 kernel/sched/core.c:4522
       schedule+0xcf/0x270 kernel/sched/core.c:4600
       schedule_timeout+0x1d8/0x250 kernel/time/timer.c:1847
       do_wait_for_common kernel/sched/completion.c:85 [inline]
       __wait_for_common kernel/sched/completion.c:106 [inline]
       wait_for_common kernel/sched/completion.c:117 [inline]
       wait_for_completion+0x163/0x260 kernel/sched/completion.c:138
       kthread_stop+0x17a/0x720 kernel/kthread.c:596
       io_put_sq_data fs/io_uring.c:7193 [inline]
       io_sq_thread_stop+0x452/0x570 fs/io_uring.c:7290
       io_finish_async fs/io_uring.c:7297 [inline]
       io_sq_offload_create fs/io_uring.c:8015 [inline]
       io_uring_create fs/io_uring.c:9433 [inline]
       io_uring_setup+0x19b7/0x3730 fs/io_uring.c:9507
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45deb9
      Code: Unable to access opcode bytes at RIP 0x45de8f.
      RSP: 002b:00007f174e51ac78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 0000000000008640 RCX: 000000000045deb9
      RDX: 0000000000000000 RSI: 0000000020000140 RDI: 00000000000050e5
      RBP: 000000000118bf58 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c
      R13: 00007ffed9ca723f R14: 00007f174e51b9c0 R15: 000000000118bf2c
      INFO: task syz-executor.2:12399 blocked for more than 143 seconds.
            Not tainted 5.10.0-rc3-next-20201110-syzkaller #0
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      
      We don't have a reproducer yet, but it seems that there is a race in
      the current code:
      => io_put_sq_data
            ctx_list is empty now.       |
      ==> kthread_park(sqd->thread);     |
                                         | T1: sq thread is parked now.
      ==> kthread_stop(sqd->thread);     |
          KTHREAD_SHOULD_STOP is set now.|
      ===> kthread_unpark(k);            |
                                         | T2: sq thread is now unparked, runs again.
                                         |
                                         | T3: sq thread is now preempted out.
                                         |
      ===> wake_up_process(k);           |
                                         |
                                         | T4: Since sqd ctx_list is empty, needs_sched will be true,
                                         | then sq thread sets task state to TASK_INTERRUPTIBLE,
                                         | and schedules; now sq thread will never be woken up.
      ===> wait_for_completion           |
      
      I have artificially used mdelay() to simulate the above race and got
      the same stack as in this syzbot report, but to be honest, I'm not
      sure this race is what triggers the syzbot report.
      
      To fix this possible race, check whether the sq thread has been
      stopped when it is unparked.
      
      Reported-by: syzbot+03beeb595f074db9cfd1@syzkaller.appspotmail.com
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
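      The fix in the sq thread loop, sketched (names from the 5.10-era
      io_sq_thread(); not the verbatim diff):

        if (kthread_should_park()) {
            kthread_parkme();
            /*
             * The park may have come from io_put_sq_data(), in which case
             * KTHREAD_SHOULD_STOP is already set; bail out here instead of
             * going back to sleep and leaving the stopper stuck in
             * wait_for_completion() forever.
             */
            if (kthread_should_stop())
                break;
        }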
    • io_uring: share fixed_file_refs b/w multiple rsrcs · 36f72fe2
      Pavel Begunkov authored
      Double fixed files for splice/tee are done in a nasty way: it takes 2
      ref_node refs, and the second time it blindly overrides
      req->fixed_file_refs, hoping that it hasn't changed. That works
      because it is all done under uring_lock in a single go, but it is
      error-prone.
      
      Bind everything explicitly to a single ref_node and take only one ref;
      with the current ref_node ordering it's guaranteed to keep all files
      valid while the request is in flight.
      
      That's mainly a cleanup + preparation for generic resource handling,
      but also saves pcpu_ref get/put for splice/tee with 2 fixed files.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
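      The helper this introduces, roughly (a sketch of the 5.10-era
      version): bind the request to the current ref_node exactly once, so
      the second fixed file for splice/tee reuses the same ref.

        static inline void io_set_resource_node(struct io_kiocb *req)
        {
            struct io_ring_ctx *ctx = req->ctx;

            if (!req->fixed_file_refs) {
                req->fixed_file_refs = &ctx->file_data->node->refs;
                percpu_ref_get(req->fixed_file_refs);
            }
        }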
    • io_uring: replace inflight_wait with tctx->wait · c98de08c
      Pavel Begunkov authored
      As tasks now cancel only their own requests, and inflight_wait is
      awaited only in io_uring_cancel_files(), which should be called with
      ->in_idle set, use tctx->wait instead of keeping a separate
      inflight_wait.
      
      That will add some spurious wakeups but is actually safer from the
      point of view of not hanging the task.
      
      e.g.
      task1                   | IRQ
                              | *start* io_complete_rw_common(link)
                              |        link: req1 -> req2 -> req3(with files)
      *cancel_files()         |
      io_wq_cancel(), etc.    |
                              | put_req(link), adds to io-wq req2
      schedule()              |
      
      So, task1 will never try to cancel req2 or req3. If req2 is
      long-standing (e.g. read(empty_pipe)), this may hang.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't take fs for recvmsg/sendmsg · 10cad2c4
      Pavel Begunkov authored
      We don't allow anything but plain-data msg_control, which is rejected
      in __sys_{send,recv}msg_sock(). So there is no need for fs in
      IORING_OP_SENDMSG and IORING_OP_RECVMSG. fs->lock is now less
      contended, but there are still cases that take it, e.g. IOSQE_ASYNC.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: only wake up sq thread while current task is in io worker context · 2e9dbe90
      Xiaoguang Wang authored
      If IORING_SETUP_SQPOLL is enabled, sqes are handled either in sq
      thread task context or in io worker task context. If the current task
      context is the sq thread, we don't need to check whether to wake up
      the sq thread.
      
      io_iopoll_req_issued() calls wq_has_sleeper(), which contains an
      smp_mb() memory barrier. Before this patch, perf shows obvious
      overhead:
        Samples: 481K of event 'cycles', Event count (approx.): 299807382878
        Overhead  Comma  Shared Object     Symbol
           3.69%  :9630  [kernel.vmlinux]  [k] io_issue_sqe
      
      With this patch, perf shows:
        Samples: 482K of event 'cycles', Event count (approx.): 299929547283
        Overhead  Comma  Shared Object     Symbol
           0.70%  :4015  [kernel.vmlinux]  [k] io_issue_sqe
      
      It shows some obvious improvements.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
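      The gist of the change, sketched (the iopoll-list insertion part of
      the function is elided; in_async conveys whether an io worker issued
      the sqe):

        static void io_iopoll_req_issued(struct io_kiocb *req, bool in_async)
        {
            struct io_ring_ctx *ctx = req->ctx;

            /* ... add req to ctx->iopoll_list (elided) ... */

            /*
             * Only an io worker needs to wake the sq thread; when the sq
             * thread itself issued this sqe, skip wq_has_sleeper() and the
             * smp_mb() it implies.
             */
            if (in_async && (ctx->flags & IORING_SETUP_SQPOLL) &&
                wq_has_sleeper(&ctx->sq_data->wait))
                wake_up(&ctx->sq_data->wait);
        }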
    • io_uring: don't acquire uring_lock twice · 906a3c6f
      Xiaoguang Wang authored
      Both IOPOLL and sqe handling need to acquire uring_lock; combine them
      so that we only need to acquire uring_lock once.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
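      In outline (a sketch, heavily condensed): one critical section in the
      sq thread now covers both the iopoll pass and sqe submission.

        mutex_lock(&ctx->uring_lock);
        if (!list_empty(&ctx->iopoll_list))
            io_do_iopoll(ctx, &nr_events, 0);   /* reap polled completions */
        ret = io_submit_sqes(ctx, to_submit);   /* submit under same lock */
        mutex_unlock(&ctx->uring_lock);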
    • io_uring: initialize 'timeout' properly in io_sq_thread() · a0d9205f
      Xiaoguang Wang authored
      A static checker reports the warning below:
          fs/io_uring.c:6939 io_sq_thread()
          error: uninitialized symbol 'timeout'.
      
      This is a false positive, but let's just initialize 'timeout' to make
      sure we don't trip over this.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: refactor io_sq_thread() handling · 08369246
      Xiaoguang Wang authored
      There are some issues with the current io_sq_thread() implementation:
        1. The prepare_to_wait() usage in __io_sq_thread() is weird. If
      multiple ctxs share one poll thread, one ctx will put the poll thread
      in TASK_INTERRUPTIBLE, but if other ctxs have work to do, we don't
      need to change the task's state at all. We should only do that when
      none of the ctxs has work to do.
        2. We use a round-robin strategy to make multiple ctxs share one
      poll thread, but there are various conditions in __io_sq_thread(),
      which seems complicated and may affect the round-robin strategy.
      
      To improve the above issues, take the following actions (sketched
      after this entry):
        1. If multiple ctxs share one poll thread, call prepare_to_wait()
      and schedule() to put the poll thread to sleep only when all ctxs have
      no work to do.
        2. To make the round-robin strategy more straightforward, simplify
      __io_sq_thread() a bit: it just does io poll and sqe submission once,
      without checking various conditions.
        3. When multiple ctxs share one poll thread, choose the biggest
      sq_thread_idle among these ctxs as the timeout condition, and update
      it when a ctx is attached or detached.
        4. There is no need to check EBUSY specially: if io_submit_sqes()
      returns EBUSY, IORING_SQ_CQ_OVERFLOW should be set, and the helper in
      liburing should notice the cq overflow and enter the kernel to flush
      work.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
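      The reworked main loop, heavily condensed (a sketch, not the verbatim
      patch; return-value handling and some names are illustrative):

        while (!kthread_should_stop()) {
            bool cap_entries = !list_is_singular(&sqd->ctx_list);
            bool sqt_spin = false;

            /* one round-robin pass: iopoll + submit once per ctx */
            list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
                if (__io_sq_thread(ctx, cap_entries))
                    sqt_spin = true;
            }

            if (sqt_spin || !time_after(jiffies, timeout)) {
                io_run_task_work();
                cond_resched();
                if (sqt_spin)
                    timeout = jiffies + sqd->sq_thread_idle;
                continue;
            }

            /* no ctx had work within sq_thread_idle: prepare to sleep */
            prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
            /* ... set IORING_SQ_NEED_WAKEUP on each ctx, recheck for work,
             * then schedule() (elided) ... */
        }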
    • io_uring: always batch cancel in *cancel_files() · f6edbabb
      Pavel Begunkov authored
      Instead of iterating over each request and cancelling it individually
      in io_uring_cancel_files(), try to cancel all matching requests and
      use ->inflight_list only to check whether there is anything left.
      
      In many cases it should be faster, and we can reuse a lot of code from
      task cancellation.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
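      A sketch of the batched approach (io_wq_cancel_cb() with a cancel-all
      flag exists in this era; the match callback name is illustrative):

        static bool io_cancel_files_cb(struct io_wq_work *work, void *data)
        {
            struct io_kiocb *req = container_of(work, struct io_kiocb, work);

            /* only compared, never dereferenced */
            return req->work.identity->files == data;
        }

        /* cancel every matching request in one sweep, then use
         * ->inflight_list just to check whether anything is left */
        io_wq_cancel_cb(ctx->io_wq, io_cancel_files_cb, files, true);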
    • io_uring: pass files into kill timeouts/poll · 6b81928d
      Pavel Begunkov authored
      Make io_poll_remove_all() and io_kill_timeouts() match against files
      as well. A preparation patch, effectively not used yet.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't iterate io_uring_cancel_files() · b52fda00
      Pavel Begunkov authored
      io_uring_cancel_files() guarantees to cancel all matching requests, so
      it's not necessary to do that in a loop. Move it up the call chain
      into io_uring_cancel_task_requests().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>