1. 12 7月, 2020 1 次提交
  2. 10 7月, 2020 2 次提交
    • J
      io_uring: account user memory freed when exit has been queued · 309fc03a
      Jens Axboe 提交于
      We currently account the memory after the exit work has been run, but
      that leaves a gap where a process has closed its ring and until the
      memory has been accounted as freed. If the memlocked ulimit is
      borderline, then that can introduce spurious setup errors returning
      -ENOMEM because the free work hasn't been run yet.
      
      Account this as freed when we close the ring, as not to expose a tiny
      gap where setting up a new ring can fail.
      
      Fixes: 85faa7b8 ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      309fc03a
    • Y
      io_uring: fix memleak in io_sqe_files_register() · 667e57da
      Yang Yingliang 提交于
      I got a memleak report when doing some fuzz test:
      
      BUG: memory leak
      unreferenced object 0x607eeac06e78 (size 8):
        comm "test", pid 295, jiffies 4294735835 (age 31.745s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0
          [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0
          [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480
          [<00000000591b89a6>] do_syscall_64+0x56/0xa0
          [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Call percpu_ref_exit() on error path to avoid
      refcount memleak.
      
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Cc: stable@vger.kernel.org
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      667e57da
  3. 09 7月, 2020 2 次提交
    • Y
      io_uring: fix memleak in __io_sqe_files_update() · f3bd9dae
      Yang Yingliang 提交于
      I got a memleak report when doing some fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff888113e02300 (size 488):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`.......
      backtrace:
      [<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      BUG: memory leak
      unreferenced object 0xffff8881152dd5e0 (size 16):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 16 bytes):
      01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
      [<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline]
      [<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440
      [<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      If io_sqe_file_register() failed, we need put the file that get by fget()
      to avoid the memleak.
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Cc: stable@vger.kernel.org
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f3bd9dae
    • X
      io_uring: export cq overflow status to userspace · 6d5f9049
      Xiaoguang Wang 提交于
      For those applications which are not willing to use io_uring_enter()
      to reap and handle cqes, they may completely rely on liburing's
      io_uring_peek_cqe(), but if cq ring has overflowed, currently because
      io_uring_peek_cqe() is not aware of this overflow, it won't enter
      kernel to flush cqes, below test program can reveal this bug:
      
      static void test_cq_overflow(struct io_uring *ring)
      {
              struct io_uring_cqe *cqe;
              struct io_uring_sqe *sqe;
              int issued = 0;
              int ret = 0;
      
              do {
                      sqe = io_uring_get_sqe(ring);
                      if (!sqe) {
                              fprintf(stderr, "get sqe failed\n");
                              break;;
                      }
                      ret = io_uring_submit(ring);
                      if (ret <= 0) {
                              if (ret != -EBUSY)
                                      fprintf(stderr, "sqe submit failed: %d\n", ret);
                              break;
                      }
                      issued++;
              } while (ret > 0);
              assert(ret == -EBUSY);
      
              printf("issued requests: %d\n", issued);
      
              while (issued) {
                      ret = io_uring_peek_cqe(ring, &cqe);
                      if (ret) {
                              if (ret != -EAGAIN) {
                                      fprintf(stderr, "peek completion failed: %s\n",
                                              strerror(ret));
                                      break;
                              }
                              printf("left requets: %d\n", issued);
                              continue;
                      }
                      io_uring_cqe_seen(ring, cqe);
                      issued--;
                      printf("left requets: %d\n", issued);
              }
      }
      
      int main(int argc, char *argv[])
      {
              int ret;
              struct io_uring ring;
      
              ret = io_uring_queue_init(16, &ring, 0);
              if (ret) {
                      fprintf(stderr, "ring setup failed: %d\n", ret);
                      return 1;
              }
      
              test_cq_overflow(&ring);
              return 0;
      }
      
      To fix this issue, export cq overflow status to userspace by adding new
      IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as
      io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6d5f9049
  4. 05 7月, 2020 1 次提交
    • J
      io_uring: fix regression with always ignoring signals in io_cqring_wait() · b7db41c9
      Jens Axboe 提交于
      When switching to TWA_SIGNAL for task_work notifications, we also made
      any signal based condition in io_cqring_wait() return -ERESTARTSYS.
      This breaks applications that rely on using signals to abort someone
      waiting for events.
      
      Check if we have a signal pending because of queued task_work, and
      repeat the signal check once we've run the task_work. This provides a
      reliable way of telling the two apart.
      
      Additionally, only use TWA_SIGNAL if we are using an eventfd. If not,
      we don't have the dependency situation described in the original commit,
      and we can get by with just using TWA_RESUME like we previously did.
      
      Fixes: ce593a6c ("io_uring: use signal based task_work running")
      Cc: stable@vger.kernel.org # v5.7
      Reported-by: NAndres Freund <andres@anarazel.de>
      Tested-by: NAndres Freund <andres@anarazel.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b7db41c9
  5. 01 7月, 2020 1 次提交
    • J
      io_uring: use signal based task_work running · ce593a6c
      Jens Axboe 提交于
      Since 5.7, we've been using task_work to trigger async running of
      requests in the context of the original task. This generally works
      great, but there's a case where if the task is currently blocked
      in the kernel waiting on a condition to become true, it won't process
      task_work. Even though the task is woken, it just checks whatever
      condition it's waiting on, and goes back to sleep if it's still false.
      
      This is a problem if that very condition only becomes true when that
      task_work is run. An example of that is the task registering an eventfd
      with io_uring, and it's now blocked waiting on an eventfd read. That
      read could depend on a completion event, and that completion event
      won't get trigged until task_work has been run.
      
      Use the TWA_SIGNAL notification for task_work, so that we ensure that
      the task always runs the work when queued.
      
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ce593a6c
  6. 25 6月, 2020 2 次提交
    • P
      io_uring: fix current->mm NULL dereference on exit · d60b5fbc
      Pavel Begunkov 提交于
      Don't reissue requests from io_iopoll_reap_events(), the task may not
      have mm, which ends up with NULL. It's better to kill everything off on
      exit anyway.
      
      [  677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
      ...
      [  677.734679] Call Trace:
      [  677.734695]  ? __send_signal+0x1f2/0x420
      [  677.734698]  ? _raw_spin_unlock_irqrestore+0x24/0x40
      [  677.734699]  ? send_signal+0xf5/0x140
      [  677.734700]  io_iopoll_getevents+0x12f/0x1a0
      [  677.734702]  io_iopoll_reap_events.part.0+0x5e/0xa0
      [  677.734703]  io_ring_ctx_wait_and_kill+0x132/0x1c0
      [  677.734704]  io_uring_release+0x20/0x30
      [  677.734706]  __fput+0xcd/0x230
      [  677.734707]  ____fput+0xe/0x10
      [  677.734709]  task_work_run+0x67/0xa0
      [  677.734710]  do_exit+0x35d/0xb70
      [  677.734712]  do_group_exit+0x43/0xa0
      [  677.734713]  get_signal+0x140/0x900
      [  677.734715]  do_signal+0x37/0x780
      [  677.734717]  ? enqueue_hrtimer+0x41/0xb0
      [  677.734718]  ? recalibrate_cpu_khz+0x10/0x10
      [  677.734720]  ? ktime_get+0x3e/0xa0
      [  677.734721]  ? lapic_next_deadline+0x26/0x30
      [  677.734723]  ? tick_program_event+0x4d/0x90
      [  677.734724]  ? __hrtimer_get_next_event+0x4d/0x80
      [  677.734726]  __prepare_exit_to_usermode+0x126/0x1c0
      [  677.734741]  prepare_exit_to_usermode+0x9/0x40
      [  677.734742]  idtentry_exit_cond_rcu+0x4c/0x60
      [  677.734743]  sysvec_reschedule_ipi+0x92/0x160
      [  677.734744]  ? asm_sysvec_reschedule_ipi+0xa/0x20
      [  677.734745]  asm_sysvec_reschedule_ipi+0x12/0x20
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d60b5fbc
    • P
      io_uring: fix hanging iopoll in case of -EAGAIN · cd664b0e
      Pavel Begunkov 提交于
      io_do_iopoll() won't do anything with a request unless
      req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
      it, otherwise io_do_iopoll() will poll a file again and again even
      though the request of interest was completed long time ago.
      
      Also, remove -EAGAIN check from io_issue_sqe() as it races with
      the changed lines. The request will take the long way and be
      resubmitted from io_iopoll*().
      
      io_kiocb's result and iopoll_completed")
      
      Fixes: bbde017a ("io_uring: add memory barrier to synchronize
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      cd664b0e
  7. 24 6月, 2020 1 次提交
    • X
      io_uring: fix io_sq_thread no schedule when busy · b772f07a
      Xuan Zhuo 提交于
      When the user consumes and generates sqe at a fast rate,
      io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY,
      so that io_sq_thread will never call cond_resched or schedule, and then
      we will get the following system error prompt:
      
      rcu: INFO: rcu_sched self-detected stall on CPU
      or
      watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863]
      
      This patch checks whether need to call cond_resched() by checking
      the need_resched() function every cycle.
      Suggested-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b772f07a
  8. 18 6月, 2020 5 次提交
    • X
      io_uring: fix possible race condition against REQ_F_NEED_CLEANUP · 6f2cc166
      Xiaoguang Wang 提交于
      In io_read() or io_write(), when io request is submitted successfully,
      it'll go through the below sequence:
      
          kfree(iovec);
          req->flags &= ~REQ_F_NEED_CLEANUP;
          return ret;
      
      But clearing REQ_F_NEED_CLEANUP might be unsafe. The io request may
      already have been completed, and then io_complete_rw_iopoll()
      and io_complete_rw() will be called, both of which will also modify
      req->flags if needed. This causes a race condition, with concurrent
      non-atomic modification of req->flags.
      
      To eliminate this race, in io_read() or io_write(), if io request is
      submitted successfully, we don't remove REQ_F_NEED_CLEANUP flag. If
      REQ_F_NEED_CLEANUP is set, we'll leave __io_req_aux_free() to the
      iovec cleanup work correspondingly.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6f2cc166
    • J
      io_uring: reap poll completions while waiting for refs to drop on exit · 56952e91
      Jens Axboe 提交于
      If we're doing polled IO and end up having requests being submitted
      async, then completions can come in while we're waiting for refs to
      drop. We need to reap these manually, as nobody else will be looking
      for them.
      
      Break the wait into 1/20th of a second time waits, and check for done
      poll completions if we time out. Otherwise we can have done poll
      completions sitting in ctx->poll_list, which needs us to reap them but
      we're just waiting for them.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      56952e91
    • J
      io_uring: acquire 'mm' for task_work for SQPOLL · 9d8426a0
      Jens Axboe 提交于
      If we're unlucky with timing, we could be running task_work after
      having dropped the memory context in the sq thread. Since dropping
      the context requires a runnable task state, we cannot reliably drop
      it as part of our check-for-work loop in io_sq_thread(). Instead,
      abstract out the mm acquire for the sq thread into a helper, and call
      it from the async task work handler.
      
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9d8426a0
    • X
      io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed · bbde017a
      Xiaoguang Wang 提交于
      In io_complete_rw_iopoll(), stores to io_kiocb's result and iopoll
      completed are two independent store operations, to ensure that once
      iopoll_completed is ture and then req->result must been perceived by
      the cpu executing io_do_iopoll(), proper memory barrier should be used.
      
      And in io_do_iopoll(), we check whether req->result is EAGAIN, if it is,
      we'll need to issue this io request using io-wq again. In order to just
      issue a single smp_rmb() on the completion side, move the re-submit work
      to io_iopoll_complete().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      [axboe: don't set ->iopoll_completed for -EAGAIN retry]
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bbde017a
    • X
      io_uring: don't fail links for EAGAIN error in IOPOLL mode · 2d7d6792
      Xiaoguang Wang 提交于
      In IOPOLL mode, for EAGAIN error, we'll try to submit io request
      again using io-wq, so don't fail rest of links if this io request
      has links.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2d7d6792
  9. 15 6月, 2020 6 次提交
  10. 11 6月, 2020 7 次提交
  11. 10 6月, 2020 2 次提交
  12. 09 6月, 2020 4 次提交
  13. 08 6月, 2020 2 次提交
  14. 05 6月, 2020 4 次提交