1. July 25, 2020 (4 commits)
    • io_uring: fix sq array offset calculation · b36200f5
      Authored by Dmitry Vyukov
      rings_size() sets sq_offset to the total size of the rings (the returned
      value which is used for memory allocation). This is wrong: sq array should
      be located within the rings, not after them. Set sq_offset to where it
      should be.
      
      Fixes: 75b28aff ("io_uring: allocate the two rings together")
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Hristo Venev <hristo@venev.name>
      Cc: io-uring@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'io_uring-5.8' into for-5.9/io_uring · 760618f7
      Authored by Jens Axboe
      Merge in io_uring-5.8 fixes, as changes/cleanups to how we do locked
      mem accounting require a fixup, and only one of the spots is caught
      by git, as the other merges cleanly. The flags fix from io_uring-5.8
      also causes a merge conflict, as do the leak fix for recvmsg, the
      double poll fix, and the link failure locking fix.
      
      * io_uring-5.8:
        io_uring: fix lockup in io_fail_links()
        io_uring: fix ->work corruption with poll_add
        io_uring: missed req_init_async() for IOSQE_ASYNC
        io_uring: always allow drain/link/hardlink/async sqe flags
        io_uring: ensure double poll additions work with both request types
        io_uring: fix recvmsg memory leak with buffer selection
        io_uring: fix not initialised work->flags
        io_uring: fix missing msg_name assignment
        io_uring: account user memory freed when exit has been queued
        io_uring: fix memleak in io_sqe_files_register()
        io_uring: fix memleak in __io_sqe_files_update()
        io_uring: export cq overflow status to userspace
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix lockup in io_fail_links() · 4ae6dbd6
      Authored by Pavel Begunkov
      io_fail_links() doesn't consider REQ_F_COMP_LOCKED, leading to a
      nested spin_lock(completion_lock) and a lockup.
      
      [  197.680409] rcu: INFO: rcu_preempt detected expedited stalls on
      	CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/.
      [  197.680411] rcu: blocking rcu_node structures:
      [  197.680412] Task dump for CPU 6:
      [  197.680413] link-timeout    R  running task        0  1669
      	1 0x8000008a
      [  197.680414] Call Trace:
      [  197.680420]  ? io_req_find_next+0xa0/0x200
      [  197.680422]  ? io_put_req_find_next+0x2a/0x50
      [  197.680423]  ? io_poll_task_func+0xcf/0x140
      [  197.680425]  ? task_work_run+0x67/0xa0
      [  197.680426]  ? do_exit+0x35d/0xb70
      [  197.680429]  ? syscall_trace_enter+0x187/0x2c0
      [  197.680430]  ? do_group_exit+0x43/0xa0
      [  197.680448]  ? __x64_sys_exit_group+0x18/0x20
      [  197.680450]  ? do_syscall_64+0x52/0xa0
      [  197.680452]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix ->work corruption with poll_add · d5e16d8e
      Authored by Pavel Begunkov
      req->work might already be initialised by the time the request gets
      into __io_arm_poll_handler(), which will corrupt it by using fields
      that are in a union with req->work. Luckily, the only side effect is
      a missed put_creds(). Clear req->work before going there.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. July 24, 2020 (1 commit)
  3. July 19, 2020 (1 commit)
  4. July 18, 2020 (1 commit)
    • io_uring: ensure double poll additions work with both request types · 807abcb0
      Authored by Jens Axboe
      The double poll additions were centered around doing POLL_ADD on file
      descriptors that use more than one waitqueue (typically one for read,
      one for write) when being polled. However, it can also end up being
      triggered when we use poll-driven retry. For that case, we cannot
      safely use req->io, as that could be used by the request type itself.
      
      Add a second io_poll_iocb pointer in the structure we allocate for poll
      based retry, and ensure we use the right one from the two paths.
      
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. July 16, 2020 (1 commit)
  6. July 12, 2020 (2 commits)
  7. July 10, 2020 (3 commits)
    • io_uring: account user memory freed when exit has been queued · 309fc03a
      Authored by Jens Axboe
      We currently account the memory after the exit work has been run, but
      that leaves a gap between a process closing its ring and the memory
      being accounted as freed. If the memlocked ulimit is borderline, that
      can cause spurious setup errors returning -ENOMEM because the free
      work hasn't been run yet.
      
      Account the memory as freed when we close the ring, so as not to
      expose a window in which setting up a new ring can fail.
      
      Fixes: 85faa7b8 ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix memleak in io_sqe_files_register() · 667e57da
      Authored by Yang Yingliang
      I got a memleak report when doing some fuzz testing:
      
      BUG: memory leak
      unreferenced object 0x607eeac06e78 (size 8):
        comm "test", pid 295, jiffies 4294735835 (age 31.745s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0
          [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0
          [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480
          [<00000000591b89a6>] do_syscall_64+0x56/0xa0
          [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Call percpu_ref_exit() on the error path to avoid leaking the
      percpu refcount.
      
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: remove dead 'ctx' argument and move forward declaration · 4349f30e
      Authored by Jens Axboe
      We don't use 'ctx' at all in io_sq_thread_drop_mm(); it just works
      on the mm of the current task. Drop the argument.
      
      Move io_file_put_work() to where we have the other forward declarations
      of functions.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. July 9, 2020 (5 commits)
    • io_uring: get rid of __req_need_defer() · 2bc9930e
      Authored by Jens Axboe
      We have just one caller of this, req_need_defer(); inline the code
      there instead.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix memleak in __io_sqe_files_update() · f3bd9dae
      Authored by Yang Yingliang
      I got a memleak report when doing some fuzz testing:
      
      BUG: memory leak
      unreferenced object 0xffff888113e02300 (size 488):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`.......
      backtrace:
      [<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      BUG: memory leak
      unreferenced object 0xffff8881152dd5e0 (size 16):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 16 bytes):
      01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
      [<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline]
      [<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440
      [<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      If io_sqe_file_register() fails, we need to put the file obtained
      via fget() to avoid the memleak.
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: export cq overflow status to userspace · 6d5f9049
      Authored by Xiaoguang Wang
      Applications that are not willing to use io_uring_enter() to reap
      and handle cqes may rely entirely on liburing's io_uring_peek_cqe().
      But if the cq ring has overflowed, io_uring_peek_cqe() is currently
      not aware of the overflow, so it won't enter the kernel to flush
      cqes. The test program below reveals this bug:
      
      static void test_cq_overflow(struct io_uring *ring)
      {
              struct io_uring_cqe *cqe;
              struct io_uring_sqe *sqe;
              int issued = 0;
              int ret = 0;
      
              do {
                      sqe = io_uring_get_sqe(ring);
                      if (!sqe) {
                              fprintf(stderr, "get sqe failed\n");
                              break;
                      }
                      ret = io_uring_submit(ring);
                      if (ret <= 0) {
                              if (ret != -EBUSY)
                                      fprintf(stderr, "sqe submit failed: %d\n", ret);
                              break;
                      }
                      issued++;
              } while (ret > 0);
              assert(ret == -EBUSY);
      
              printf("issued requests: %d\n", issued);
      
              while (issued) {
                      ret = io_uring_peek_cqe(ring, &cqe);
                      if (ret) {
                              if (ret != -EAGAIN) {
                                      fprintf(stderr, "peek completion failed: %s\n",
                                              strerror(-ret));
                                      break;
                              }
                              printf("left requests: %d\n", issued);
                              continue;
                      }
                      io_uring_cqe_seen(ring, cqe);
                      issued--;
                      printf("left requests: %d\n", issued);
              }
      }
      
      int main(int argc, char *argv[])
      {
              int ret;
              struct io_uring ring;
      
              ret = io_uring_queue_init(16, &ring, 0);
              if (ret) {
                      fprintf(stderr, "ring setup failed: %d\n", ret);
                      return 1;
              }
      
              test_cq_overflow(&ring);
              return 0;
      }
      
      To fix this issue, export the cq overflow status to userspace by
      adding a new IORING_SQ_CQ_OVERFLOW flag; helper functions in
      liburing, such as io_uring_peek_cqe(), can then notice the overflow
      and flush accordingly.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: only call kfree() for a non-zero pointer · 5acbbc8e
      Authored by Jens Axboe
      It's safe to call kfree() with a NULL pointer, but it's also pointless.
      Most of the time we don't have any data to free, and at millions of
      requests per second, the redundant function call adds noticeable
      overhead (about 1.3% of the runtime).
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix a use after free in io_async_task_func() · aa340845
      Authored by Dan Carpenter
      The "apoll" variable is freed and then used on the next line.  We need
      to move the free down a few lines.
      
      Fixes: 0be0b0e3 ("io_uring: simplify io_async_task_func()")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. July 8, 2020 (3 commits)
  10. July 6, 2020 (9 commits)
  11. July 5, 2020 (1 commit)
    • io_uring: fix regression with always ignoring signals in io_cqring_wait() · b7db41c9
      Authored by Jens Axboe
      When switching to TWA_SIGNAL for task_work notifications, we also made
      any signal based condition in io_cqring_wait() return -ERESTARTSYS.
      This breaks applications that rely on using signals to abort someone
      waiting for events.
      
      Check if we have a signal pending because of queued task_work, and
      repeat the signal check once we've run the task_work. This provides a
      reliable way of telling the two apart.
      
      Additionally, only use TWA_SIGNAL if we are using an eventfd. If not,
      we don't have the dependency situation described in the original commit,
      and we can get by with just using TWA_RESUME like we previously did.
      
      Fixes: ce593a6c ("io_uring: use signal based task_work running")
      Cc: stable@vger.kernel.org # v5.7
      Reported-by: Andres Freund <andres@anarazel.de>
      Tested-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. July 1, 2020 (2 commits)
    • io_uring: use signal based task_work running · ce593a6c
      Authored by Jens Axboe
      Since 5.7, we've been using task_work to trigger async running of
      requests in the context of the original task. This generally works
      great, but there's a case where if the task is currently blocked
      in the kernel waiting on a condition to become true, it won't process
      task_work. Even though the task is woken, it just checks whatever
      condition it's waiting on, and goes back to sleep if it's still false.
      
      This is a problem if that very condition only becomes true when that
      task_work is run. An example of that is the task registering an eventfd
      with io_uring, and it's now blocked waiting on an eventfd read. That
      read could depend on a completion event, and that completion event
      won't get triggered until the task_work has been run.
      
      Use the TWA_SIGNAL notification for task_work, so that we ensure that
      the task always runs the work when queued.
      
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • task_work: teach task_work_add() to do signal_wake_up() · e91b4816
      Authored by Oleg Nesterov
      So that the target task will exit the wait_event_interruptible-like
      loop and call task_work_run() asap.
      
      The patch turns the "bool notify" argument into a
      0/TWA_RESUME/TWA_SIGNAL enum; the new TWA_SIGNAL flag implies
      signal_wake_up(). However, it needs to avoid racing with
      recalc_sigpending(), so the patch also adds a new JOBCTL_TASK_WORK
      bit, included in JOBCTL_PENDING_MASK.
      
      TODO: once this patch is merged we need to change all current users
      of task_work_add(notify = true) to use TWA_RESUME.
      
      Cc: stable@vger.kernel.org # v5.7
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. June 30, 2020 (7 commits)