1. 05 Jun 2020: 2 commits
    • io_uring: do build_open_how() only once · 25e72d10
      By Pavel Begunkov
      build_open_how() merely adjusts open_flags/mode, so do it once during
      prep rather than storing the raw values for later; a sketch of the
      pattern follows this entry.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
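      A userspace-compilable sketch of the pattern, with stand-in types and
      names (the kernel's real build_open_how() lives in fs/open.c):

      #include <fcntl.h>
      #include <sys/types.h>

      struct open_how_sk { unsigned long long flags, mode; };
      struct open_req    { const char *filename; struct open_how_sk how; };

      static struct open_how_sk sk_build_open_how(int flags, mode_t mode)
      {
              struct open_how_sk how = {
                      .flags = (unsigned int)flags,
                      /* mode only matters when a file may be created
                       * (the kernel also counts O_TMPFILE here) */
                      .mode  = (flags & O_CREAT) ? (mode & 07777) : 0,
              };
              return how;
      }

      /* prep runs once per request: resolve flags/mode here and keep the
       * result, so the issue path never rebuilds it from raw sqe values */
      static void openat_prep(struct open_req *req, int flags, mode_t mode)
      {
              req->how = sk_build_open_how(flags, mode);
      }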
    • io_uring: fix {SQ,IO}POLL with unsupported opcodes · 3232dd02
      By Pavel Begunkov
      IORING_SETUP_IOPOLL is defined only for read/write; other opcodes should
      be disallowed, otherwise we get an oops like the one below. Also refuse
      open/close with SQPOLL, as the polling thread wouldn't know which file
      table to use. A sketch of the guard follows this entry.
      
      RIP: 0010:io_iopoll_getevents+0x111/0x5a0
      Call Trace:
       ? _raw_spin_unlock_irqrestore+0x24/0x40
       ? do_send_sig_info+0x64/0x90
       io_iopoll_reap_events.part.0+0x5e/0xa0
       io_ring_ctx_wait_and_kill+0x132/0x1c0
       io_uring_release+0x20/0x30
       __fput+0xcd/0x230
       ____fput+0xe/0x10
       task_work_run+0x67/0xa0
       do_exit+0x353/0xb10
       ? handle_mm_fault+0xd4/0x200
       ? syscall_trace_enter+0x18c/0x2c0
       do_group_exit+0x43/0xa0
       __x64_sys_exit_group+0x18/0x20
       do_syscall_64+0x60/0x1e0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      [axboe: allow provide/remove buffers and files update]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
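      A hedged sketch of the guard, compilable on its own; only the two flag
      values match the uapi header, everything else is a stand-in:

      #include <errno.h>

      #define IORING_SETUP_IOPOLL     (1U << 0)
      #define IORING_SETUP_SQPOLL     (1U << 1)

      struct ringctx { unsigned int flags; };

      /* non-read/write opcodes can't complete via IOPOLL: reject early
       * with -EINVAL instead of oopsing in io_iopoll_getevents() later */
      static int generic_prep_check(const struct ringctx *ctx)
      {
              if (ctx->flags & IORING_SETUP_IOPOLL)
                      return -EINVAL;
              return 0;
      }

      /* open/close additionally can't run from the SQPOLL thread, which
       * has no way to know which file table to use */
      static int openclose_prep_check(const struct ringctx *ctx)
      {
              if (ctx->flags & (IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL))
                      return -EINVAL;
              return 0;
      }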
  2. 03 6月, 2020 1 次提交
    • io_uring: disallow close of ring itself · fd2206e4
      By Jens Axboe
      A previous commit enabled this functionality, which also enabled O_PATH
      to work correctly with io_uring. But we can't safely close the ring
      itself, as the file handle isn't reference counted inside
      io_uring_enter(). Instead of jumping through hoops to enable ring
      closure, add a "soft" ->needs_file option, ->needs_file_no_error. This
      enables O_PATH file descriptors to work, but still catches the case of
      trying to close the ring itself; a loose model of the flag's semantics
      follows this entry.
      Reported-by: Jann Horn <jannh@google.com>
      Fixes: 904fbcb1 ("io_uring: remove 'fd is io_uring' from close path")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
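      A loose, hedged model of the flag's semantics as described above; all
      names are stand-ins, not the kernel's code:

      #include <errno.h>
      #include <stdbool.h>
      #include <stddef.h>

      struct file_sk;                 /* opaque stand-in */
      static struct file_sk *lookup_fd(int fd) { (void)fd; return NULL; }

      struct op_def {
              bool needs_file;
              bool needs_file_no_error;  /* "soft": tolerate lookup failure */
      };

      static int assign_file(const struct op_def *def, int fd,
                             struct file_sk **out)
      {
              *out = NULL;
              if (!def->needs_file)
                      return 0;
              *out = lookup_fd(fd);
              if (!*out && !def->needs_file_no_error)
                      return -EBADF;  /* hard requirement unmet */
              /* soft case: the opcode copes with the missing file itself,
               * so close() aimed at the ring fd fails cleanly instead of
               * being grabbed here */
              return 0;
      }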
  3. 30 May 2020: 3 commits
  4. 27 May 2020: 7 commits
  5. 20 May 2020: 4 commits
    • io_uring: don't submit sqes when ctx->refs is dying · 6b668c9b
      By Xiaoguang Wang
      When IORING_SETUP_SQPOLL is enabled, io_ring_ctx_wait_and_kill() waits
      for the sq thread to idle with a busy loop:
      
          while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
              cond_resched();
      
      The above loop isn't very CPU friendly; it may introduce a short CPU
      burst on the current CPU.
      
      If ctx->refs is dying, forbid the sq_thread from submitting any further
      sqes; they just get discarded when we exit. A sketch of the check
      follows this entry.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
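      A hedged, standalone sketch of the check; percpu_ref_is_dying() is the
      real kernel helper being modeled, the rest is illustrative:

      #include <stdbool.h>

      struct pref  { bool dying; };
      struct sqctx { struct pref refs; unsigned int to_submit; };

      /* stand-in for percpu_ref_is_dying(&ctx->refs) */
      static bool ref_is_dying(const struct pref *r) { return r->dying; }
      static int submit_sqes(struct sqctx *ctx, unsigned int n)
      {
              (void)ctx;
              return (int)n;
      }

      static int sq_thread_tick(struct sqctx *ctx)
      {
              int ret = 0;

              /* a dying ring submits nothing; pending sqes are simply
               * discarded at exit */
              if (!ref_is_dying(&ctx->refs))
                      ret = submit_sqes(ctx, ctx->to_submit);
              return ret;
      }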
    • io_uring: reset -EBUSY error when io sq thread is waken up · d4ae271d
      By Xiaoguang Wang
      In io_sq_thread(), currently if we get an -EBUSY error and go to sleep,
      we never clear it, so io_sq_thread() never gets another chance to
      submit sqes. The test program test.c below reveals the bug:
      
      #define _GNU_SOURCE             /* for O_DIRECT */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <liburing.h>
      
      int main(int argc, char *argv[])
      {
              struct io_uring ring;
              int i, fd, ret;
              struct io_uring_sqe *sqe;
              struct io_uring_cqe *cqe;
              struct iovec *iovecs;
              void *buf;
              struct io_uring_params p;
      
              if (argc < 2) {
                      printf("%s: file\n", argv[0]);
                      return 1;
              }
      
              memset(&p, 0, sizeof(p));
              p.flags = IORING_SETUP_SQPOLL;
              ret = io_uring_queue_init_params(4, &ring, &p);
              if (ret < 0) {
                      fprintf(stderr, "queue_init: %s\n", strerror(-ret));
                      return 1;
              }
      
              fd = open(argv[1], O_RDONLY | O_DIRECT);
              if (fd < 0) {
                      perror("open");
                      return 1;
              }
      
              iovecs = calloc(10, sizeof(struct iovec));
              for (i = 0; i < 10; i++) {
                      if (posix_memalign(&buf, 4096, 4096))
                              return 1;
                      iovecs[i].iov_base = buf;
                      iovecs[i].iov_len = 4096;
              }
      
              ret = io_uring_register_files(&ring, &fd, 1);
              if (ret < 0) {
                      fprintf(stderr, "%s: register %d\n", __FUNCTION__, ret);
                      return ret;
              }
      
              for (i = 0; i < 10; i++) {
                      sqe = io_uring_get_sqe(&ring);
                      if (!sqe)
                              break;
      
                      io_uring_prep_readv(sqe, 0, &iovecs[i], 1, 0);
                      sqe->flags |= IOSQE_FIXED_FILE;
      
                      ret = io_uring_submit(&ring);
                      sleep(1);
                      printf("submit %d\n", i);
              }
      
              for (i = 0; i < 10; i++) {
                      io_uring_wait_cqe(&ring, &cqe);
                      printf("receive: %d\n", i);
                      if (cqe->res != 4096) {
                              fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res);
                              ret = 1;
                      }
                      io_uring_cqe_seen(&ring, cqe);
              }
      
              close(fd);
              io_uring_queue_exit(&ring);
              return 0;
      }
          sudo ./test testfile
      
      The above command hangs on the tenth request. To fix the bug, reset the
      variable 'ret' to zero when the io sq_thread is woken up; the sketch
      after this entry models the fix.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
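      A standalone model of the bug and the one-line fix; all names are
      illustrative:

      #include <errno.h>
      #include <stdio.h>

      static int try_submit(int round)
      {
              return round == 0 ? -EBUSY : 4096;  /* busy once, then progress */
      }

      int main(void)
      {
              int ret = 0;

              for (int round = 0; round < 3; round++) {
                      if (ret != -EBUSY)
                              ret = try_submit(round);
                      if (ret == -EBUSY) {
                              /* ...sleep until woken (CQ space freed)... */
                              ret = 0;  /* the fix: drop the stale -EBUSY;
                                           without this line, try_submit()
                                           is never called again */
                      }
                      printf("round %d: ret=%d\n", round, ret);
              }
              return 0;
      }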
    • io_uring: don't add non-IO requests to iopoll pending list · b532576e
      By Jens Axboe
      We normally disable any commands that aren't specifically poll commands
      for a ring that is setup for polling, but we do allow buffer provide and
      remove commands to support buffer selection for polled IO. Once a
      request is issued, we add it to the poll list to poll for completion. But
      we should not do that for non-IO commands, as those requests complete
      inline immediately and aren't pollable. If we do, we can leave requests
      on the iopoll list after they are freed; see the sketch below.
      
      Fixes: ddf0322d ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
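      A hedged sketch of the rule, with stand-in types:

      #include <stdbool.h>

      enum op_sk { OP_READ, OP_WRITE, OP_PROVIDE_BUFFERS };
      struct req_io { enum op_sk opcode; };

      static bool is_pollable_io(const struct req_io *req)
      {
              return req->opcode == OP_READ || req->opcode == OP_WRITE;
      }

      static void add_to_iopoll_list(struct req_io *req) { (void)req; }

      static void issue_complete(struct req_io *req)
      {
              if (is_pollable_io(req))
                      add_to_iopoll_list(req);  /* completion comes via polling */
              /* non-IO commands completed inline already; queuing them would
               * leave a soon-to-be-freed request on the iopoll list */
      }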
    • io_uring: don't use kiocb.private to store buf_index · 4f4eeba8
      By Bijan Mottahedeh
      kiocb.private is used by iomap_dio_rw(), so store buf_index separately;
      a sketch of the layout idea follows this entry.
      Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
      
      Move 'buf_index' to a hole in io_kiocb.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
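      A hedged sketch of the layout idea, with stand-in structs (the real
      io_kiocb differs):

      #include <stdint.h>

      struct kiocb_sk {
              void *private;          /* owned by iomap_dio_rw() during direct I/O */
      };

      struct io_req_sk {
              struct kiocb_sk rw;
              uint8_t  opcode;
              uint8_t  req_flags;
              uint16_t buf_index;     /* packs beside the byte-sized fields,
                                         reusing existing padding instead of
                                         growing the struct or borrowing
                                         ->private */
      };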
  6. 19 May 2020: 1 commit
    • io_uring: cancel work if task_work_add() fails · e3aabf95
      By Jens Axboe
      When task_work_add() fails, we currently move the work to the
      io_wqe_manager for execution, but we cannot safely do so, as we may lack
      some of the state needed to execute it out of context. Since we cancel
      the work anyway when the ring/task exits, just mark this request as
      canceled and io_async_task_func() will do the right thing; a small
      model of the fallback follows this entry.
      
      Fixes: aa96bf8a ("io_uring: use io-wq manager as backup task if task is exiting")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
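      A small standalone model of the fallback; all names are stand-ins:

      #include <errno.h>
      #include <stdbool.h>

      struct req_tw { bool canceled; };

      /* stand-in: fails the way task_work_add() does for an exiting task */
      static int add_task_work(struct req_tw *req) { (void)req; return -ESRCH; }
      static void run_async_func(struct req_tw *req) { (void)req; /* honors ->canceled */ }

      static void queue_poll_completion(struct req_tw *req)
      {
              if (add_task_work(req) != 0) {
                      /* can't run this outside the task's context: mark it
                       * canceled and let the async path clean up */
                      req->canceled = true;
                      run_async_func(req);
              }
      }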
  7. 18 May 2020: 7 commits
  8. 17 May 2020: 3 commits
    • io_uring: fix FORCE_ASYNC req preparation · bd2ab18a
      By Pavel Begunkov
      As with other non-inlined requests, allocate req->io for FORCE_ASYNC
      requests so they can be prepared properly; see the sketch below.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
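      A hedged sketch, with stand-in types and a hypothetical REQ_FORCE_ASYNC
      flag name:

      #include <stdlib.h>

      struct prep_ctx { int prepared; };
      struct req_fa   { struct prep_ctx *io; unsigned int flags; };
      #define REQ_FORCE_ASYNC 0x1u

      /* a forced-async request is never issued inline, so give it the
       * same prep context other punted requests get before preparing */
      static int ensure_async_ctx(struct req_fa *req)
      {
              if (!(req->flags & REQ_FORCE_ASYNC) || req->io)
                      return 0;
              req->io = calloc(1, sizeof(*req->io));
              return req->io ? 0 : -1;
      }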
    • io_uring: don't prepare DRAIN reqs twice · 650b5481
      By Pavel Begunkov
      If req->io is not NULL, the request has already been prepared; don't
      prepare it again, that's dangerous. A sketch follows this entry.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
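      A minimal sketch of the prepare-once rule, with stand-in types:

      struct prep_ctx2;
      struct req_dr { struct prep_ctx2 *io; };

      static int do_prep(struct req_dr *req) { (void)req; return 0; }  /* stub */

      static int drain_defer_prep(struct req_dr *req)
      {
              if (req->io)    /* non-NULL means prep already ran; running it
                                 again can clobber state the first pass set up */
                      return 0;
              return do_prep(req);
      }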
    • io_uring: initialize ctx->sqo_wait earlier · 583863ed
      By Jens Axboe
      Ensure that ctx->sqo_wait is initialized as soon as the ctx is allocated,
      instead of deferring it to the offload setup; a sketch of the ordering
      follows this entry. This fixes a syzbot-reported lockdep complaint,
      which is really due to trying to wake_up an uninitialized wait queue:
      
      RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441319
      RDX: 0000000000000001 RSI: 0000000020000140 RDI: 000000000000047b
      RBP: 0000000000010475 R08: 0000000000000001 R09: 00000000004002c8
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000402260
      R13: 00000000004022f0 R14: 0000000000000000 R15: 0000000000000000
      INFO: trying to register non-static key.
      the code is fine but needs lockdep annotation.
      turning off the locking correctness validator.
      CPU: 1 PID: 7090 Comm: syz-executor222 Not tainted 5.7.0-rc1-next-20200415-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x188/0x20d lib/dump_stack.c:118
       assign_lock_key kernel/locking/lockdep.c:913 [inline]
       register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225
       __lock_acquire+0x104/0x4c50 kernel/locking/lockdep.c:4234
       lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4934
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
       _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159
       __wake_up_common_lock+0xb4/0x130 kernel/sched/wait.c:122
       io_cqring_ev_posted+0xa5/0x1e0 fs/io_uring.c:1160
       io_poll_remove_all fs/io_uring.c:4357 [inline]
       io_ring_ctx_wait_and_kill+0x2bc/0x5a0 fs/io_uring.c:7305
       io_uring_create fs/io_uring.c:7843 [inline]
       io_uring_setup+0x115e/0x22b0 fs/io_uring.c:7870
       do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
       entry_SYSCALL_64_after_hwframe+0x49/0xb3
      RIP: 0033:0x441319
      Code: e8 5c ae 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb1fb9aa8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
      
      Reported-by: syzbot+8c91f5d054e998721c57@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
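      A userspace analogue of the ordering fix, using a pthread condition
      variable as a stand-in for the kernel waitqueue:

      #include <stdlib.h>
      #include <pthread.h>

      struct ctx_sk {
              pthread_mutex_t lock;
              pthread_cond_t  sqo_wait;   /* stand-in for ctx->sqo_wait */
      };

      static struct ctx_sk *ctx_alloc(void)
      {
              struct ctx_sk *ctx = calloc(1, sizeof(*ctx));

              if (!ctx)
                      return NULL;
              /* initialize here, at allocation time: any later teardown
               * path may signal sqo_wait, even when the sq thread (the
               * "offload setup") was never started */
              pthread_mutex_init(&ctx->lock, NULL);
              pthread_cond_init(&ctx->sqo_wait, NULL);
              return ctx;
      }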
  9. 16 May 2020: 5 commits
    • io_uring: file registration list and lock optimization · 6a4d07cd
      By Jens Axboe
      There's no point in using list_del_init() on entries that are going
      away, and the associated lock is only ever taken in process context, so
      there's no need for the IRQ disabling+saving variant of the spinlock;
      the list half is illustrated below.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
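      A standalone illustration of the list half, mirroring the kernel's
      list.h semantics:

      struct list_head { struct list_head *next, *prev; };

      static void list_del(struct list_head *e)
      {
              e->next->prev = e->prev;
              e->prev->next = e->next;
              /* the kernel variant also poisons the pointers; elided */
      }

      static void list_del_init(struct list_head *e)
      {
              list_del(e);
              e->next = e->prev = e;  /* extra stores, useful only if the
                                         entry will be tested or reused */
      }

      The lock half is analogous: a lock that is only ever taken in process
      context can use plain spin_lock()/spin_unlock() and skip saving and
      restoring IRQ state.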
    • io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flags · 7e55a19c
      By Stefano Garzarella
      This new flag should be set/cleared by the application to
      disable/enable eventfd notifications when a request is completed
      and queued to the CQ ring.
      
      Before this patch, notifications were always sent if an eventfd was
      registered, so IORING_CQ_EVENTFD_DISABLED is not set during
      initialization.
      
      It is up to the application to set the flag after initialization
      if no notifications are required at the beginning; a userspace
      sketch follows this entry.
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
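      A userspace sketch, assuming a liburing recent enough to ship
      io_uring_cq_eventfd_toggle(), which flips the IORING_CQ_EVENTFD_DISABLED
      bit in the mapped CQ ring flags:

      #include <stdbool.h>
      #include <sys/eventfd.h>
      #include <liburing.h>

      int quiet_burst(struct io_uring *ring)
      {
              int efd = eventfd(0, 0);
              int ret;

              if (efd < 0)
                      return -1;
              ret = io_uring_register_eventfd(ring, efd);
              if (ret < 0)
                      return ret;

              io_uring_cq_eventfd_toggle(ring, false);  /* sets the DISABLED bit */
              /* ... submit and reap a batch without eventfd wakeups ... */
              io_uring_cq_eventfd_toggle(ring, true);   /* notifications back on */
              return 0;
      }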
    • io_uring: add 'cq_flags' field for the CQ ring · 0d9b5b3a
      By Stefano Garzarella
      This patch adds the new 'cq_flags' field, written by the application
      and read by the kernel.
      
      The new field is exposed to the userspace application through
      'cq_off.flags'. It reuses 4 bytes that were previously reserved and set
      to zero, which means that if the application finds this field zero, the
      new functionality is not supported; the sketch below shows the check.
      
      The next patch introduces the first flag available.
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
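      A raw-syscall sketch of the detection, assuming headers new enough to
      declare cq_off.flags:

      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/io_uring.h>

      int main(void)
      {
              struct io_uring_params p;
              int fd;

              memset(&p, 0, sizeof(p));
              fd = (int)syscall(__NR_io_uring_setup, 4, &p);
              if (fd < 0) {
                      perror("io_uring_setup");
                      return 1;
              }
              /* a zero offset means the kernel predates cq_flags */
              printf("cq_off.flags = %u (%s)\n", p.cq_off.flags,
                     p.cq_off.flags ? "supported" : "not supported");
              close(fd);
              return 0;
      }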
    • io_uring: allow POLL_ADD with double poll_wait() users · 18bceab1
      By Jens Axboe
      Some file descriptors use separate waitqueues for their f_ops->poll()
      handler, most commonly one for read and one for write. The io_uring
      poll implementation doesn't work with that, as the second poll_wait()
      call causes the io_uring poll request to fail with -EINVAL.
      
      This affects (at least) tty devices and /dev/random, which is a big
      problem for event loops where some file descriptors work and others
      don't.
      
      With this fix, io_uring handles multiple waitqueues; a userspace check
      follows this entry.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
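      A minimal liburing check; it assumes stdout is a tty, whose poll
      handler registers two waitqueues:

      #include <poll.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <liburing.h>

      int main(void)
      {
              struct io_uring ring;
              struct io_uring_sqe *sqe;
              struct io_uring_cqe *cqe;

              if (io_uring_queue_init(4, &ring, 0) < 0)
                      return 1;
              sqe = io_uring_get_sqe(&ring);
              /* POLLOUT on a tty is normally ready, so this won't block */
              io_uring_prep_poll_add(sqe, STDOUT_FILENO, POLLOUT);
              io_uring_submit(&ring);
              io_uring_wait_cqe(&ring, &cqe);
              printf("poll res: %d\n", cqe->res);  /* was -EINVAL pre-fix */
              io_uring_cqe_seen(&ring, cqe);
              io_uring_queue_exit(&ring);
              return 0;
      }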
    • io_uring: batch reap of dead file registrations · 4a38aed2
      By Jens Axboe
      We currently embed and queue a work item per fixed_file_ref_node that
      we update, but if the workload does a lot of these, the associated
      kworker-events overhead can become quite noticeable.
      
      Since we rarely need to wait on these, batch them at 1-second intervals
      instead. If we do need to wait for them, we just flush the pending
      delayed work; the policy is modeled below.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
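      A standalone sketch of the batching policy; the kernel version uses
      delayed work, this models the same once-per-second rule:

      #include <stddef.h>
      #include <time.h>

      struct dnode { struct dnode *next; };

      static struct dnode *dead_list;
      static time_t last_reap;

      static void reap_all(void)
      {
              /* free every node on dead_list (elided), then empty it */
              dead_list = NULL;
              last_reap = time(NULL);
      }

      static void node_dead(struct dnode *n)
      {
              n->next = dead_list;
              dead_list = n;
              if (time(NULL) - last_reap >= 1)  /* batch: at most once/sec */
                      reap_all();
      }

      /* anyone who must wait simply flushes the pending batch */
      static void quiesce(void) { reap_all(); }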
  10. 15 May 2020: 1 commit
  11. 14 May 2020: 1 commit
    • io_uring: polled fixed file must go through free iteration · 9d9e88a2
      By Jens Axboe
      When we changed the file registration handling, it became important to
      iterate the bulk request-freeing list for fixed files as well, or we
      miss dropping the fixed file reference. Without that, we leak
      references and end up with a kworker stuck waiting for the file
      references to disappear.
      
      This also means we can remove the special casing of fixed vs non-fixed
      files: we need to iterate for both, and can just rely on
      __io_req_aux_free() doing io_put_file() instead of doing it manually;
      see the sketch below.
      
      Fixes: 05589553 ("io_uring: refactor file register/unregister/update handling")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
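      A hedged sketch of the unified free path, with stand-in types:

      #include <stdbool.h>
      #include <stddef.h>

      struct file_ff;
      struct req_ff {
              struct file_ff *file;
              bool fixed;             /* from the registered-file table */
      };

      /* drops either kind of reference: fixed files decrement their ref
       * node, normal files get an fput-style release */
      static void put_file(struct req_ff *req) { (void)req; }

      static void free_reqs(struct req_ff *reqs, int nr)
      {
              /* iterate every request the same way: skipping fixed files
               * here (the old special case) leaked their references */
              for (int i = 0; i < nr; i++)
                      if (reqs[i].file)
                              put_file(&reqs[i]);
      }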
  12. 11 May 2020: 1 commit
  13. 10 May 2020: 1 commit
  14. 09 May 2020: 2 commits
  15. 08 May 2020: 1 commit
    • io_uring: don't use 'fd' for openat/openat2/statx · 63ff8223
      By Jens Axboe
      We currently make some guesses as to when to open this fd, but in
      reality we have no business (or need) to do so at all. In fact, it makes
      certain things fail, like O_PATH.
      
      Remove the fd lookup from these opcodes; we're just passing the 'fd' to
      generic helpers anyway. With that, we can also remove the special casing
      of fd values in io_req_needs_file() and the 'fd_non_neg' check that
      we have, and we can ensure that we only read sqe->fd once.
      
      This fixes O_PATH usage with openat/openat2, and likewise the statx path
      oddities; a userspace check follows this entry.
      
      Cc: stable@vger.kernel.org # v5.6
      Reported-by: Max Kellermann <mk@cm4all.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
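      A minimal liburing check that O_PATH opens now work; error handling
      trimmed:

      #define _GNU_SOURCE             /* for O_PATH */
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <liburing.h>

      int main(void)
      {
              struct io_uring ring;
              struct io_uring_sqe *sqe;
              struct io_uring_cqe *cqe;

              if (io_uring_queue_init(4, &ring, 0) < 0)
                      return 1;
              sqe = io_uring_get_sqe(&ring);
              io_uring_prep_openat(sqe, AT_FDCWD, "/tmp", O_PATH, 0);
              io_uring_submit(&ring);
              io_uring_wait_cqe(&ring, &cqe);
              printf("openat(O_PATH) -> %d\n", cqe->res);  /* an fd on success */
              if (cqe->res >= 0)
                      close(cqe->res);
              io_uring_cqe_seen(&ring, cqe);
              io_uring_queue_exit(&ring);
              return 0;
      }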