1. 30 May 2020, 3 commits
  2. 27 May 2020, 7 commits
  3. 20 May 2020, 1 commit
  4. 18 May 2020, 6 commits
  5. 16 May 2020, 5 commits
    • io_uring: file registration list and lock optimization · 6a4d07cd
      Committed by Jens Axboe
      There's no point in using list_del_init() on entries that are going
      away, and the associated lock is always used in process context, so
      let's not use the IRQ-disabling/saving variant of the spinlock.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
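      The gist, as a hedged sketch rather than the actual io_uring code: list_del()
      is enough for an entry that is being freed, and a lock that is only ever taken
      in process context does not need the IRQ-saving spin_lock_irqsave() variant.

      #include <linux/list.h>
      #include <linux/slab.h>
      #include <linux/spinlock.h>

      struct file_ref {
              struct list_head list;
      };

      /* Process-context-only teardown path (illustrative). */
      static void drop_ref(spinlock_t *lock, struct file_ref *ref)
      {
              /* Plain spin_lock(): no need to disable and save IRQ state. */
              spin_lock(lock);
              /* The entry is going away, so list_del() suffices; list_del_init()
               * would only re-initialize a node nobody will look at again. */
              list_del(&ref->list);
              spin_unlock(lock);

              kfree(ref);
      }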
    • io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flags · 7e55a19c
      Committed by Stefano Garzarella
      This new flag can be set/cleared by the application to
      disable/enable eventfd notifications when a request is completed
      and queued to the CQ ring.
      
      Before this patch, notifications were always sent if an eventfd was
      registered, so IORING_CQ_EVENTFD_DISABLED is not set during
      initialization.
      
      It is up to the application to set the flag after initialization
      if no notifications are required at the beginning.
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
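      A hedged userspace sketch of using the flag; it assumes the CQ ring has
      already been mmap'd and that cq_flags points at the 32-bit flags word found
      at params.cq_off.flags (the helper name is illustrative).

      #include <linux/io_uring.h>
      #include <stdbool.h>

      /* cq_flags: pointer into the mmap'd CQ ring at params.cq_off.flags. */
      static void set_cq_eventfd_disabled(unsigned int *cq_flags, bool disable)
      {
              unsigned int flags = __atomic_load_n(cq_flags, __ATOMIC_RELAXED);

              if (disable)
                      flags |= IORING_CQ_EVENTFD_DISABLED;
              else
                      flags &= ~IORING_CQ_EVENTFD_DISABLED;

              /* Release store so the kernel side sees a consistent value. */
              __atomic_store_n(cq_flags, flags, __ATOMIC_RELEASE);
      }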
    • io_uring: add 'cq_flags' field for the CQ ring · 0d9b5b3a
      Committed by Stefano Garzarella
      This patch adds the new 'cq_flags' field that should be written by
      the application and read by the kernel.
      
      This new field is available to the userspace application through
      'cq_off.flags'.
      We are using 4 bytes that were previously reserved and set to zero. This
      means that if the application finds this field set to zero, then the new
      functionality is not supported.
      
      In the next patch we will introduce the first flag available.
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
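      A hedged sketch of how an application could locate the new field after
      io_uring_setup(); the mmap follows the usual IORING_OFF_CQ_RING pattern, and
      (as described above) a zero cq_off.flags is taken to mean the running kernel
      predates the field. Helper name and error handling are illustrative.

      #include <linux/io_uring.h>
      #include <stddef.h>
      #include <sys/mman.h>

      /* ring_fd: fd from io_uring_setup(); p: the io_uring_params it filled in. */
      static unsigned int *map_cq_flags(int ring_fd, struct io_uring_params *p)
      {
              size_t cq_size = p->cq_off.cqes +
                               p->cq_entries * sizeof(struct io_uring_cqe);
              void *cq_ring;

              if (!p->cq_off.flags)   /* previously-reserved bytes still zero */
                      return NULL;    /* kernel doesn't expose cq_flags */

              cq_ring = mmap(NULL, cq_size, PROT_READ | PROT_WRITE,
                             MAP_SHARED, ring_fd, IORING_OFF_CQ_RING);
              if (cq_ring == MAP_FAILED)
                      return NULL;

              return (unsigned int *)((char *)cq_ring + p->cq_off.flags);
      }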
    • io_uring: allow POLL_ADD with double poll_wait() users · 18bceab1
      Committed by Jens Axboe
      Some file descriptors use separate waitqueues for their f_ops->poll()
      handler, most commonly one for read and one for write. The io_uring
      poll implementation doesn't work with that, as the 2nd poll_wait()
      call will cause the io_uring poll request to fail with -EINVAL.
      
      This affects (at least) tty devices and /dev/random as well. This is a
      big problem for event loops where some file descriptors work, and others
      don't.
      
      With this fix, io_uring handles multiple waitqueues.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
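      For context, a hedged sketch of the kind of ->poll() handler that tripped
      the old code: two poll_wait() calls, one per waitqueue. The driver and field
      names are illustrative, not from the kernel tree.

      #include <linux/fs.h>
      #include <linux/poll.h>
      #include <linux/wait.h>

      struct demo_dev {
              wait_queue_head_t read_wait;
              wait_queue_head_t write_wait;
              bool readable;
              bool writable;
      };

      static __poll_t demo_poll(struct file *file, poll_table *wait)
      {
              struct demo_dev *dev = file->private_data;
              __poll_t mask = 0;

              /* Two waitqueues, two poll_wait() calls: the second one used to
               * make an io_uring POLL_ADD request fail with -EINVAL. */
              poll_wait(file, &dev->read_wait, wait);
              poll_wait(file, &dev->write_wait, wait);

              if (dev->readable)
                      mask |= EPOLLIN | EPOLLRDNORM;
              if (dev->writable)
                      mask |= EPOLLOUT | EPOLLWRNORM;
              return mask;
      }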
    • io_uring: batch reap of dead file registrations · 4a38aed2
      Committed by Jens Axboe
      We currently embed and queue a work item per fixed_file_ref_node that
      we update, but if the workload does a lot of these, then the associated
      kworker-events overhead can become quite noticeable.
      
      Since we rarely need to wait on these, batch them at 1 second intervals
      instead. If we do need to wait for them, we just flush the pending
      delayed work.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
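      A hedged sketch of the batching pattern described above, with illustrative
      names rather than the io_uring implementation: dead nodes collect on a list,
      one delayed work item reaps them roughly once per second, and a caller that
      must wait simply flushes the delayed work.

      #include <linux/jiffies.h>
      #include <linux/list.h>
      #include <linux/slab.h>
      #include <linux/spinlock.h>
      #include <linux/workqueue.h>

      struct dead_node {
              struct list_head list;
      };

      static LIST_HEAD(dead_nodes);
      static DEFINE_SPINLOCK(dead_lock);
      static void reap_fn(struct work_struct *work);
      static DECLARE_DELAYED_WORK(reap_work, reap_fn);

      static void reap_fn(struct work_struct *work)
      {
              struct dead_node *node, *tmp;
              LIST_HEAD(local);

              spin_lock(&dead_lock);
              list_splice_init(&dead_nodes, &local);
              spin_unlock(&dead_lock);

              list_for_each_entry_safe(node, tmp, &local, list)
                      kfree(node);
      }

      static void queue_dead_node(struct dead_node *node)
      {
              spin_lock(&dead_lock);
              list_add_tail(&node->list, &dead_nodes);
              spin_unlock(&dead_lock);
              /* Batch: one work execution per second instead of one per node. */
              queue_delayed_work(system_wq, &reap_work, HZ);
      }

      static void wait_for_reap(void)
      {
              /* If we do need the nodes gone now, flush the pending work. */
              flush_delayed_work(&reap_work);
      }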
  6. 15 May 2020, 1 commit
  7. 11 May 2020, 1 commit
  8. 09 May 2020, 2 commits
  9. 08 May 2020, 1 commit
    • io_uring: don't use 'fd' for openat/openat2/statx · 63ff8223
      Committed by Jens Axboe
      We currently make some guesses as to when to open this fd, but in reality
      we have no business (or need) to do so at all. In fact, it makes certain
      things fail, like O_PATH.
      
      Remove the fd lookup from these opcodes, we're just passing the 'fd' to
      generic helpers anyway. With that, we can also remove the special casing
      of fd values in io_req_needs_file(), and the 'fd_non_neg' check that
      we have. And we can ensure that we only read sqe->fd once.
      
      This fixes O_PATH usage with openat/openat2, and ditto statx path side
      oddities.
      
      Cc: stable@vger.kernel.org # v5.6
      Reported-by: Max Kellermann <mk@cm4all.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
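      A hedged userspace sketch of the case this fixes, written against liburing
      (assumed here, not part of the patch): an IORING_OP_OPENAT with O_PATH,
      which previously failed because the kernel tried to resolve sqe->fd too early.

      #include <fcntl.h>
      #include <liburing.h>
      #include <stdio.h>

      int main(void)
      {
              struct io_uring ring;
              struct io_uring_sqe *sqe;
              struct io_uring_cqe *cqe;

              if (io_uring_queue_init(8, &ring, 0) < 0)
                      return 1;

              sqe = io_uring_get_sqe(&ring);
              /* O_PATH open: works after this fix, used to fail. */
              io_uring_prep_openat(sqe, AT_FDCWD, "/tmp", O_PATH, 0);

              io_uring_submit(&ring);
              if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                      printf("openat(O_PATH) result: %d\n", cqe->res);
                      io_uring_cqe_seen(&ring, cqe);
              }
              io_uring_queue_exit(&ring);
              return 0;
      }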
  10. 06 May 2020, 1 commit
  11. 04 May 2020, 1 commit
  12. 01 May 2020, 7 commits
    • io_uring: punt splice async because of inode mutex · 2fb3e822
      Committed by Pavel Begunkov
      Nonblocking do_splice() may still wait for some time on an inode mutex.
      Let's play safe and always punt it async.
      Reported-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: check non-sync defer_list carefully · 4ee36314
      Committed by Pavel Begunkov
      io_req_defer() does double-checked locking. Use proper helpers for that,
      i.e. list_empty_careful().
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
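      A hedged sketch of the double-checked pattern in question (illustrative
      helper, not the io_uring code): a lockless list_empty_careful() fast path,
      then the same check repeated under the lock before committing.

      #include <linux/list.h>
      #include <linux/spinlock.h>
      #include <linux/types.h>

      static bool defer_if_busy(spinlock_t *lock, struct list_head *defer_list,
                                struct list_head *entry)
      {
              /* Lockless fast path: list_empty_careful() tolerates concurrent
               * updates, unlike plain list_empty() on an unlocked list. */
              if (list_empty_careful(defer_list))
                      return false;

              spin_lock(lock);
              /* Re-check under the lock before deciding to defer. */
              if (list_empty(defer_list)) {
                      spin_unlock(lock);
                      return false;
              }
              list_add_tail(entry, defer_list);
              spin_unlock(lock);
              return true;
      }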
    • io_uring: fix extra put in sync_file_range() · 7759a0bf
      Committed by Pavel Begunkov
      [   40.179474] refcount_t: underflow; use-after-free.
      [   40.179499] WARNING: CPU: 6 PID: 1848 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0
      ...
      [   40.179612] RIP: 0010:refcount_warn_saturate+0xae/0xf0
      [   40.179617] Code: 28 44 0a 01 01 e8 d7 01 c2 ff 0f 0b 5d c3 80 3d 15 44 0a 01 00 75 91 48 c7 c7 b8 f5 75 be c6 05 05 44 0a 01 01 e8 b7 01 c2 ff <0f> 0b 5d c3 80 3d f3 43 0a 01 00 0f 85 6d ff ff ff 48 c7 c7 10 f6
      [   40.179619] RSP: 0018:ffffb252423ebe18 EFLAGS: 00010286
      [   40.179623] RAX: 0000000000000000 RBX: ffff98d65e929400 RCX: 0000000000000000
      [   40.179625] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 00000000ffffffff
      [   40.179627] RBP: ffffb252423ebe18 R08: 0000000000000001 R09: 000000000000055d
      [   40.179629] R10: 0000000000000c8c R11: 0000000000000001 R12: 0000000000000000
      [   40.179631] R13: ffff98d68c434400 R14: ffff98d6a9cbaa20 R15: ffff98d6a609ccb8
      [   40.179634] FS:  0000000000000000(0000) GS:ffff98d6af580000(0000) knlGS:0000000000000000
      [   40.179636] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   40.179638] CR2: 00000000033e3194 CR3: 000000006480a003 CR4: 00000000003606e0
      [   40.179641] Call Trace:
      [   40.179652]  io_put_req+0x36/0x40
      [   40.179657]  io_free_work+0x15/0x20
      [   40.179661]  io_worker_handle_work+0x2f5/0x480
      [   40.179667]  io_wqe_worker+0x2a9/0x360
      [   40.179674]  ? _raw_spin_unlock_irqrestore+0x24/0x40
      [   40.179681]  kthread+0x12c/0x170
      [   40.179685]  ? io_worker_handle_work+0x480/0x480
      [   40.179690]  ? kthread_park+0x90/0x90
      [   40.179695]  ret_from_fork+0x35/0x40
      [   40.179702] ---[ end trace 85027405f00110aa ]---
      
      The opcode handler must never put the submission ref, but that's what
      io_sync_file_range_finish() does. Use io_steal_work() there.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use cond_resched() in io_ring_ctx_wait_and_kill() · 3fd44c86
      Committed by Xiaoguang Wang
      While working on making io_uring sqpoll mode support syscalls that need
      struct files_struct, I got a CPU soft lockup in io_ring_ctx_wait_and_kill():
      
          while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
              cpu_relax();
      
      The loop above never gets a chance to exit because preemption isn't enabled
      in the kernel, and the context calling io_ring_ctx_wait_and_kill() and
      io_sq_thread() run on the same CPU. If io_sq_thread() calls cond_resched()
      to yield the CPU and another context then enters the loop above,
      io_sq_thread() stays in the runqueue forever and the loop never exits.
      
      Using cond_resched() instead fixes this issue.
      
      Reported-by: syzbot+66243bb7126c410cefe6@syzkaller.appspotmail.com
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
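      A before/after sketch of the loop quoted in the message (surrounding context
      elided): cpu_relax() never lets the scheduler run, while cond_resched() lets
      the co-scheduled io_sq_thread() make progress and exit.

      /* Before: can spin forever on a non-preemptible kernel if io_sq_thread()
       * is runnable on the same CPU. */
      while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
              cpu_relax();

      /* After: give the scheduler a chance to run io_sq_thread(). */
      while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
              cond_resched();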
    • io_uring: use proper references for fallback_req locking · dd461af6
      Committed by Bijan Mottahedeh
      Use ctx->fallback_req address for test_and_set_bit_lock() and
      clear_bit_unlock().
      Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
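      A hedged sketch of the bit-lock idiom involved; for clarity it uses a
      dedicated state word, whereas the commit uses the ctx->fallback_req address
      itself. The point is that the acquire and release must operate on the same
      word, which is what the fix lines up.

      #include <linux/bitops.h>

      struct demo_ctx {
              unsigned long fallback_lock;    /* bit 0 used as a lock bit */
              void *fallback_req;
      };

      static void *demo_get_fallback(struct demo_ctx *ctx)
      {
              /* Returns the old bit value: 0 means we acquired the lock bit. */
              if (!test_and_set_bit_lock(0, &ctx->fallback_lock))
                      return ctx->fallback_req;
              return NULL;
      }

      static void demo_put_fallback(struct demo_ctx *ctx)
      {
              /* Release must pair with the acquire on the same word. */
              clear_bit_unlock(0, &ctx->fallback_lock);
      }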
    • io_uring: only force async punt if poll based retry can't handle it · 490e8967
      Committed by Jens Axboe
      We do blocking retry from our poll handler, if the file supports polled
      notifications. Only mark the request as needing an async worker if we
      can't poll for it.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: enable poll retry for any file with ->read_iter / ->write_iter · af197f50
      Committed by Jens Axboe
      We can have files like eventfd where it's perfectly fine to do poll-based
      retry on them; right now io_file_supports_async() doesn't take that
      into account.
      
      Pass in data direction and check the f_op instead of just always needing
      an async worker.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
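      A hedged sketch of the idea (illustrative helper, not the actual io_uring
      function): look at the file's f_op for the requested direction before
      deciding that an async worker is needed.

      #include <linux/fs.h>
      #include <linux/kernel.h>

      /* rw is READ or WRITE, following the kernel's data-direction convention. */
      static bool demo_supports_poll_retry(struct file *file, int rw)
      {
              /* Files like eventfd have iter ops, so poll-based retry is fine. */
              if (rw == READ)
                      return file->f_op->read_iter != NULL;
              return file->f_op->write_iter != NULL;
      }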
  13. 28 April 2020, 1 commit
    • io_uring: statx must grab the file table for valid fd · 5b0bbee4
      Committed by Jens Axboe
      Clay reports that OP_STATX fails for a test case with a valid fd
      and empty path:
      
       -- Test 0: statx:fd 3: SUCCEED, file mode 100755
       -- Test 1: statx:path ./uring_statx: SUCCEED, file mode 100755
       -- Test 2: io_uring_statx:fd 3: FAIL, errno 9: Bad file descriptor
       -- Test 3: io_uring_statx:path ./uring_statx: SUCCEED, file mode 100755
      
      This is due to statx not grabbing the process file table, hence we can't
      look up the fd in async context. If the fd is valid, ensure that we grab
      the file table so we can grab the file from async context.
      
      Cc: stable@vger.kernel.org # v5.6
      Reported-by: Clay Harris <bugs@claycon.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
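      A hedged userspace sketch of the failing "Test 2" case, written against
      liburing (assumed here): statx by fd with an empty path and AT_EMPTY_PATH,
      which is exactly the lookup that needs the process file table in async
      context.

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <liburing.h>
      #include <stdio.h>
      #include <sys/stat.h>

      int main(void)
      {
              struct io_uring ring;
              struct io_uring_sqe *sqe;
              struct io_uring_cqe *cqe;
              struct statx stx;
              int fd = open("/etc/hostname", O_RDONLY);

              if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
                      return 1;

              sqe = io_uring_get_sqe(&ring);
              /* fd + empty path + AT_EMPTY_PATH: the case that returned EBADF. */
              io_uring_prep_statx(sqe, fd, "", AT_EMPTY_PATH,
                                  STATX_BASIC_STATS, &stx);

              io_uring_submit(&ring);
              if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                      printf("statx result: %d, mode %o\n", cqe->res,
                             (unsigned int)stx.stx_mode);
                      io_uring_cqe_seen(&ring, cqe);
              }
              io_uring_queue_exit(&ring);
              return 0;
      }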
  14. 20 April 2020, 1 commit
    • io_uring: only restore req->work for req that needs do completion · 44575a67
      Committed by Xiaoguang Wang
      When testing the io_uring IORING_FEAT_FAST_POLL feature, I got the panic below:
      BUG: kernel NULL pointer dereference, address: 0000000000000030
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      CPU: 5 PID: 2154 Comm: io_uring_echo_s Not tainted 5.6.0+ #359
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
      RIP: 0010:io_wq_submit_work+0xf/0xa0
      Code: ff ff ff be 02 00 00 00 e8 ae c9 19 00 e9 58 ff ff ff 66 0f 1f
      84 00 00 00 00 00 0f 1f 44 00 00 41 54 49 89 fc 55 53 48 8b 2f <8b>
      45 30 48 8d 9d 48 ff ff ff 25 01 01 00 00 83 f8 01 75 07 eb 2a
      RSP: 0018:ffffbef543e93d58 EFLAGS: 00010286
      RAX: ffffffff84364f50 RBX: ffffa3eb50f046b8 RCX: 0000000000000000
      RDX: ffffa3eb0efc1840 RSI: 0000000000000006 RDI: ffffa3eb50f046b8
      RBP: 0000000000000000 R08: 00000000fffd070d R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffa3eb50f046b8
      R13: ffffa3eb0efc2088 R14: ffffffff85b69be0 R15: ffffa3eb0effa4b8
      FS:  00007fe9f69cc4c0(0000) GS:ffffa3eb5ef40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000030 CR3: 0000000020410000 CR4: 00000000000006e0
      Call Trace:
       task_work_run+0x6d/0xa0
       do_exit+0x39a/0xb80
       ? get_signal+0xfe/0xbc0
       do_group_exit+0x47/0xb0
       get_signal+0x14b/0xbc0
       ? __x64_sys_io_uring_enter+0x1b7/0x450
       do_signal+0x2c/0x260
       ? __x64_sys_io_uring_enter+0x228/0x450
       exit_to_usermode_loop+0x87/0xf0
       do_syscall_64+0x209/0x230
       entry_SYSCALL_64_after_hwframe+0x49/0xb3
      RIP: 0033:0x7fe9f64f8df9
      Code: Bad RIP value.
      
      task_work_run() calls io_wq_submit_work() unexpectedly; it's obvious that
      struct callback_head's func member has been changed. After looking into
      the code, I found this issue is again due to the union definition:
          union {
              /*
               * Only commands that never go async can use the below fields,
               * obviously. Right now only IORING_OP_POLL_ADD uses them, and
               * async armed poll handlers for regular commands. The latter
               * restore the work, if needed.
               */
              struct {
                  struct callback_head	task_work;
                  struct hlist_node	hash_node;
                  struct async_poll	*apoll;
              };
              struct io_wq_work	work;
          };
      
      When task_work_run() has multiple work items to execute, the work that calls
      io_poll_remove_all() always restores req->work for non-poll requests. But if
      a non-poll request has already been added to a new callback_head, a
      subsequent callback will call io_async_task_func() to handle it, which means
      we should not restore req->work for such a non-poll request. Meanwhile, in
      io_async_task_func() we should drop the submit ref when the request has been
      canceled.
      
      Fix both issues.
      
      Fixes: b1f573bd ("io_uring: restore req->work when canceling poll request")
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      
      Use io_double_put_req()
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 15 April 2020, 2 commits
    • io_uring: don't count rqs failed after current one · 31af27c7
      Committed by Pavel Begunkov
      When checking for draining with __req_need_defer(), it tries to match how
      many requests were sent before the current one against the number already
      completed. Dropped SQEs are included in req->sequence, but they will never
      appear in the CQ. To compensate for that, __req_need_defer() subtracts
      ctx->cached_sq_dropped.
      However, what it should really use is the number of SQEs dropped __before__
      the current one. In other words, a request submitted later shouldn't affect
      dequeueing from the drain queue of previously submitted ones.
      
      Instead of saving the proper ctx->cached_sq_dropped in each request,
      subtract it from req->sequence at initialisation, so the sequence reflects
      only the number of properly submitted requests.
      
      Note: this also changes the behaviour of timeouts, but
      1. it already diverged from the description because of using the SQ
      2. the description is ambiguous regarding dropped SQEs
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: kill already cached timeout.seq_offset · b55ce732
      Committed by Pavel Begunkov
      req->timeout.count and req->io->timeout.seq_offset store the same value,
      which is sqe->off. Kill the second one.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>