1. 10 March 2020, 3 commits
    • io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV · 4d954c25
      Committed by Jens Axboe
      This adds buffer selection support for the vectored read. It is limited
      to a single segment in the iov, and is provided as a convenience for
      applications that already use IORING_OP_READV.
      
      The iov helpers will be used for IORING_OP_RECVMSG as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
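
      [Editorial illustration, not part of the commit message] A minimal sketch
      of the single-segment restriction described above, assuming liburing on a
      kernel with buffer selection support and that buffer group 'bgid' was
      already populated via IORING_OP_PROVIDE_BUFFERS; the helper name is made
      up:

        #include <liburing.h>
        #include <sys/uio.h>

        /* Queue a buffer-select READV. Only one iov segment is permitted; the
         * kernel substitutes the selected buffer for iov_base, and iov_len caps
         * the read size. The iovec must stay valid until the request completes. */
        static int queue_readv_select(struct io_uring *ring, int fd,
                                      struct iovec *iov, unsigned short bgid)
        {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!sqe)
                return -1;
            io_uring_prep_readv(sqe, fd, iov, 1, 0);   /* exactly 1 segment */
            sqe->flags |= IOSQE_BUFFER_SELECT;
            sqe->buf_group = bgid;
            return io_uring_submit(ring);
        }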
    • io_uring: support buffer selection for OP_READ and OP_RECV · bcda7baa
      Committed by Jens Axboe
      If a server process has tons of pending socket connections, generally
      it uses epoll to wait for activity. When the socket is ready for reading
      (or writing), the task can select a buffer and issue a recv/send on the
      given fd.
      
      Now that we have fast (non-async thread) support, a task can have tons
      of reads or writes pending. But that means it needs buffers to back that
      data, and if the number of connections is high enough, preallocating
      them for all possible connections is infeasible.
      
      With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
      use for any request. The request then sets IOSQE_BUFFER_SELECT in the
      sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
      a free buffer from the specified group is selected. If none are
      available, the request is terminated with -ENOBUFS. If successful, the
      CQE on completion will contain the buffer ID chosen in the cqe->flags
      member, encoded as:
      
      	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
      
      Once a buffer has been consumed by a request, it is no longer available
      and must be registered again with IORING_OP_PROVIDE_BUFFERS.
      
      Request types need to support this feature; for now, IORING_OP_READ and
      IORING_OP_RECV do. This is checked on SQE submission, and a CQE with
      res == -EOPNOTSUPP will be posted if buffer selection is attempted on an
      unsupported request.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
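
      [Editorial illustration, not part of the commit message] A minimal
      userspace sketch of the flow above, assuming liburing (>= 0.7) on a
      kernel that has IORING_OP_PROVIDE_BUFFERS; GROUP_ID, NR_BUFS, BUF_LEN and
      sockfd are placeholder names, and error handling is kept to a minimum:

        #include <liburing.h>
        #include <stdio.h>

        #define GROUP_ID 1
        #define NR_BUFS  8
        #define BUF_LEN  4096

        static char bufs[NR_BUFS][BUF_LEN];

        static int recv_with_buffer_select(struct io_uring *ring, int sockfd)
        {
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;

            /* Register NR_BUFS buffers of BUF_LEN bytes each under GROUP_ID,
             * starting at buffer ID 0. */
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_provide_buffers(sqe, bufs, BUF_LEN, NR_BUFS, GROUP_ID, 0);
            io_uring_submit(ring);
            if (io_uring_wait_cqe(ring, &cqe) || cqe->res < 0)
                return -1;
            io_uring_cqe_seen(ring, cqe);

            /* Let the kernel pick a free buffer from GROUP_ID for the recv. */
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_recv(sqe, sockfd, NULL, BUF_LEN, 0);
            sqe->flags |= IOSQE_BUFFER_SELECT;
            sqe->buf_group = GROUP_ID;
            io_uring_submit(ring);
            if (io_uring_wait_cqe(ring, &cqe))
                return -1;

            if (cqe->res == -ENOBUFS) {
                /* no free buffer was left in the group */
            } else if (cqe->res >= 0 && (cqe->flags & IORING_CQE_F_BUFFER)) {
                int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
                printf("got %d bytes in buffer %d\n", cqe->res, bid);
            }
            io_uring_cqe_seen(ring, cqe);
            return 0;
        }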
    • io_uring: add IORING_OP_PROVIDE_BUFFERS · ddf0322d
      Committed by Jens Axboe
      IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
      support passing in an addr/len that is associated with a buffer ID and
      buffer group ID. The group ID is used to index and look up the buffers,
      while the buffer ID can be used to notify the application which buffer
      in the group was used. The addr passed in is the starting buffer address,
      and length is the length of each buffer. The number of buffers to add can
      be specified, in which case addr is incremented by length for each
      addition, and the buffer ID is incremented by one for each buffer.
      
      No validation is done on the buffer ID. If the application provides
      buffers within the same group with identical buffer IDs, then it'll have
      a hard time telling which buffer ID was used. The only restriction is
      that the buffer ID is limited to 16 bits, so USHRT_MAX is the maximum ID
      that can be used.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
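
      [Editorial illustration, not part of the commit message] A sketch of how
      addr, len, and the buffer ID advance, using the liburing helper; the
      function name and values are placeholders:

        #include <liburing.h>

        /* Provide 'nr' contiguous buffers of 'len' bytes starting at 'base'.
         * Buffer IDs run from first_bid to first_bid + nr - 1 within group
         * 'bgid'; e.g. len=4096, nr=4, first_bid=10 registers buffers at base,
         * base+4096, base+8192, base+12288 with IDs 10, 11, 12, 13. */
        static int provide_contiguous_buffers(struct io_uring *ring, void *base,
                                              int len, int nr, int bgid,
                                              int first_bid)
        {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!sqe)
                return -1;
            io_uring_prep_provide_buffers(sqe, base, len, nr, bgid, first_bid);
            return io_uring_submit(ring);
        }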
  2. 05 March 2020, 2 commits
  3. 04 March 2020, 3 commits
  4. 03 March 2020, 18 commits
  5. 28 February 2020, 1 commit
  6. 27 February 2020, 2 commits
    • io_uring: define and set show_fdinfo only if procfs is enabled · bebdb65e
      Committed by Tobias Klauser
      Follow the pattern used with other *_show_fdinfo functions and only
      define and use io_uring_show_fdinfo and its helper functions if
      CONFIG_PROC_FS is set.
      
      Fixes: 87ce955b ("io_uring: add ->show_fdinfo() for the io_uring file descriptor")
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
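
      [Editorial illustration, not part of the commit message] The general
      kernel pattern being followed, sketched with made-up 'foo' names rather
      than the actual io_uring code:

        #include <linux/fs.h>
        #include <linux/seq_file.h>

        #ifdef CONFIG_PROC_FS
        /* Only compiled when procfs exists, since fdinfo is shown via procfs. */
        static void foo_show_fdinfo(struct seq_file *m, struct file *f)
        {
            seq_printf(m, "example:\t%d\n", 0);
        }
        #endif

        static const struct file_operations foo_fops = {
        #ifdef CONFIG_PROC_FS
            .show_fdinfo = foo_show_fdinfo,
        #endif
        };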
    • io_uring: drop file set ref put/get on switch · dd3db2a3
      Committed by Jens Axboe
      Dan reports that he triggered a warning on ring exit doing some testing:
      
      percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic
      WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Modules linked in:
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83
      RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292
      RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000
      RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c
      RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045
      R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0
      R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009
      FS:  0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       rcu_core+0x1e4/0x4d0
       __do_softirq+0xdb/0x2f1
       irq_exit+0xa0/0xb0
       smp_apic_timer_interrupt+0x60/0x140
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:default_idle+0x23/0x170
      Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95
      
      It turns out that this is due to percpu_ref_switch_to_atomic() only
      grabbing a reference to the percpu refcount if it's not already in
      atomic mode. io_uring drops a ref and re-gets it when switching back to
      percpu mode. We attempt to protect against this with the FFD_F_ATOMIC
      bit, but that isn't reliable.
      
      We don't actually need to juggle these refcounts between the atomic and
      percpu switches; we can just drop and re-get the reference once we've
      switched to atomic mode. This removes the need for the unreliable
      FFD_F_ATOMIC bit.
      
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 26 February 2020, 2 commits
    • io_uring: import_single_range() returns 0/-ERROR · 3a901598
      Committed by Jens Axboe
      Unlike the other core import helpers, import_single_range() returns 0 on
      success, not the length imported. This means that links depending on the
      result of the non-vectored IORING_OP_{READ,WRITE} commands added for 5.5
      get errored when they should not be.
      
      Fixes: 3a6820f2 ("io_uring: add non-vectored read/write commands")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
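
      [Editorial illustration, not part of the commit message] A sketch of the
      gotcha with a made-up caller: import_single_range() returns 0 on success,
      so the imported length has to come from the iterator instead:

        #include <linux/fs.h>
        #include <linux/uio.h>

        static ssize_t example_import(void __user *buf, size_t len,
                                      struct iovec *iov, struct iov_iter *iter)
        {
            int ret = import_single_range(READ, buf, len, iov, iter);

            if (ret)
                return ret;                 /* negative error code */
            return iov_iter_count(iter);    /* the imported length */
        }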
    • io_uring: pick up link work on submit reference drop · 2a44f467
      Committed by Jens Axboe
      If work completes inline, then we should pick up a dependent link item
      in __io_queue_sqe() as well. If we don't do so, we're forced to go async
      with that item, which is suboptimal.
      
      This also fixes an issue with io_put_req_find_next(), which always looks
      up the next work item. That should only be done if we're dropping the
      last reference to the request, to prevent multiple lookups of the same
      work item.
      
      Outside of being a fix, this also enables a good cleanup series for 5.7,
      where we never have to pass 'nxt' around or into the work handlers.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 25 February 2020, 1 commit
    • io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL · bdcd3eab
      Committed by Xiaoguang Wang
      After making ext4 support the iopoll method (setting
      ext4_file_operations' iopoll method to iomap_dio_iopoll()), we found fio
      can easily hang in fio_ioring_getevents(), with IORING_SETUP_SQPOLL and
      IORING_SETUP_IOPOLL enabled, using the fio job below:
          rm -f testfile; sync;
          sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread \
              -rw=write -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 \
              -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
      
      There are two issues that result in this hang. One reason is that when
      IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not
      use io_uring_enter to reap completed events; it relies on the kernel
      io_sq_thread to poll for them.
      
      The other reason is a race: when io_submit_sqes() in io_sq_thread()
      submits a batch of sqes, the 'inflight' variable records the number of
      submitted reqs, and io_sq_thread then polls for the reqs that have been
      added to the poll_list. But if some of the previously submitted reqs
      have been punted to an io worker, they won't show up in the poll_list in
      time. io_sq_thread() will then only poll for part of the submitted reqs,
      find the poll_list empty, and reset 'inflight' to zero. If the app just
      waits for these deferred reqs and does not wake up io_sq_thread again,
      the hang happens.
      
      For apps that rely entirely on io_sq_thread to poll for completed
      requests, let io_iopoll_req_issued() wake up io_sq_thread properly when
      adding a new element to the poll_list, and when io_sq_thread prepares to
      sleep, check whether the poll_list is empty again; if it is not empty,
      continue to poll.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
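
      [Editorial illustration, not part of the commit message] The application
      pattern the fix caters to, sketched with liburing and made-up names: with
      SETUP_SQPOLL|SETUP_IOPOLL the app never calls io_uring_enter() for
      completions and only spins on the CQ ring, which is why it hangs if the
      kernel-side io_sq_thread stops polling the outstanding reqs:

        #include <liburing.h>

        static int reap_completions(struct io_uring *ring, int want)
        {
            struct io_uring_cqe *cqe;
            int reaped = 0;

            while (reaped < want) {
                if (io_uring_peek_cqe(ring, &cqe) != 0)
                    continue;           /* spin on the CQ ring, no syscall */
                /* handle cqe->res here */
                io_uring_cqe_seen(ring, cqe);
                reaped++;
            }
            return reaped;
        }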
  9. 24 February 2020, 2 commits
  10. 22 February 2020, 2 commits
    • io_uring: fix __io_iopoll_check deadlock in io_sq_thread · c7849be9
      Committed by Xiaoguang Wang
      Since commit a3a0e43f ("io_uring: don't enter poll loop if we have
      CQEs pending"), if we already have events pending, we won't enter the
      poll loop. When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if the
      app has been terminated without reaping the pending events already in
      the cq ring, and there are still reqs in the poll_list, io_sq_thread
      will enter __io_iopoll_check(), find pending events, and return; this
      loop never gets a chance to exit.
      
      I have seen this issue in fio stress tests. To fix it, let io_sq_thread
      call io_iopoll_getevents() with the 'min' argument set to zero, and
      remove __io_iopoll_check().
      
      Fixes: a3a0e43f ("io_uring: don't enter poll loop if we have CQEs pending")
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: prevent sq_thread from spinning when it should stop · 7143b5ac
      Committed by Stefano Garzarella
      This patch drops 'cur_mm' before calling cond_resched(), to prevent
      the sq_thread from spinning even when the user process is finished.
      
      Before this patch, if the user process exited without closing the
      io_uring fd, the sq_thread would continue to spin until the
      'sq_thread_idle' timeout expired.
      
      In the worst case where the 'sq_thread_idle' parameter is bigger than
      INT_MAX, the sq_thread will spin forever.
      
      Fixes: 6c271ce2 ("io_uring: add submission polling")
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
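
      [Editorial illustration, not part of the commit message] Where
      'sq_thread_idle' comes from, sketched with liburing; the values are
      placeholders, and SQPOLL typically needs elevated privileges on these
      kernels:

        #include <liburing.h>

        static int setup_sqpoll_ring(struct io_uring *ring, unsigned entries)
        {
            struct io_uring_params p = { };

            p.flags = IORING_SETUP_SQPOLL;
            p.sq_thread_idle = 2000;   /* ms before the sq_thread goes idle */
            return io_uring_queue_init_params(entries, ring, &p);
        }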
  11. 19 February 2020, 2 commits
  12. 17 February 2020, 1 commit
  13. 14 February 2020, 1 commit
    • io_uring: prune request from overflow list on flush · 2ca10259
      Committed by Jens Axboe
      Carter reported an issue where he could produce a stall on ring exit,
      when we're cleaning up requests that match the given file table. For
      this particular test case, a combination of a few things caused the
      issue:
      
      - The cq ring had overflowed
      - The request being canceled was in the overflow list
      
      The combination of the above means that the cq overflow list holds a
      reference to the request. The request is canceled correctly, but since
      the overflow list holds a reference to it, the final put won't happen.
      Since the final put doesn't happen, the request remains on the inflight
      list, and hence we never finish the cancelation flush.
      
      Fix this by removing requests from the overflow list if we're canceling
      them.
      
      Cc: stable@vger.kernel.org # 5.5
      Reported-by: Carter Li 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>