1. 21 Mar 2020, 1 commit
    • io_uring: honor original task RLIMIT_FSIZE · 4ed734b0
      Committed by Jens Axboe
      With the previous fixes for checking the number of open files, I added
      some debug code to see if we had other spots where we check rlimit()
      against the async io-wq workers. The only one I found was the file size
      check, which we should also honor.
      
      During write and fallocate prep, store the max file size and override
      that for the current request if we're in io-wq worker context.
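
      As a hedged illustration of the user-visible effect (a sketch assuming
      liburing is installed; the file name and limit are arbitrary), a write
      submitted through io_uring at or beyond the task's RLIMIT_FSIZE should
      now fail with -EFBIG even when it runs from an async io-wq worker, just
      like a plain write(2):

      #include <fcntl.h>
      #include <signal.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/resource.h>
      #include <liburing.h>

      int main(void)
      {
              struct rlimit rl = { .rlim_cur = 4096, .rlim_max = 4096 };
              struct io_uring ring;
              struct io_uring_sqe *sqe;
              struct io_uring_cqe *cqe;
              static char buf[4096];
              int fd;

              signal(SIGXFSZ, SIG_IGN);       /* get -EFBIG instead of a signal */
              setrlimit(RLIMIT_FSIZE, &rl);   /* cap file size at 4 KiB */
              fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
              io_uring_queue_init(8, &ring, 0);

              sqe = io_uring_get_sqe(&ring);
              /* offset 4096 == the limit, so this write must be refused */
              io_uring_prep_write(sqe, fd, buf, sizeof(buf), 4096);
              io_uring_submit(&ring);

              io_uring_wait_cqe(&ring, &cqe);
              printf("write result: %d (%s)\n", cqe->res,
                     cqe->res < 0 ? strerror(-cqe->res) : "ok");
              io_uring_cqe_seen(&ring, cqe);
              io_uring_queue_exit(&ring);
              return 0;
      }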
      
      Cc: stable@vger.kernel.org # 5.1+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 15 Mar 2020, 1 commit
  3. 12 Mar 2020, 1 commit
  4. 11 Mar 2020, 1 commit
    • io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled · 32b2244a
      Committed by Xiaoguang Wang
      When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't
      need to poll for io completion events themselves; they can rely on
      io_sq_thread to do the polling, which reduces cpu usage and uring_lock
      contention.
      
      I modified the fio io_uring engine code a bit to evaluate the performance:
      static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
                              continue;
                      }
      
      -               if (!o->sqpoll_thread) {
      +               if (o->sqpoll_thread && o->hipri) {
                              r = io_uring_enter(ld, 0, actual_min,
                                                      IORING_ENTER_GETEVENTS);
                              if (r < 0) {
      
      and ran "fio  -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread
      -rw=read -ioengine=io_uring  -hipri=1 -sqthread_poll=1  -direct=1 -bs=4k
      -size=10G -numjobs=1  -time_based -runtime=120"
      
      original code
      --------------------------------------------------------------------
      iodepth       |        4 |        8 |       16 |       32 |       64
      bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
      fio cpu usage |     100% |     100% |     100% |     100% |     100%
      --------------------------------------------------------------------
      
      with patch
      --------------------------------------------------------------------
      iodepth       |        4 |        8 |       16 |       32 |       64
      bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
      fio cpu usage |    63.8% |    74.4% |    81.1% |    83.7% |    82.4%
      --------------------------------------------------------------------
      bw improve    |     5.5% |    13.2% |    12.3% |     9.8% |    11.5%
      --------------------------------------------------------------------
      
      From the above test results, we can see that bandwidth improves by
      roughly 5.5%~13%, and the fio process's cpu usage also drops
      considerably. Note this won't improve io_sq_thread's cpu usage when
      SETUP_IOPOLL|SETUP_SQPOLL are both enabled; in this case, io_sq_thread
      always runs at 100% cpu. I think this patch will be friendly to
      applications that often use io_uring_wait_cqe() or similar from liburing.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 10 Mar 2020, 7 commits
    • io_uring: Fix unused function warnings · 469956e8
      Committed by YueHaibing
      If CONFIG_NET is not set, gcc warns:
      
      fs/io_uring.c:3110:12: warning: io_setup_async_msg defined but not used [-Wunused-function]
       static int io_setup_async_msg(struct io_kiocb *req,
                  ^~~~~~~~~~~~~~~~~~
      
      There are many functions wrapped in CONFIG_NET; move them together to
      simplify the code, which also fixes this warning.
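
      A self-contained sketch of the idiom (CONFIG_NET here is just a
      compile-time define standing in for the Kconfig symbol, and the function
      names are placeholders): wrap the whole family of networking functions
      in one conditional block and provide -EOPNOTSUPP stubs otherwise, so no
      helper is left defined but unused.

      #include <errno.h>
      #include <stdio.h>

      #if defined(CONFIG_NET)
      /* Real implementations, including internal helpers, live here; the
       * helpers are only compiled when something in this block uses them. */
      static int setup_async_msg(void) { return 0; }
      static int do_sendmsg(void)      { return setup_async_msg(); }
      static int do_recvmsg(void)      { return setup_async_msg(); }
      #else
      /* One stub per externally visible operation; no unused helpers remain. */
      static int do_sendmsg(void)      { return -EOPNOTSUPP; }
      static int do_recvmsg(void)      { return -EOPNOTSUPP; }
      #endif

      int main(void)
      {
              printf("sendmsg: %d, recvmsg: %d\n", do_sendmsg(), do_recvmsg());
              return 0;
      }

      Building this with and without -DCONFIG_NET shows that neither variant
      triggers -Wunused-function.
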
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      
      Minor tweaks.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add end-of-bits marker and build time verify it · 84557871
      Committed by Jens Axboe
      It's not easy to tell when we're running out of bits to shove into
      req->flags, so add an end-of-bits marker and a BUILD_BUG_ON() check
      for it.
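
      A self-contained C11 illustration of the technique (the flag names are
      placeholders, and _Static_assert stands in for the kernel's
      BUILD_BUG_ON):

      #include <stdint.h>

      enum {
              REQ_F_FIXED_FILE_BIT,
              REQ_F_IO_DRAIN_BIT,
              REQ_F_LINK_BIT,
              /* ... more flag bits ... */

              /* not a real flag, just marks one past the last valid bit */
              __REQ_F_LAST_BIT,
      };

      /* Fails to compile once the bit space of a 32-bit flags word runs out. */
      _Static_assert(__REQ_F_LAST_BIT <= 8 * sizeof(uint32_t),
                     "req->flags is out of bits");

      int main(void) { return 0; }
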
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: provide means of removing buffers · 067524e9
      Committed by Jens Axboe
      We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers
      is to trigger IO on them. The usual way to shrink a buffer pool would
      be to simply not replenish the buffers when IO completes, and free them
      instead. But it may be nice to have a way to manually remove a number
      of buffers from a given group, and IORING_OP_REMOVE_BUFFERS provides
      that functionality.
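
      A hedged usage sketch (assuming a liburing version that provides the
      io_uring_prep_remove_buffers() helper; the ring is assumed to be
      initialized and group 'bgid' to have been populated earlier with
      IORING_OP_PROVIDE_BUFFERS):

      #include <liburing.h>

      /* Remove up to 'nr' unused buffers from buffer group 'bgid'.
       * Returns the number actually removed (cqe->res) or a negative errno. */
      static int shrink_buffer_group(struct io_uring *ring, int bgid, int nr)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              struct io_uring_cqe *cqe;
              int ret;

              io_uring_prep_remove_buffers(sqe, nr, bgid);
              io_uring_submit(ring);

              ret = io_uring_wait_cqe(ring, &cqe);
              if (ret < 0)
                      return ret;
              ret = cqe->res;         /* >= 0: buffers removed, < 0: error */
              io_uring_cqe_seen(ring, cqe);
              return ret;
      }
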
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG · 52de1fe1
      Committed by Jens Axboe
      Like IORING_OP_READV, this is limited to supporting just a single
      segment in the iovec passed in.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_READV · 4d954c25
      Committed by Jens Axboe
      This adds support for the vectored read. This is limited to supporting
      just a single segment in the iovec, and is provided as a convenience
      for applications that already use IORING_OP_READV.
      
      The iov helpers will be used for IORING_OP_RECVMSG as well.
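
      A hedged sketch of what the single-segment restriction means for
      callers (liburing helpers assumed, and a ring whose buffer group 'bgid'
      was filled earlier with IORING_OP_PROVIDE_BUFFERS):

      #include <stddef.h>
      #include <sys/uio.h>
      #include <liburing.h>

      /* Queue a buffer-select readv: exactly one iovec segment is allowed.
       * Its iov_base is ignored (the kernel picks a buffer from the group)
       * and iov_len caps the read size. 'iov' must stay valid until submit. */
      static void queue_readv_select(struct io_uring *ring, int fd, int bgid,
                                     struct iovec *iov, size_t max_len)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              iov->iov_base = NULL;
              iov->iov_len = max_len;
              io_uring_prep_readv(sqe, fd, iov, 1, 0);   /* exactly 1 segment */
              sqe->flags |= IOSQE_BUFFER_SELECT;
              sqe->buf_group = bgid;
      }
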
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: support buffer selection for OP_READ and OP_RECV · bcda7baa
      Committed by Jens Axboe
      If a server process has tons of pending socket connections, generally
      it uses epoll to wait for activity. When the socket is ready for reading
      (or writing), the task can select a buffer and issue a recv/send on the
      given fd.
      
      Now that we have fast (non-async thread) support, a task can have tons
      of reads or writes pending. But that means it needs buffers to back
      that data, and if the number of connections is high enough,
      preallocating buffers for all possible connections is infeasible.
      
      With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
      use for any request. The request then sets IOSQE_BUFFER_SELECT in the
      sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
      a free buffer from the specified group is selected. If none are
      available, the request is terminated with -ENOBUFS. If successful, the
      CQE on completion will contain the buffer ID chosen in the cqe->flags
      member, encoded as:
      
      	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
      
      Once a buffer has been consumed by a request, it is no longer available
      and must be registered again with IORING_OP_PROVIDE_BUFFERS.
      
      Requests need to support this feature. For now, IORING_OP_READ and
      IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
      res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
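
      As a hedged end-to-end sketch for IORING_OP_RECV (liburing helpers
      assumed; the group ID is illustrative), the request names only a buffer
      group and the completion reports which buffer was picked:

      #include <liburing.h>

      /* Receive into a kernel-selected buffer from group 'bgid'. On success,
       * returns the number of bytes received and stores the chosen buffer ID
       * in *bid; on failure (e.g. -ENOBUFS when the group is empty), returns
       * the negative error from cqe->res. */
      static int recv_select(struct io_uring *ring, int sockfd, int bgid,
                             size_t max_len, int *bid)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              struct io_uring_cqe *cqe;
              int ret;

              /* No buffer address: the kernel picks one from the group. */
              io_uring_prep_recv(sqe, sockfd, NULL, max_len, 0);
              sqe->flags |= IOSQE_BUFFER_SELECT;
              sqe->buf_group = bgid;

              io_uring_submit(ring);
              ret = io_uring_wait_cqe(ring, &cqe);
              if (ret < 0)
                      return ret;

              ret = cqe->res;
              if (ret >= 0 && (cqe->flags & IORING_CQE_F_BUFFER))
                      *bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
              io_uring_cqe_seen(ring, cqe);
              return ret;
      }
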
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IORING_OP_PROVIDE_BUFFERS · ddf0322d
      Committed by Jens Axboe
      IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
      support passing in an addr/len that is associated with a buffer ID and
      buffer group ID. The group ID is used to index and lookup the buffers,
      while the buffer ID can be used to notify the application which buffer
      in the group was used. The addr passed in is the starting buffer
      address, and length is the length of each buffer. A number of buffers
      to add can be specified, in which case addr is incremented by length
      for each addition, and the buffer ID is incremented for each added
      buffer.
      
      No validation is done of the buffer ID. If the application provides
      buffers within the same group with identical buffer IDs, then it'll have
      a hard time telling which buffer ID was used. The only restriction is
      that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
      maximum ID that can be used.
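
      A hedged sketch of the call and the resulting layout (liburing's
      io_uring_prep_provide_buffers() helper is assumed; the group ID, buffer
      size, and count are arbitrary):

      #include <liburing.h>

      /* Register 32 contiguous 4 KiB buffers as group 7 with IDs 0..31:
       * buffer i starts at pool + i * 4096, matching "addr is incremented by
       * length for each addition" and the per-buffer ID increment. */
      static int provide_pool(struct io_uring *ring, char *pool)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              struct io_uring_cqe *cqe;
              int ret;

              io_uring_prep_provide_buffers(sqe, pool, 4096, 32, 7, 0);
              io_uring_submit(ring);

              ret = io_uring_wait_cqe(ring, &cqe);
              if (ret < 0)
                      return ret;
              ret = cqe->res;         /* negative errno on failure */
              io_uring_cqe_seen(ring, cqe);
              return ret;
      }
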
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 05 Mar 2020, 2 commits
  7. 04 Mar 2020, 3 commits
  8. 03 Mar 2020, 18 commits
  9. 28 Feb 2020, 1 commit
  10. 27 Feb 2020, 2 commits
    • io_uring: define and set show_fdinfo only if procfs is enabled · bebdb65e
      Committed by Tobias Klauser
      Follow the pattern used with other *_show_fdinfo functions and only
      define and use io_uring_show_fdinfo and its helper functions if
      CONFIG_PROC_FS is set.
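
      Schematically, the pattern looks like the following (a sketch rather
      than the literal hunk; the helper body is elided):

      #ifdef CONFIG_PROC_FS
      static void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
      {
              /* dump ring state for /proc/<pid>/fdinfo/<fd> */
      }
      #endif

      static const struct file_operations io_uring_fops = {
              .release        = io_uring_release,
              .mmap           = io_uring_mmap,
      #ifdef CONFIG_PROC_FS
              .show_fdinfo    = io_uring_show_fdinfo,
      #endif
      };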
      
      Fixes: 87ce955b ("io_uring: add ->show_fdinfo() for the io_uring file descriptor")
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: drop file set ref put/get on switch · dd3db2a3
      Committed by Jens Axboe
      Dan reports that he triggered a warning on ring exit doing some testing:
      
      percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic
      WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Modules linked in:
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83
      RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292
      RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000
      RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c
      RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045
      R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0
      R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009
      FS:  0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       rcu_core+0x1e4/0x4d0
       __do_softirq+0xdb/0x2f1
       irq_exit+0xa0/0xb0
       smp_apic_timer_interrupt+0x60/0x140
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:default_idle+0x23/0x170
      Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95
      
      Turns out that this is due to percpu_ref_switch_to_atomic() only
      grabbing a reference to the percpu refcount if it's not already in
      atomic mode. io_uring drops a ref and re-gets it when switching back to
      percpu mode. We attempted to protect against this with the FFD_F_ATOMIC
      bit, but that isn't reliable.
      
      We don't actually need to juggle these refcounts across the
      atomic/percpu switch; we can just do them once we've switched to atomic
      mode. This also removes the need for the unreliable FFD_F_ATOMIC bit.
      
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 26 Feb 2020, 2 commits
    • io_uring: import_single_range() returns 0/-ERROR · 3a901598
      Committed by Jens Axboe
      Unlike the other core import helpers, import_single_range() returns 0 on
      success, not the length imported. This means that links that depend on
      the result of non-vec based IORING_OP_{READ,WRITE} that were added for
      5.5 get errored when they should not be.
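
      A schematic kernel-side sketch of the difference (not runnable on its
      own): import_iovec() returns the total byte count on success, while
      import_single_range() returns 0, so its return value must not be
      treated as the request length.

      struct iov_iter iter;
      struct iovec iov;
      ssize_t ret;

      ret = import_single_range(READ, buf, len, &iov, &iter);
      if (ret)
              return ret;                     /* only ever a negative errno */

      /* Use iov_iter_count(&iter) (or the original 'len'), not 'ret', as the
       * number of bytes for the request result and for any linked request. */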
      
      Fixes: 3a6820f2 ("io_uring: add non-vectored read/write commands")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: pick up link work on submit reference drop · 2a44f467
      Committed by Jens Axboe
      If work completes inline, then we should pick up a dependent link item
      in __io_queue_sqe() as well. If we don't do so, we're forced to go async
      with that item, which is suboptimal.
      
      This also fixes an issue with io_put_req_find_next(), which always looks
      up the next work item. That should only be done if we're dropping the
      last reference to the request, to prevent multiple lookups of the same
      work item.
      
      Outside of being a fix, this also enables a good cleanup series for 5.7,
      where we never have to pass 'nxt' around or into the work handlers.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 25 Feb 2020, 1 commit
    • io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL · bdcd3eab
      Committed by Xiaoguang Wang
      After making ext4 support the iopoll method (letting
      ext4_file_operations's iopoll method be iomap_dio_iopoll()), we found
      fio can easily hang in fio_ioring_getevents() with the fio job below:
          rm -f testfile; sync;
          sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
      -rw=write -ioengine=io_uring  -hipri=1 -sqthread_poll=1 -direct=1
      -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
      with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.
      
      There are two issues that result in this hang. The first is that when
      IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not
      use io_uring_enter to reap completed events; it relies on the kernel
      io_sq_thread to poll for them.
      
      The second is a race: when io_submit_sqes() in io_sq_thread() submits a
      batch of sqes, the variable 'inflight' records the number of submitted
      reqs, and io_sq_thread then polls for the reqs that have been added to
      poll_list. But note, if some of those reqs were punted to an io worker,
      they won't show up in poll_list right away. io_sq_thread() then polls
      only part of the submitted reqs, finds poll_list empty, and resets
      'inflight' to zero. If the app just waits for these deferred reqs and
      never wakes up io_sq_thread again, the hang occurs.
      
      For apps that rely entirely on io_sq_thread to poll completed requests,
      make io_iopoll_req_issued() wake up io_sq_thread properly when adding a
      new element to poll_list, and when io_sq_thread prepares to sleep,
      check whether poll_list is empty again; if it is not empty, continue
      polling.
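
      A schematic sketch of the wake-up side of the fix (field names follow
      fs/io_uring.c of that era, but this is an illustration rather than the
      exact hunk):

      static void io_iopoll_req_issued(struct io_kiocb *req)
      {
              struct io_ring_ctx *ctx = req->ctx;

              list_add_tail(&req->list, &ctx->poll_list);

              /* If an io-wq worker (not io_sq_thread itself) queued this
               * entry, make sure a sleeping io_sq_thread notices it. */
              if ((ctx->flags & IORING_SETUP_SQPOLL) &&
                  wq_has_sleeper(&ctx->sqo_wait))
                      wake_up(&ctx->sqo_wait);
      }

      On the sleep side, io_sq_thread re-checks whether poll_list is empty
      just before scheduling out and keeps polling if it is not.
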
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>