1. 09 Jul 2020, 1 commit
    • io_uring: export cq overflow status to userspace · 6d5f9049
      Committed by Xiaoguang Wang
      Applications that are unwilling to use io_uring_enter() to reap and
      handle cqes may rely entirely on liburing's io_uring_peek_cqe(). But
      if the cq ring has overflowed, io_uring_peek_cqe() is currently not
      aware of it and won't enter the kernel to flush cqes. The test
      program below reveals this bug:
      
      #include <assert.h>
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>

      #include <liburing.h>

      static void test_cq_overflow(struct io_uring *ring)
      {
              struct io_uring_cqe *cqe;
              struct io_uring_sqe *sqe;
              int issued = 0;
              int ret = 0;
      
              do {
                      sqe = io_uring_get_sqe(ring);
                      if (!sqe) {
                              fprintf(stderr, "get sqe failed\n");
                              break;
                      }
                      ret = io_uring_submit(ring);
                      if (ret <= 0) {
                              if (ret != -EBUSY)
                                      fprintf(stderr, "sqe submit failed: %d\n", ret);
                              break;
                      }
                      issued++;
              } while (ret > 0);
              assert(ret == -EBUSY);
      
              printf("issued requests: %d\n", issued);
      
              while (issued) {
                      ret = io_uring_peek_cqe(ring, &cqe);
                      if (ret) {
                              if (ret != -EAGAIN) {
                                      fprintf(stderr, "peek completion failed: %s\n",
                                              strerror(ret));
                                      break;
                              }
                              printf("left requets: %d\n", issued);
                              continue;
                      }
                      io_uring_cqe_seen(ring, cqe);
                      issued--;
                      printf("left requets: %d\n", issued);
              }
      }
      
      int main(int argc, char *argv[])
      {
              int ret;
              struct io_uring ring;
      
              ret = io_uring_queue_init(16, &ring, 0);
              if (ret) {
                      fprintf(stderr, "ring setup failed: %d\n", ret);
                      return 1;
              }
      
              test_cq_overflow(&ring);
              return 0;
      }
      
      To fix this issue, export the cq overflow status to userspace by
      adding a new IORING_SQ_CQ_OVERFLOW flag. Helper functions in liburing,
      such as io_uring_peek_cqe(), can then detect the overflow and flush
      accordingly.
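
      A minimal sketch of how such a liburing helper might use the new flag
      (the helper name is hypothetical, and reading ring->sq.kflags directly
      assumes liburing's internal ring layout):

      #include <liburing.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      static int peek_cqe_flush_aware(struct io_uring *ring,
                                      struct io_uring_cqe **cqe_ptr)
      {
              int ret = io_uring_peek_cqe(ring, cqe_ptr);

              /* The kernel sets IORING_SQ_CQ_OVERFLOW in the sq ring flags
               * while cqes are backlogged; entering the kernel with
               * GETEVENTS flushes them into the CQ ring. */
              if (ret == -EAGAIN &&
                  (*ring->sq.kflags & IORING_SQ_CQ_OVERFLOW)) {
                      syscall(__NR_io_uring_enter, ring->ring_fd, 0, 0,
                              IORING_ENTER_GETEVENTS, NULL, 0);
                      ret = io_uring_peek_cqe(ring, cqe_ptr);
              }
              return ret;
      }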
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 18 May 2020, 1 commit
  3. 16 May 2020, 2 commits
  4. 22 Mar 2020, 1 commit
  5. 11 Mar 2020, 1 commit
  6. 10 Mar 2020, 3 commits
    • io_uring: provide means of removing buffers · 067524e9
      Committed by Jens Axboe
      We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers
      is to trigger IO on them. The usual case of shrinking a buffer pool
      would be to just not replenish the buffers when IO completes, and
      instead just free it. But it may be nice to have a way to manually
      remove a number of buffers from a given group, and
      IORING_OP_REMOVE_BUFFERS provides that functionality.
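
      A hedged sketch of shrinking a pool this way, assuming liburing's
      io_uring_prep_remove_buffers() helper (group ID 1 is an arbitrary
      example value):

      #include <liburing.h>

      /* Remove up to 8 unconsumed buffers from group 1. */
      static int shrink_group(struct io_uring *ring)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              struct io_uring_cqe *cqe;
              int removed;

              io_uring_prep_remove_buffers(sqe, 8, /*bgid=*/1);
              io_uring_submit(ring);
              io_uring_wait_cqe(ring, &cqe);
              removed = cqe->res;  /* buffers actually removed, or -ENOENT */
              io_uring_cqe_seen(ring, cqe);
              return removed;
      }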
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: support buffer selection for OP_READ and OP_RECV · bcda7baa
      Committed by Jens Axboe
      If a server process has tons of pending socket connections, generally
      it uses epoll to wait for activity. When the socket is ready for reading
      (or writing), the task can select a buffer and issue a recv/send on the
      given fd.
      
      Now that we have fast (non-async thread) support, a task can have tons
      of reads or writes pending. But that means they need buffers to
      back that data, and if the number of connections is high enough, having
      them preallocated for all possible connections is unfeasible.
      
      With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
      use for any request. The request then sets IOSQE_BUFFER_SELECT in the
      sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
      a free buffer from the specified group is selected. If none are
      available, the request is terminated with -ENOBUFS. If successful, the
      CQE on completion will contain the buffer ID chosen in the cqe->flags
      member, encoded as:
      
      	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
      
      Once a buffer has been consumed by a request, it is no longer available
      and must be registered again with IORING_OP_PROVIDE_BUFFERS.
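
      As an illustration, a sketch of a buffer-select recv and the cqe
      decoding (assumes a liburing with buffer-select support; group ID 1
      is arbitrary):

      #include <liburing.h>

      static int recv_with_buffer_select(struct io_uring *ring, int sockfd,
                                         unsigned int len)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              struct io_uring_cqe *cqe;
              int buf_id = -1;

              io_uring_prep_recv(sqe, sockfd, NULL, len, 0);
              sqe->flags |= IOSQE_BUFFER_SELECT;  /* kernel picks a buffer */
              sqe->buf_group = 1;                 /* ... from this group */
              io_uring_submit(ring);

              io_uring_wait_cqe(ring, &cqe);
              if (cqe->res >= 0 && (cqe->flags & IORING_CQE_F_BUFFER))
                      buf_id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
              io_uring_cqe_seen(ring, cqe);
              return buf_id;  /* buffer holding the data, or -1 */
      }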
      
      Requests need to support this feature. For now, IORING_OP_READ and
      IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
      res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IORING_OP_PROVIDE_BUFFERS · ddf0322d
      Committed by Jens Axboe
      IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
      support passing in an addr/len that is associated with a buffer ID and
      buffer group ID. The group ID is used to index and lookup the buffers,
      while the buffer ID can be used to notify the application which buffer
      in the group was used. The addr passed in is the starting buffer
      address, and length is the per-buffer length. The number of buffers to
      add can be specified, in which case addr is incremented by length for
      each addition, and the buffer ID is incremented for each buffer.
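
      A sketch of providing a group of buffers, assuming liburing's
      io_uring_prep_provide_buffers() helper (with raw sqes you would fill
      in the fields per the description above):

      #include <liburing.h>
      #include <stdlib.h>

      /* Provide 8 buffers of 4096 bytes each to group 1, with IDs 0..7. */
      static int provide_buffers(struct io_uring *ring)
      {
              int nbufs = 8, buf_len = 4096, ret;
              void *mem = malloc((size_t)nbufs * buf_len);
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              struct io_uring_cqe *cqe;

              /* addr advances by len, and the ID by one, per buffer */
              io_uring_prep_provide_buffers(sqe, mem, buf_len, nbufs,
                                            /*bgid=*/1, /*bid=*/0);
              io_uring_submit(ring);
              io_uring_wait_cqe(ring, &cqe);
              ret = cqe->res;
              io_uring_cqe_seen(ring, cqe);
              return ret;  /* >= 0 on success */
      }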
      
      No validation is done of the buffer ID. If the application provides
      buffers within the same group with identical buffer IDs, then it'll have
      a hard time telling which buffer ID was used. The only restriction is
      that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
      maximum ID that can be used.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 03 Mar 2020, 2 commits
    • io_uring: use poll driven retry for files that support it · d7718a9d
      Committed by Jens Axboe
      Currently io_uring tries any request in a non-blocking manner, if it can,
      and then retries from a worker thread if we get -EAGAIN. Now that we have
      a new and fancy poll based retry backend, use that to retry requests if
      the file supports it.
      
      This means that, for example, an IORING_OP_RECVMSG on a socket no longer
      requires an async thread to complete the IO. If we get -EAGAIN reading
      from the socket in a non-blocking manner, we arm a poll handler for
      notification on when the socket becomes readable. When it does, the
      pending read is executed directly by the task again, through the io_uring
      task work handlers. Not only is this faster and more efficient, it also
      means we're not generating potentially tons of async threads that just
      sit and block, waiting for the IO to complete.
      
      The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
      pollable IO is fast, and that poll<link>other_op is fast as well.
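
      A short sketch of detecting the feature at setup time (liburing
      calls, minimal error handling):

      #include <liburing.h>

      static int have_fast_poll(void)
      {
              struct io_uring_params p = { 0 };
              struct io_uring ring;
              int ok;

              if (io_uring_queue_init_params(8, &ring, &p) < 0)
                      return 0;
              /* if set, pollable reads/recvs won't need async threads */
              ok = (p.features & IORING_FEAT_FAST_POLL) != 0;
              io_uring_queue_exit(&ring);
              return ok;
      }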
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add splice(2) support · 7d67af2c
      Committed by Pavel Begunkov
      Add support for splice(2).
      
      - output file is specified as sqe->fd, so it's handled by generic code
      - hash_reg_file handled by generic code as well
      - len is 32bit, but should be fine
      - fd_in is a registered file when SPLICE_F_FD_IN_FIXED is set in the
        splice flags (i.e. sqe->splice_flags); see the sketch below
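
      A hedged sketch, assuming liburing's io_uring_prep_splice() helper:

      #include <liburing.h>

      static int queue_splice(struct io_uring *ring, int fd_in, int fd_out,
                              unsigned int nbytes)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              /* -1 offsets mean "use and advance the file position" */
              io_uring_prep_splice(sqe, fd_in, -1, fd_out, -1, nbytes, 0);
              return io_uring_submit(ring);
      }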
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 30 Jan 2020, 1 commit
  9. 29 Jan 2020, 4 commits
  10. 21 Jan 2020, 17 commits
    • io_uring: optimise sqe-to-req flags translation · 6b47ee6e
      Committed by Pavel Begunkov
      For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there
      is a repetitive pattern of their translation:
      e.g. if (sqe->flags & SQE_FLAG*) req->flags |= REQ_F_FLAG*
      
      Use same numeric values/bits for them and copy instead of manual
      handling.
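
      An illustrative sketch of the idea (not the kernel's exact code):
      once the userspace IOSQE_* bits and the internal REQ_F_* bits share
      values, translation collapses to one masked copy:

      #include <linux/io_uring.h>

      #define SQE_VALID_FLAGS  (IOSQE_FIXED_FILE | IOSQE_IO_DRAIN | \
                                IOSQE_IO_LINK | IOSQE_IO_HARDLINK | \
                                IOSQE_ASYNC)

      static unsigned int sqe_to_req_flags(unsigned int sqe_flags)
      {
              /* same bit positions on both sides: no per-flag branches */
              return sqe_flags & SQE_VALID_FLAGS;
      }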
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for probing opcodes · 66f4af93
      Committed by Jens Axboe
      The application currently has no way of knowing if a given opcode is
      supported or not without having to try and issue one and see if we get
      -EINVAL or not. And even this approach is fraught with peril, as maybe
      we're getting -EINVAL due to some fields being missing, or maybe it's
      just not that easy to issue that particular command without doing some
      other leg work in terms of setup first.
      
      This adds IORING_REGISTER_PROBE, which fills in a structure with info
      on what is supported or not. This will work even with sparse opcode
      fields, which may happen in the future or even today if someone
      backports specific features to older kernels.
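
      For example, a sketch using liburing's probe helpers to check a
      single opcode:

      #include <liburing.h>

      static int opcode_supported(int op)
      {
              struct io_uring_probe *probe = io_uring_get_probe();
              int ok;

              if (!probe)
                      return 0;  /* kernel without IORING_REGISTER_PROBE */
              ok = io_uring_opcode_supported(probe, op);
              io_uring_free_probe(probe);
              return ok;
      }

      /* usage: opcode_supported(IORING_OP_OPENAT2) */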
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_OPENAT2 · cebdb986
      Committed by Jens Axboe
      Add support for the new openat2(2) system call. It's trivial to do, as
      we can have openat(2) just be wrapped around it.
      Suggested-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: enable option to only trigger eventfd for async completions · f2842ab5
      Committed by Jens Axboe
      If an application is using eventfd notifications with poll to know when
      new SQEs can be issued, it expects the following reads/writes to
      complete inline. It then knows that events are available, and doesn't
      want spurious wakeups on the eventfd for those requests.
      
      This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like
      IORING_REGISTER_EVENTFD, except it only triggers notifications for events
      that happen from async completions (IRQ, or io-wq worker completions).
      Any completions inline from the submission itself will not trigger
      notifications.
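
      A sketch of registering such an eventfd via liburing:

      #include <liburing.h>
      #include <sys/eventfd.h>

      static int register_async_eventfd(struct io_uring *ring)
      {
              int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);

              if (efd < 0)
                      return -1;
              /* inline completions at submit time won't signal efd */
              if (io_uring_register_eventfd_async(ring, efd) < 0)
                      return -1;
              return efd;
      }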
      Suggested-by: Mark Papadakis <markuspapadakis@icloud.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for send(2) and recv(2) · fddaface
      Committed by Jens Axboe
      This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for
      recv(2) support.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_SETUP_CLAMP · 8110c1a6
      Committed by Jens Axboe
      Some applications like to start small in terms of ring size, and then
      ramp up as needed. This is a bit tricky to do currently, since we don't
      advertise the max ring size.
      
      This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring
      size exceed what we support, then clamp them at the max values instead
      of returning -EINVAL. Since we return the chosen ring sizes after setup,
      no further changes are needed on the application side. io_uring already
      changes the ring sizes if the application doesn't ask for power-of-two
      sizes, for example.
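
      A sketch of using the flag; the oversized request is clamped rather
      than rejected:

      #include <liburing.h>

      static int init_clamped(struct io_uring *ring)
      {
              struct io_uring_params p = { .flags = IORING_SETUP_CLAMP };
              int ret = io_uring_queue_init_params(1 << 20, ring, &p);

              /* on success, p.sq_entries and p.cq_entries hold the
               * actual, possibly clamped ring sizes */
              return ret;
      }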
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IORING_OP_MADVISE · c1ca757b
      Committed by Jens Axboe
      This adds support for doing madvise(2) through io_uring. We assume that
      any operation can block, and hence punt everything async. This could be
      improved, but hard to make bullet proof. The async punt ensures it's
      safe.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IORING_OP_FADVISE · 4840e418
      Committed by Jens Axboe
      This adds support for doing fadvise through io_uring. We assume that
      WILLNEED doesn't block, but that DONTNEED may block.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow use of offset == -1 to mean file position · ba04291e
      Committed by Jens Axboe
      This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
      update) the current file position. This obviously comes with the caveat
      that if the application has multiple read/writes in flight, then the
      end result will not be as expected. This is similar to threads sharing
      a file descriptor and doing IO using the current file position.
      
      Since this feature isn't easily detectable by doing a read or write,
      add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to
      detect its presence.
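
      A sketch of the feature check plus a current-position read (liburing
      calls; features comes from struct io_uring_params after setup, and an
      offset of -1 selects the file position):

      #include <liburing.h>

      static int queue_read_curpos(struct io_uring *ring, int fd, void *buf,
                                   unsigned int len, unsigned int features)
      {
              struct io_uring_sqe *sqe;

              if (!(features & IORING_FEAT_RW_CUR_POS))
                      return -1;  /* kernel predates this feature */
              sqe = io_uring_get_sqe(ring);
              io_uring_prep_read(sqe, fd, buf, len, (__u64)-1);
              return io_uring_submit(ring);
      }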
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add non-vectored read/write commands · 3a6820f2
      Committed by Jens Axboe
      For use cases that don't already naturally have an iovec, it's easier
      (or more convenient) to just use a buffer address + length. This is
      particularly true for languages that want to create a memory-safe
      abstraction on top of io_uring, where introducing the need for an
      iovec may impose an ownership issue. For those cases,
      they currently need an indirection buffer, which means allocating data
      just for this purpose.
      
      Add basic read/write that don't require the iovec.
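
      A sketch of the non-vectored form via liburing: a buffer and length
      directly, no iovec:

      #include <liburing.h>

      static int queue_write(struct io_uring *ring, int fd, const void *buf,
                             unsigned int len, __u64 offset)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              /* IORING_OP_WRITE: addr/len instead of an iovec */
              io_uring_prep_write(sqe, fd, buf, len, offset);
              return io_uring_submit(ring);
      }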
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IOSQE_ASYNC · ce35a47a
      Committed by Jens Axboe
      io_uring defaults to always doing inline submissions, if at all
      possible. But for larger copies, even if the data is fully cached, that
      can take a long time. Add an IOSQE_ASYNC flag that the application can
      set on the SQE - if set, it'll ensure that we always go async for those
      kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
      get the concurrency we desire for this case.
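
      A sketch of forcing a large buffered read to async context:

      #include <liburing.h>

      static int queue_async_read(struct io_uring *ring, int fd, void *buf,
                                  unsigned int len, __u64 offset)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              io_uring_prep_read(sqe, fd, buf, len, offset);
              sqe->flags |= IOSQE_ASYNC;  /* skip the inline attempt */
              return io_uring_submit(ring);
      }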
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_STATX · eddc7ef5
      Committed by Jens Axboe
      This provides support for async statx(2) through io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: avoid ring quiesce for fixed file set unregister and update · 05f3fb3c
      Committed by Jens Axboe
      We currently fully quiesce the ring before an unregister or update of
      the fixed fileset. This is very expensive, and we can be a bit smarter
      about this.
      
      Add a percpu refcount for the file tables as a whole. Grab a percpu ref
      when we use a registered file, and put it on completion. This is cheap
      to do. Upon removal of a file from a set, switch the ref count to atomic
      mode. When we hit zero ref on the completion side, then we know we can
      drop the previously registered files. When the old files have been
      dropped, switch the ref back to percpu mode for normal operation.
      
      Since there's a period between doing the update and the kernel being
      done with it, add an IORING_OP_FILES_UPDATE opcode that can perform the
      same action. The application knows the update has completed when it gets
      the CQE for it. Between doing the update and receiving this completion,
      the application must continue to use the unregistered fd if submitting
      IO on this particular file.
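
      A sketch of the opcode via liburing's helper, replacing the fd in
      slot 3 of the registered set (the slot number is arbitrary):

      #include <liburing.h>

      static int update_slot(struct io_uring *ring, int newfd)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              int fds[1] = { newfd };

              io_uring_prep_files_update(sqe, fds, 1, /*offset=*/3);
              /* the update is complete once this request's CQE arrives */
              return io_uring_submit(ring);
      }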
      
      This takes the runtime of test/file-register from liburing from 14s to
      about 0.7s.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_CLOSE · b5dba59e
      Committed by Jens Axboe
      This works just like close(2), unsurprisingly. We remove the file
      descriptor and post the completion inline, then offload the actual
      (potential) last file put to async context.
      
      Mark the async part of this work as uncancellable, as we really must
      guarantee that the latter part of the close is run.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_OPENAT · 15b71abe
      Committed by Jens Axboe
      This works just like openat(2), except it can be performed async. For
      the normal case of a non-blocking path lookup this will complete
      inline. If we have to do IO to perform the open, it'll be done from
      async context.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for fallocate() · d63d1b5e
      Committed by Jens Axboe
      This exposes fallocate(2) through io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix compat for IORING_REGISTER_FILES_UPDATE · 1292e972
      Committed by Eugene Syromiatnikov
      The fds field of struct io_uring_files_update is problematic with
      regard to compat user space, as pointer size differs between 32-bit,
      32-on-64-bit, and 64-bit user space.  In order to avoid custom
      handling of compat in
      the syscall implementation, make fds __u64 and use u64_to_user_ptr in
      order to retrieve it.  Also, align the field naturally and check that
      no garbage is passed there.
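
      For reference, the resulting uapi layout is pointer-size independent
      (sketch of the struct as described above):

      struct io_uring_files_update {
              __u32 offset;
              __u32 resv;          /* must be zero */
              __aligned_u64 fds;   /* user pointer, fetched in the kernel
                                      via u64_to_user_ptr() */
      };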
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 12 Dec 2019, 1 commit
    • io_uring: ensure we return -EINVAL on unknown opcode · 9e3aa61a
      Committed by Jens Axboe
      If we submit an unknown opcode and have fd == -1, io_op_needs_file()
      will return true as we default to needing a file. Then when we go and
      assign the file, we find the 'fd' invalid and return -EBADF. We really
      should be returning -EINVAL for that case, as we normally do for
      unsupported opcodes.
      
      Change io_op_needs_file() to have the following return values:
      
      0   - does not need a file
      1   - does need a file
      < 0 - error value
      
      and use this to pass back the right value for this invalid case.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 11 Dec 2019, 1 commit
    • io_uring: allow unbreakable links · 4e88d6e7
      Committed by Jens Axboe
      Some commands will invariably end in a failure, in the sense that the
      completion result will be less than zero. One such example is a
      timeout that doesn't have a completion count set: it will always
      complete with -ETIME unless cancelled.
      
      For linked commands, we sever links and fail the rest of the chain if
      the result is less than zero. Since we have commands where we know that
      will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
      regardless of the completion result. Note that the link will still sever
      if we fail submitting the parent request, hard links are only resilient
      in the presence of completion results for requests that did submit
      correctly.
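
      A sketch of setting the stronger link on a timeout whose expected
      -ETIME completion should not sever the chain:

      #include <liburing.h>

      static void queue_hard_linked_timeout(struct io_uring *ring,
                                            struct __kernel_timespec *ts)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              io_uring_prep_timeout(sqe, ts, 0, 0);
              /* -ETIME on this request won't sever following links */
              sqe->flags |= IOSQE_IO_HARDLINK;
      }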
      
      Cc: stable@vger.kernel.org # v5.4
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 03 Dec 2019, 1 commit
  14. 26 Nov 2019, 1 commit
    • io_uring: add support for IORING_OP_CONNECT · f8e85cf2
      Committed by Jens Axboe
      This allows an application to call connect() in an async fashion. Like
      other opcodes, we first try a non-blocking connect, then punt to async
      context if we have to.
      
      Note that we can still return -EINPROGRESS, and in that case the caller
      should use IORING_OP_POLL_ADD to do an async wait for completion of the
      connect request (just like for regular connect(2), except we can do it
      async here too).
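
      A sketch of an async connect via liburing:

      #include <liburing.h>
      #include <sys/socket.h>

      static int queue_connect(struct io_uring *ring, int sockfd,
                               const struct sockaddr *addr, socklen_t len)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              io_uring_prep_connect(sqe, sockfd, addr, len);
              return io_uring_submit(ring);  /* result arrives in cqe->res */
      }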
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 10 Nov 2019, 1 commit
    • io_uring: add support for backlogged CQ ring · 1d7bb1d5
      Committed by Jens Axboe
      Currently we drop completion events, if the CQ ring is full. That's fine
      for requests with bounded completion times, but it may make it harder or
      impossible to use io_uring with networked IO where request completion
      times are generally unbounded. Or with POLL, for example, which is also
      unbounded.
      
      After this patch, we never overflow the ring, we simply store requests
      in a backlog for later flushing. This flushing is done automatically by
      the kernel. To prevent the backlog from growing indefinitely, if the
      backlog is non-empty, we apply back pressure on IO submissions. Any
      attempt to submit new IO with a non-empty backlog will get an -EBUSY
      return from the kernel. This is a signal to the application that it has
      backlogged CQ events, and that it must reap those before being allowed
      to submit more IO.
      
      Note that if we do return -EBUSY, we will have filled whatever
      backlogged events into the CQ ring first, if there's room. This means
      the application can safely reap events WITHOUT entering the kernel and
      waiting for them, they are already available in the CQ ring.
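
      A sketch of the submission-side handling, with handle_cqe() as a
      hypothetical application callback:

      #include <liburing.h>

      static int submit_with_backpressure(struct io_uring *ring,
                                          void (*handle_cqe)(struct io_uring_cqe *))
      {
              struct io_uring_cqe *cqe;
              int ret = io_uring_submit(ring);

              if (ret == -EBUSY) {
                      /* backlogged cqes were already flushed into the CQ
                       * ring where room existed; reap without entering */
                      while (io_uring_peek_cqe(ring, &cqe) == 0) {
                              handle_cqe(cqe);
                              io_uring_cqe_seen(ring, cqe);
                      }
                      ret = io_uring_submit(ring);
              }
              return ret;
      }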
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 08 Nov 2019, 1 commit
    • io_uring: add support for linked SQE timeouts · 2665abfd
      Committed by Jens Axboe
      While we have support for generic timeouts, we don't have a way to tie
      a timeout to a specific SQE. The generic timeouts simply trigger wakeups
      on the CQ ring.
      
      This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
      as a link to a previous command. The specified timeout can be either
      relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
      the timeout triggers before the dependent command completes, it will
      attempt to cancel that command. Likewise, if the dependent command
      completes before the timeout triggers, it will cancel the timeout.
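
      A sketch of a recv guarded by a one-second linked timeout (liburing
      helpers; the timespec must stay valid until completion, hence static
      here):

      #include <liburing.h>

      static int queue_recv_with_timeout(struct io_uring *ring, int fd,
                                         void *buf, unsigned int len)
      {
              static struct __kernel_timespec ts = { .tv_sec = 1 };
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              io_uring_prep_recv(sqe, fd, buf, len, 0);
              sqe->flags |= IOSQE_IO_LINK;  /* next sqe is linked to this */

              sqe = io_uring_get_sqe(ring);
              io_uring_prep_link_timeout(sqe, &ts, 0);
              return io_uring_submit(ring);
      }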
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. 01 Nov 2019, 1 commit
    • io_uring: support for generic async request cancel · 62755e35
      Committed by Jens Axboe
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
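
      A sketch of issuing a cancel keyed on user_data (liburing's prep
      helper has taken it as a pointer-sized value in some versions, hence
      the cast):

      #include <liburing.h>
      #include <stdint.h>

      static int queue_cancel(struct io_uring *ring, __u64 user_data)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              io_uring_prep_cancel(sqe, (void *)(uintptr_t)user_data, 0);
              /* its cqe->res will be 0, -EALREADY or -ENOENT as above */
              return io_uring_submit(ring);
      }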
      Signed-off-by: Jens Axboe <axboe@kernel.dk>