1. 24 2月, 2021 1 次提交
    • J
      io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS · 1c0aa1fa
      Jens Axboe 提交于
      A few reasons to do this:
      
      - The naming of the manager and worker have changed. That's a user visible
        change, so makes sense to flag it.
      
      - Opening certain files that use ->signal (like /proc/self or /dev/tty)
        now works, and the flag tells the application upfront that this is the
        case.
      
      - Related to the above, using signalfd will now work as well.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1c0aa1fa
  2. 02 2月, 2021 2 次提交
  3. 10 12月, 2020 5 次提交
  4. 24 11月, 2020 1 次提交
  5. 01 10月, 2020 4 次提交
  6. 09 7月, 2020 1 次提交
    • X
      io_uring: export cq overflow status to userspace · 6d5f9049
      Xiaoguang Wang 提交于
      For those applications which are not willing to use io_uring_enter()
      to reap and handle cqes, they may completely rely on liburing's
      io_uring_peek_cqe(), but if cq ring has overflowed, currently because
      io_uring_peek_cqe() is not aware of this overflow, it won't enter
      kernel to flush cqes, below test program can reveal this bug:
      
      static void test_cq_overflow(struct io_uring *ring)
      {
              struct io_uring_cqe *cqe;
              struct io_uring_sqe *sqe;
              int issued = 0;
              int ret = 0;
      
              do {
                      sqe = io_uring_get_sqe(ring);
                      if (!sqe) {
                              fprintf(stderr, "get sqe failed\n");
                              break;;
                      }
                      ret = io_uring_submit(ring);
                      if (ret <= 0) {
                              if (ret != -EBUSY)
                                      fprintf(stderr, "sqe submit failed: %d\n", ret);
                              break;
                      }
                      issued++;
              } while (ret > 0);
              assert(ret == -EBUSY);
      
              printf("issued requests: %d\n", issued);
      
              while (issued) {
                      ret = io_uring_peek_cqe(ring, &cqe);
                      if (ret) {
                              if (ret != -EAGAIN) {
                                      fprintf(stderr, "peek completion failed: %s\n",
                                              strerror(ret));
                                      break;
                              }
                              printf("left requets: %d\n", issued);
                              continue;
                      }
                      io_uring_cqe_seen(ring, cqe);
                      issued--;
                      printf("left requets: %d\n", issued);
              }
      }
      
      int main(int argc, char *argv[])
      {
              int ret;
              struct io_uring ring;
      
              ret = io_uring_queue_init(16, &ring, 0);
              if (ret) {
                      fprintf(stderr, "ring setup failed: %d\n", ret);
                      return 1;
              }
      
              test_cq_overflow(&ring);
              return 0;
      }
      
      To fix this issue, export cq overflow status to userspace by adding new
      IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as
      io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6d5f9049
  7. 22 6月, 2020 1 次提交
  8. 18 5月, 2020 1 次提交
  9. 16 5月, 2020 2 次提交
  10. 22 3月, 2020 1 次提交
  11. 11 3月, 2020 1 次提交
  12. 10 3月, 2020 3 次提交
    • J
      io_uring: provide means of removing buffers · 067524e9
      Jens Axboe 提交于
      We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers
      is to trigger IO on them. The usual case of shrinking a buffer pool
      would be to just not replenish the buffers when IO completes, and
      instead just free it. But it may be nice to have a way to manually
      remove a number of buffers from a given group, and
      IORING_OP_REMOVE_BUFFERS provides that functionality.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      067524e9
    • J
      io_uring: support buffer selection for OP_READ and OP_RECV · bcda7baa
      Jens Axboe 提交于
      If a server process has tons of pending socket connections, generally
      it uses epoll to wait for activity. When the socket is ready for reading
      (or writing), the task can select a buffer and issue a recv/send on the
      given fd.
      
      Now that we have fast (non-async thread) support, a task can have tons
      of pending reads or writes pending. But that means they need buffers to
      back that data, and if the number of connections is high enough, having
      them preallocated for all possible connections is unfeasible.
      
      With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
      use for any request. The request then sets IOSQE_BUFFER_SELECT in the
      sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
      a free buffer from the specified group is selected. If none are
      available, the request is terminated with -ENOBUFS. If successful, the
      CQE on completion will contain the buffer ID chosen in the cqe->flags
      member, encoded as:
      
      	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
      
      Once a buffer has been consumed by a request, it is no longer available
      and must be registered again with IORING_OP_PROVIDE_BUFFERS.
      
      Requests need to support this feature. For now, IORING_OP_READ and
      IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
      res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bcda7baa
    • J
      io_uring: add IORING_OP_PROVIDE_BUFFERS · ddf0322d
      Jens Axboe 提交于
      IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
      support passing in an addr/len that is associated with a buffer ID and
      buffer group ID. The group ID is used to index and lookup the buffers,
      while the buffer ID can be used to notify the application which buffer
      in the group was used. The addr passed in is the starting buffer address,
      and length is each buffer length. A number of buffers to add with can be
      specified, in which case addr is incremented by length for each addition,
      and each buffer increments the buffer ID specified.
      
      No validation is done of the buffer ID. If the application provides
      buffers within the same group with identical buffer IDs, then it'll have
      a hard time telling which buffer ID was used. The only restriction is
      that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
      maximum ID that can be used.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ddf0322d
  13. 03 3月, 2020 2 次提交
    • J
      io_uring: use poll driven retry for files that support it · d7718a9d
      Jens Axboe 提交于
      Currently io_uring tries any request in a non-blocking manner, if it can,
      and then retries from a worker thread if we get -EAGAIN. Now that we have
      a new and fancy poll based retry backend, use that to retry requests if
      the file supports it.
      
      This means that, for example, an IORING_OP_RECVMSG on a socket no longer
      requires an async thread to complete the IO. If we get -EAGAIN reading
      from the socket in a non-blocking manner, we arm a poll handler for
      notification on when the socket becomes readable. When it does, the
      pending read is executed directly by the task again, through the io_uring
      task work handlers. Not only is this faster and more efficient, it also
      means we're not generating potentially tons of async threads that just
      sit and block, waiting for the IO to complete.
      
      The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
      pollable IO is fast, and that poll<link>other_op is fast as well.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d7718a9d
    • P
      io_uring: add splice(2) support · 7d67af2c
      Pavel Begunkov 提交于
      Add support for splice(2).
      
      - output file is specified as sqe->fd, so it's handled by generic code
      - hash_reg_file handled by generic code as well
      - len is 32bit, but should be fine
      - the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which
      is a splice flag (i.e. sqe->splice_flags).
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7d67af2c
  14. 30 1月, 2020 1 次提交
  15. 29 1月, 2020 4 次提交
  16. 21 1月, 2020 10 次提交
    • P
      io_uring: optimise sqe-to-req flags translation · 6b47ee6e
      Pavel Begunkov 提交于
      For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there
      is a repetitive pattern of their translation:
      e.g. if (sqe->flags & SQE_FLAG*) req->flags |= REQ_F_FLAG*
      
      Use same numeric values/bits for them and copy instead of manual
      handling.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6b47ee6e
    • J
      io_uring: add support for probing opcodes · 66f4af93
      Jens Axboe 提交于
      The application currently has no way of knowing if a given opcode is
      supported or not without having to try and issue one and see if we get
      -EINVAL or not. And even this approach is fraught with peril, as maybe
      we're getting -EINVAL due to some fields being missing, or maybe it's
      just not that easy to issue that particular command without doing some
      other leg work in terms of setup first.
      
      This adds IORING_REGISTER_PROBE, which fills in a structure with info
      on what it supported or not. This will work even with sparse opcode
      fields, which may happen in the future or even today if someone
      backports specific features to older kernels.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      66f4af93
    • J
      io_uring: add support for IORING_OP_OPENAT2 · cebdb986
      Jens Axboe 提交于
      Add support for the new openat2(2) system call. It's trivial to do, as
      we can have openat(2) just be wrapped around it.
      Suggested-by: NStefan Metzmacher <metze@samba.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      cebdb986
    • J
      io_uring: enable option to only trigger eventfd for async completions · f2842ab5
      Jens Axboe 提交于
      If an application is using eventfd notifications with poll to know when
      new SQEs can be issued, it's expecting the following read/writes to
      complete inline. And with that, it knows that there are events available,
      and don't want spurious wakeups on the eventfd for those requests.
      
      This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like
      IORING_REGISTER_EVENTFD, except it only triggers notifications for events
      that happen from async completions (IRQ, or io-wq worker completions).
      Any completions inline from the submission itself will not trigger
      notifications.
      Suggested-by: NMark Papadakis <markuspapadakis@icloud.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f2842ab5
    • J
      io_uring: add support for send(2) and recv(2) · fddaface
      Jens Axboe 提交于
      This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for
      recv(2) support.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fddaface
    • J
      io_uring: add support for IORING_SETUP_CLAMP · 8110c1a6
      Jens Axboe 提交于
      Some applications like to start small in terms of ring size, and then
      ramp up as needed. This is a bit tricky to do currently, since we don't
      advertise the max ring size.
      
      This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring
      size exceed what we support, then clamp them at the max values instead
      of returning -EINVAL. Since we return the chosen ring sizes after setup,
      no further changes are needed on the application side. io_uring already
      changes the ring sizes if the application doesn't ask for power-of-two
      sizes, for example.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8110c1a6
    • J
      io_uring: add IORING_OP_MADVISE · c1ca757b
      Jens Axboe 提交于
      This adds support for doing madvise(2) through io_uring. We assume that
      any operation can block, and hence punt everything async. This could be
      improved, but hard to make bullet proof. The async punt ensures it's
      safe.
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c1ca757b
    • J
      io_uring: add IORING_OP_FADVISE · 4840e418
      Jens Axboe 提交于
      This adds support for doing fadvise through io_uring. We assume that
      WILLNEED doesn't block, but that DONTNEED may block.
      Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4840e418
    • J
      io_uring: allow use of offset == -1 to mean file position · ba04291e
      Jens Axboe 提交于
      This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
      update) the current file position. This obviously comes with the caveat
      that if the application has multiple read/writes in flight, then the
      end result will not be as expected. This is similar to threads sharing
      a file descriptor and doing IO using the current file position.
      
      Since this feature isn't easily detectable by doing a read or write,
      add a feature flags, IORING_FEAT_RW_CUR_POS, to allow applications to
      detect presence of this feature.
      Reported-by: N李通洲 <carter.li@eoitek.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ba04291e
    • J
      io_uring: add non-vectored read/write commands · 3a6820f2
      Jens Axboe 提交于
      For uses cases that don't already naturally have an iovec, it's easier
      (or more convenient) to just use a buffer address + length. This is
      particular true if the use case is from languages that want to create
      a memory safe abstraction on top of io_uring, and where introducing
      the need for the iovec may impose an ownership issue. For those cases,
      they currently need an indirection buffer, which means allocating data
      just for this purpose.
      
      Add basic read/write that don't require the iovec.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3a6820f2