1. 21 Jan, 2020 13 commits
    • io_uring: add support for send(2) and recv(2) · fddaface
      Jens Axboe committed
      This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for
      recv(2) support.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_SETUP_CLAMP · 8110c1a6
      Jens Axboe committed
      Some applications like to start small in terms of ring size, and then
      ramp up as needed. This is a bit tricky to do currently, since we don't
      advertise the max ring size.
      
      This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring
      size exceed what we support, then clamp them at the max values instead
      of returning -EINVAL. Since we return the chosen ring sizes after setup,
      no further changes are needed on the application side. io_uring already
      changes the ring sizes if the application doesn't ask for power-of-two
      sizes, for example.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
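The clamping rule from the IORING_SETUP_CLAMP commit above can be modeled in a few lines. This is only a sketch, not the kernel code: `pick_sq_entries`, `SKETCH_SETUP_CLAMP` and `SKETCH_MAX_ENTRIES` are hypothetical names, and the flag bit and the 32768 limit are assumptions that should be checked against the kernel's headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed values for illustration; check your kernel's uapi header. */
#define SKETCH_SETUP_CLAMP (1U << 4)
#define SKETCH_MAX_ENTRIES 32768U

/* Round up to the next power of two (callers pass 1..SKETCH_MAX_ENTRIES). */
static uint32_t roundup_pow2(uint32_t v)
{
    uint32_t p = 1;
    while (p < v)
        p <<= 1;
    return p;
}

/*
 * Model of SQ size selection: returns the ring size that would be
 * reported back to the application, or -EINVAL if the request is rejected.
 */
static long pick_sq_entries(uint32_t requested, uint32_t flags)
{
    if (requested == 0)
        return -EINVAL;
    if (requested > SKETCH_MAX_ENTRIES) {
        if (!(flags & SKETCH_SETUP_CLAMP))
            return -EINVAL;              /* old behavior: reject */
        requested = SKETCH_MAX_ENTRIES;  /* new behavior: clamp */
    }
    return (long)roundup_pow2(requested);
}
```

An application that wants to start small and grow can request an oversized ring with the clamp flag set and then read the chosen size back, instead of probing for the limit by trial and error.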
    • io_uring: add IORING_OP_MADVISE · c1ca757b
      Jens Axboe committed
      This adds support for doing madvise(2) through io_uring. We assume that
      any operation can block, and hence punt everything async. This could be
      improved, but it is hard to make bulletproof. The async punt ensures it's
      safe.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IORING_OP_FADVISE · 4840e418
      Jens Axboe committed
      This adds support for doing fadvise through io_uring. We assume that
      WILLNEED doesn't block, but that DONTNEED may block.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow use of offset == -1 to mean file position · ba04291e
      Jens Axboe committed
      This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
      update) the current file position. This obviously comes with the caveat
      that if the application has multiple read/writes in flight, then the
      end result will not be as expected. This is similar to threads sharing
      a file descriptor and doing IO using the current file position.
      
      Since this feature isn't easily detectable by doing a read or write,
      add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to
      detect presence of this feature.
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
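Feature detection for the offset == -1 behavior described above might look like the following sketch. `choose_offset` and the `SKETCH_` names are hypothetical, and the bit value is an assumption to be verified against `<linux/io_uring.h>`; the `features` word is the one the kernel reports back in `io_uring_params`.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the uapi bit; verify against <linux/io_uring.h>. */
#define SKETCH_FEAT_RW_CUR_POS (1U << 3)

/* Offset value that requests "use and update the current file position". */
#define SKETCH_CUR_POS ((uint64_t)-1)

/*
 * Pick the offset to place in a read/write request: rely on the file
 * position if the kernel advertises support, otherwise fall back to an
 * offset the application tracks itself.
 */
static uint64_t choose_offset(uint32_t features, uint64_t tracked_off)
{
    if (features & SKETCH_FEAT_RW_CUR_POS)
        return SKETCH_CUR_POS;
    return tracked_off;
}
```

The fallback matters because a kernel without the feature would treat -1 as a literal offset rather than as a request to use the file position.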
    • io_uring: add non-vectored read/write commands · 3a6820f2
      Jens Axboe committed
      For use cases that don't already naturally have an iovec, it's easier
      (or more convenient) to just use a buffer address + length. This is
      particularly true if the use case is from languages that want to create
      a memory safe abstraction on top of io_uring, and where introducing
      the need for the iovec may impose an ownership issue. For those cases,
      they currently need an indirection buffer, which means allocating data
      just for this purpose.
      
      Add basic read/write that don't require the iovec.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IOSQE_ASYNC · ce35a47a
      Jens Axboe committed
      io_uring defaults to always doing inline submissions, if at all
      possible. But for larger copies, even if the data is fully cached, that
      can take a long time. Add an IOSQE_ASYNC flag that the application can
      set on the SQE - if set, it'll ensure that we always go async for those
      kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
      get the concurrency we desire for this case.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_STATX · eddc7ef5
      Jens Axboe committed
      This provides support for async statx(2) through io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: avoid ring quiesce for fixed file set unregister and update · 05f3fb3c
      Jens Axboe committed
      We currently fully quiesce the ring before an unregister or update of
      the fixed fileset. This is very expensive, and we can be a bit smarter
      about this.
      
      Add a percpu refcount for the file tables as a whole. Grab a percpu ref
      when we use a registered file, and put it on completion. This is cheap
      to do. Upon removal of a file from a set, switch the ref count to atomic
      mode. When we hit zero ref on the completion side, then we know we can
      drop the previously registered files. When the old files have been
      dropped, switch the ref back to percpu mode for normal operation.
      
      Since there's a period between doing the update and the kernel being
      done with it, add an IORING_OP_FILES_UPDATE opcode that can perform the
      same action. The application knows the update has completed when it gets
      the CQE for it. Between doing the update and receiving this completion,
      the application must continue to use the unregistered fd if submitting
      IO on this particular file.
      
      This takes the runtime of test/file-register from liburing from 14s to
      about 0.7s.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_CLOSE · b5dba59e
      Jens Axboe committed
      This works just like close(2), unsurprisingly. We remove the file
      descriptor and post the completion inline, then offload the actual
      (potential) last file put to async context.
      
      Mark the async part of this work as uncancellable, as we really must
      guarantee that the latter part of the close is run.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_OPENAT · 15b71abe
      Jens Axboe committed
      This works just like openat(2), except it can be performed async. For
      the normal case of a non-blocking path lookup this will complete
      inline. If we have to do IO to perform the open, it'll be done from
      async context.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for fallocate() · d63d1b5e
      Jens Axboe committed
      This exposes fallocate(2) through io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix compat for IORING_REGISTER_FILES_UPDATE · 1292e972
      Eugene Syromiatnikov committed
      fds field of struct io_uring_files_update is problematic with regards
      to compat user space, as pointer size is different in 32-bit, 32-on-64-bit,
      and 64-bit user space.  In order to avoid custom handling of compat in
      the syscall implementation, make fds __u64 and use u64_to_user_ptr in
      order to retrieve it.  Also, align the field naturally and check that
      no garbage is passed there.
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 12 Dec, 2019 1 commit
    • io_uring: ensure we return -EINVAL on unknown opcode · 9e3aa61a
      Jens Axboe committed
      If we submit an unknown opcode and have fd == -1, io_op_needs_file()
      will return true as we default to needing a file. Then when we go and
      assign the file, we find the 'fd' invalid and return -EBADF. We really
      should be returning -EINVAL for that case, as we normally do for
      unsupported opcodes.
      
      Change io_op_needs_file() to have the following return values:
      
      0   - does not need a file
      1   - does need a file
      < 0 - error value
      
      and use this to pass back the right value for this invalid case.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
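The three-way return convention described in the commit above (0, 1, or a negative error) can be illustrated with a small sketch. The opcode enum and both function names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical opcodes for illustration; not the kernel's numbering. */
enum sketch_op { SK_OP_NOP, SK_OP_READV, SK_OP_TIMEOUT, SK_OP_LAST };

/* Returns 0 (no file needed), 1 (file needed), or a negative errno. */
static int sketch_op_needs_file(int opcode)
{
    switch (opcode) {
    case SK_OP_NOP:
    case SK_OP_TIMEOUT:
        return 0;
    case SK_OP_READV:
        return 1;
    default:
        return -EINVAL;  /* unknown opcode: report it before looking at fd */
    }
}

/* Caller: the opcode error takes precedence over a bad descriptor. */
static int sketch_prep(int opcode, int fd)
{
    int ret = sketch_op_needs_file(opcode);

    if (ret <= 0)
        return ret;
    return fd < 0 ? -EBADF : 0;
}
```

With a boolean return, the unknown-opcode case would have fallen into the "needs a file" path and surfaced as -EBADF, which is the behavior the commit corrects.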
  3. 11 Dec, 2019 1 commit
    • io_uring: allow unbreakable links · 4e88d6e7
      Jens Axboe committed
      Some commands will invariably end in a failure in the sense that the
      completion result will be less than zero. One such example is timeouts
      that don't have a completion count set, they will always complete with
      -ETIME unless cancelled.
      
      For linked commands, we sever links and fail the rest of the chain if
      the result is less than zero. Since we have commands where we know that
      will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
      regardless of the completion result. Note that the link will still sever
      if we fail submitting the parent request, hard links are only resilient
      in the presence of completion results for requests that did submit
      correctly.
      
      Cc: stable@vger.kernel.org # v5.4
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
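The difference between a normal link and the hard link introduced above can be modeled with a toy chain runner. This is a simplified sketch: `sk_req`, `sk_link` and `sk_run_chain` are hypothetical, the model only covers completion results, and it ignores the submission-failure case the commit message mentions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sk_link { SK_LINK_NONE, SK_LINK_SOFT, SK_LINK_HARD };

struct sk_req {
    enum sk_link link;  /* how this request links to the next one */
    int result;         /* completion result it would produce if run */
    int cqe_res;        /* filled in by sk_run_chain() */
};

/* Run one chain in order; a failing soft link cancels everything after it. */
static void sk_run_chain(struct sk_req *reqs, size_t n)
{
    int severed = 0;

    for (size_t i = 0; i < n; i++) {
        if (severed) {
            reqs[i].cqe_res = -ECANCELED;
            continue;
        }
        reqs[i].cqe_res = reqs[i].result;
        /* Only a soft link is severed by a negative completion result. */
        if (reqs[i].result < 0 && reqs[i].link == SK_LINK_SOFT)
            severed = 1;
    }
}
```

A timeout that always completes with -ETIME is the motivating case: with a soft link the request after it is cancelled, with a hard link it still runs.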
  4. 03 Dec, 2019 1 commit
  5. 26 Nov, 2019 1 commit
    • io_uring: add support for IORING_OP_CONNECT · f8e85cf2
      Jens Axboe committed
      This allows an application to call connect() in an async fashion. Like
      other opcodes, we first try a non-blocking connect, then punt to async
      context if we have to.
      
      Note that we can still return -EINPROGRESS, and in that case the caller
      should use IORING_OP_POLL_ADD to do an async wait for completion of the
      connect request (just like for regular connect(2), except we can do it
      async here too).
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 10 Nov, 2019 1 commit
    • io_uring: add support for backlogged CQ ring · 1d7bb1d5
      Jens Axboe committed
      Currently we drop completion events, if the CQ ring is full. That's fine
      for requests with bounded completion times, but it may make it harder or
      impossible to use io_uring with networked IO where request completion
      times are generally unbounded. Or with POLL, for example, which is also
      unbounded.
      
      After this patch, we never overflow the ring, we simply store requests
      in a backlog for later flushing. This flushing is done automatically by
      the kernel. To prevent the backlog from growing indefinitely, if the
      backlog is non-empty, we apply back pressure on IO submissions. Any
      attempt to submit new IO with a non-empty backlog will get an -EBUSY
      return from the kernel. This is a signal to the application that it has
      backlogged CQ events, and that it must reap those before being allowed
      to submit more IO.
      
      Note that if we do return -EBUSY, we will have filled whatever
      backlogged events into the CQ ring first, if there's room. This means
      the application can safely reap events WITHOUT entering the kernel and
      waiting for them, they are already available in the CQ ring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
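The backlog and back-pressure behavior described above can be sketched as a small single-threaded model. All names here (`sk_ring`, `sk_complete`, `sk_submit`, `sk_reap`) are hypothetical, the fixed-size backlog array is a simplification, and the model silently ignores backlog exhaustion.

```c
#include <assert.h>
#include <errno.h>

#define SK_CQ_SIZE 2  /* tiny ring so overflow is easy to trigger */
#define SK_BACKLOG 8  /* simplification: fixed array instead of a list */

struct sk_ring {
    int cq[SK_CQ_SIZE];
    int cq_count;
    int backlog[SK_BACKLOG];
    int bl_count;
};

/* Move backlogged completions into the CQ ring while there is room. */
static void sk_flush(struct sk_ring *r)
{
    int moved = 0;

    while (moved < r->bl_count && r->cq_count < SK_CQ_SIZE)
        r->cq[r->cq_count++] = r->backlog[moved++];
    for (int i = moved; i < r->bl_count; i++)
        r->backlog[i - moved] = r->backlog[i];
    r->bl_count -= moved;
}

/* Post a completion: into the ring if possible, else into the backlog. */
static void sk_complete(struct sk_ring *r, int res)
{
    if (r->bl_count == 0 && r->cq_count < SK_CQ_SIZE)
        r->cq[r->cq_count++] = res;
    else if (r->bl_count < SK_BACKLOG)
        r->backlog[r->bl_count++] = res;
}

/* Submission gate: flush first, then refuse new IO while backlogged. */
static int sk_submit(struct sk_ring *r)
{
    sk_flush(r);
    return r->bl_count ? -EBUSY : 0;
}

/* Application side: reap everything currently in the CQ ring. */
static int sk_reap(struct sk_ring *r)
{
    int n = r->cq_count;

    r->cq_count = 0;
    return n;
}
```

In this model, three completions against a two-entry ring leave one event backlogged, the next submission attempt returns -EBUSY, and once the application reaps the ring the following attempt flushes the backlog and succeeds.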
  7. 08 Nov, 2019 1 commit
    • io_uring: add support for linked SQE timeouts · 2665abfd
      Jens Axboe committed
      While we have support for generic timeouts, we don't have a way to tie
      a timeout to a specific SQE. The generic timeouts simply trigger wakeups
      on the CQ ring.
      
      This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
      as a link to a previous command. The timeout specified can be either
      relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
      the timeout triggers before the dependent command completes, it will
      attempt to cancel that command. Likewise, if the dependent command
      completes before the timeout triggers, it will cancel the timeout.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 01 Nov, 2019 1 commit
    • io_uring: support for generic async request cancel · 62755e35
      Jens Axboe committed
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 30 Oct, 2019 5 commits
    • io_uring: add support for IORING_OP_ACCEPT · 17f2fe35
      Jens Axboe committed
      This allows an application to call accept4() in an async fashion. Like
      other opcodes, we first try a non-blocking accept, then punt to async
      context if we have to.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for canceling timeout requests · 11365043
      Jens Axboe committed
      We might have cases where the need for a specific timeout is gone, add
      support for canceling an existing timeout operation. This works like the
      POLL_REMOVE command, where the application passes in the user_data of
      the timeout it wishes to cancel in the sqe->addr field.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for absolute timeouts · a41525ab
      Jens Axboe committed
      This is a pretty trivial addition on top of the relative timeouts
      we have now, but it's handy for ensuring tighter timing for those
      that are building scheduling primitives on top of io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow application controlled CQ ring size · 33a107f0
      Jens Axboe committed
      We currently size the CQ ring as twice the SQ ring, to allow some
      flexibility in not overflowing the CQ ring. This is done because the
      SQE life time is different than that of the IO request itself, the SQE
      is consumed as soon as the kernel has seen the entry.
      
      Certain applications don't need a huge SQ ring size, since they just
      submit IO in batches. But they may have a lot of requests pending, and
      hence need a big CQ ring to hold them all. By allowing the application
      to control the CQ ring size multiplier, we can cater to those
      applications more efficiently.
      
      If an application wants to define its own CQ ring size, it must set
      IORING_SETUP_CQSIZE in the setup flags, and fill out
      io_uring_params->cq_entries. The value must be a power of two.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
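The CQ sizing rule from the commit above (twice the SQ ring by default, or an application-chosen power of two) can be sketched as follows. `sk_cq_entries` and `SK_SETUP_CQSIZE` are hypothetical names, the flag bit is an assumption, and the model checks only the rule stated in the commit message; the kernel may apply further checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SK_SETUP_CQSIZE (1U << 3)  /* assumed to mirror IORING_SETUP_CQSIZE */

static int sk_is_pow2(uint32_t v)
{
    return v && !(v & (v - 1));
}

/* Returns the CQ ring size, or -EINVAL for a rejected request. */
static long sk_cq_entries(uint32_t sq_entries, uint32_t flags,
                          uint32_t cq_entries)
{
    if (!(flags & SK_SETUP_CQSIZE))
        return 2L * sq_entries;  /* default: twice the SQ ring */
    if (!sk_is_pow2(cq_entries))
        return -EINVAL;          /* must be a power of two */
    return cq_entries;
}
```

This fits the batching case from the commit message: a small SQ ring of 8 entries can be paired with a CQ ring of 1024 entries to hold many pending completions.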
    • io_uring: add support for IORING_REGISTER_FILES_UPDATE · c3a31e60
      Jens Axboe committed
      Allows the application to remove/replace/add files to/from a file set.
      Passes in a struct:
      
      struct io_uring_files_update {
      	__u32 offset;
      	__s32 *fds;
      };
      
      that holds an array of fds, size of array passed in through the usual
      nr_args part of the io_uring_register() system call. The logic is as
      follows:
      
      1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
         the set.
      2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
         replaced with ->fds[i].
      
      For case #2, if the existing file is currently empty (fd == -1), the
      new fd is simply added to the array.
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
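The two update cases listed above can be modeled on a plain array of descriptors. This is a userspace sketch, not the kernel logic: `sk_files_update` is a hypothetical helper, the table stores fd numbers rather than file references, and rejecting out-of-range updates with -EINVAL is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Apply fds[0..nr) to table[offset..offset+nr).
 * A value of -1 removes the slot; a valid fd replaces it, or fills it
 * if the slot was empty.
 */
static int sk_files_update(int32_t *table, size_t table_len,
                           uint32_t offset, const int32_t *fds, size_t nr)
{
    /* Assumption: updates reaching past the table are rejected. */
    if (offset > table_len || nr > table_len - offset)
        return -EINVAL;
    for (size_t i = 0; i < nr; i++)
        table[offset + i] = fds[i];
    return 0;
}
```

Starting from the set {10, 11, -1, 13}, an update of {-1, 42} at offset 1 removes the file in slot 1 and adds fd 42 to the empty slot 2, leaving {10, -1, 42, 13}.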
  10. 19 Sep, 2019 1 commit
    • io_uring: IORING_OP_TIMEOUT support · 5262f567
      Jens Axboe committed
      There have been a few requests for functionality similar to io_getevents()
      and epoll_wait(), where the user can specify a timeout for waiting on
      events. I deliberately did not add support for this through the system
      call initially to avoid overloading the args, but I can see that the use
      cases for this are valid.
      
      This adds support for IORING_OP_TIMEOUT. If a user wants to get woken
      when waiting for events, simply submit one of these timeout commands
      with your wait call (or before). This ensures that the application
      sleeping on the CQ ring waiting for events will get woken. The timeout
      command is passed in as a pointer to a struct timespec. Timeouts are
      relative. The timeout command also includes a way to auto-cancel after
      N events have passed.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 07 Sep, 2019 1 commit
    • io_uring: expose single mmap capability · ac90f249
      Jens Axboe committed
      After commit 75b28aff we can get by with just a single mmap to
      map both the sq and cq ring. However, userspace doesn't know that.
      
      Add a features variable to io_uring_params, and notify userspace
      that the kernel has this ability. This can then be used in liburing
      (or in applications directly) to avoid the second mmap.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 10 Jul, 2019 2 commits
  13. 24 Jun, 2019 1 commit
    • io_uring: add support for sqe links · 9e645e11
      Jens Axboe committed
      With SQE links, we can create chains of dependent SQEs. One example
      would be queueing an SQE that's a read from one file descriptor, with
      the linked SQE being a write to another with the same set of buffers.
      
      An SQE link will not stall the pipeline, it'll just ensure that
      dependent SQEs aren't issued before the previous link has completed.
      
      Any error at submission or completion time will break the chain of SQEs.
      For completions, this also includes short reads or writes, as the next
      SQE could depend on the previous one being fully completed.
      
      Any SQE in a chain that gets canceled due to any of the above errors
      will get a CQE filled with -ECANCELED as the error value.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 03 May, 2019 3 commits
    • io_uring: add support for eventfd notifications · 9b402849
      Jens Axboe committed
      Allow registration of an eventfd, which will trigger an event every
      time a completion event happens for this io_uring instance.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_SYNC_FILE_RANGE · 5d17b4a4
      Jens Axboe committed
      This behaves just like sync_file_range(2) does.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for marking commands as draining · de0617e4
      Jens Axboe committed
      There are no ordering constraints between the submission and completion
      side of io_uring. But sometimes that would be useful to have. One common
      example is doing an fsync, for instance, and have it ordered with
      previous writes. Without support for that, the application must do this
      tracking itself.
      
      This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked
      with this flag, then it will not be issued before previous commands have
      completed, and subsequent commands submitted after the drain will not be
      issued before the drain is started. If there are no pending commands,
      setting this flag will not change the behavior of the issue of the
      command.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 07 Mar, 2019 1 commit
    • io_uring: add support for IORING_OP_POLL · 221c5eb2
      Jens Axboe committed
      This is basically a direct port of bfe4037e, which implements a
      one-shot poll command through aio. Description below is based on that
      commit as well. However, instead of adding a POLL command and relying
      on io_cancel(2) to remove it, we mimic the epoll(2) interface of
      having a command to add a poll notification, IORING_OP_POLL_ADD,
      and one to remove it again, IORING_OP_POLL_REMOVE.
      
      To poll for a file descriptor the application should submit an sqe of
      type IORING_OP_POLL_ADD. It will poll the fd for the events specified in the
      poll_events field.
      
      Unlike poll or epoll without EPOLLONESHOT this interface always works in
      one shot mode, that is once the sqe is completed, it will have to be
      resubmitted.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Based-on-code-from: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 28 Feb, 2019 6 commits
    • io_uring: add submission polling · 6c271ce2
      Jens Axboe committed
      This enables an application to do IO, without ever entering the kernel.
      By using the SQ ring to fill in new sqes and watching for completions
      on the CQ ring, we can submit and reap IOs without doing a single system
      call. The kernel side thread will poll for new submissions, and in case
      of HIPRI/polled IO, it'll also poll for completions.
      
      By default, we allow 1 second of active spinning. This can be changed
      by passing in a different grace period at io_uring_setup(2) time.
      If the thread exceeds this idle time without having any work to do, it
      will set:
      
      sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
      
      The application will have to call io_uring_enter() to start things back
      up again. If IO is kept busy, that will never be needed. Basically an
      application that has this feature enabled will guard its
      io_uring_enter(2) call with:
      
      read_barrier();
      if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
      	io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
      
      instead of calling it unconditionally.
      
      It's mandatory to use fixed files with this feature. Failure to do so
      will result in the application getting an -EBADF CQ entry when
      submitting IO.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
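The wakeup guard from the submission polling commit above could be expressed with C11 atomics roughly as follows. This is a sketch under assumptions: `sk_sq_needs_wakeup` is a hypothetical helper, the flag value is assumed to mirror IORING_SQ_NEED_WAKEUP, and the acquire load is one way to express the ordered read that the commit message writes as read_barrier(); a real application would apply it to the flags word in the mmap'ed SQ ring.

```c
#include <assert.h>
#include <stdatomic.h>

#define SK_SQ_NEED_WAKEUP (1U << 0)  /* assumed to mirror IORING_SQ_NEED_WAKEUP */

/*
 * Returns nonzero when the kernel-side poller has gone idle, meaning the
 * application must call io_uring_enter() with IORING_ENTER_SQ_WAKEUP
 * before its new submissions will be noticed.
 */
static int sk_sq_needs_wakeup(_Atomic unsigned *sq_flags)
{
    unsigned flags = atomic_load_explicit(sq_flags, memory_order_acquire);

    return (flags & SK_SQ_NEED_WAKEUP) != 0;
}
```

While IO is kept busy the flag stays clear and the check costs a single load, which is what lets the application avoid the system call entirely.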
    • io_uring: add file set registration · 6b06314c
      Jens Axboe committed
      We normally have to fget/fput for each IO we do on a file. Even with
      the batching we do, the cost of the atomic inc/dec of the file usage
      count adds up.
      
      This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
      for the io_uring_register(2) system call. The arguments passed in must
      be an array of __s32 holding file descriptors, and nr_args should hold
      the number of file descriptors the application wishes to pin for the
      duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
      called).
      
      When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
      member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
      to the index in the array passed in to IORING_REGISTER_FILES.
      
      Files are automatically unregistered when the io_uring instance is torn
      down. An application need only unregister if it wishes to register a new
      set of fds.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for pre-mapped user IO buffers · edafccee
      Jens Axboe committed
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
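The requirement above that the IO range fall inside the registered buffer amounts to an interval-containment check. A minimal sketch, with `sk_range_in_buffer` as a hypothetical helper rather than the kernel's own validation:

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if [addr, addr + len) lies inside [base, base + buf_len). */
static int sk_range_in_buffer(uint64_t base, uint64_t buf_len,
                              uint64_t addr, uint64_t len)
{
    if (addr < base || len > buf_len)
        return 0;
    /* Written as a subtraction so that addr + len cannot overflow. */
    return addr - base <= buf_len - len;
}
```

This also covers the partial-use case the commit message calls out: an IO over a sub-range of a larger registered buffer passes, while one that runs past the end of the mapping does not.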
    • io_uring: support for IO polling · def596e9
      Jens Axboe committed
      Add support for a polled io_uring instance. When a read or write is
      submitted to a polled io_uring, the application must poll for
      completions on the CQ ring through io_uring_enter(2). Polled IO may not
      generate IRQ completions, hence they need to be actively found by the
      application itself.
      
      To use polling, io_uring_setup() must be used with the
      IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
      polled and non-polled IO on an io_uring.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add fsync support · c992fe29
      Christoph Hellwig committed
      Add a new fsync opcode, which either syncs a range if one is passed,
      or the whole file if the offset and length fields are both cleared
      to zero.  A flag is provided to use fdatasync semantics, that is, to
      force out only the metadata required to retrieve the file data, but
      not other metadata such as timestamps.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Add io_uring IO interface · 2b188cc1
      Jens Axboe committed
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
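The SQ ring indirection described in the commit above, where the ring holds indices into the io_uring_sqe array, can be modeled in a few lines. This is a single-threaded sketch with hypothetical names (`sk_sq`, `sk_sq_push`, `sk_sq_pop`); it omits the memory barriers a real shared ring needs, and an `int` payload stands in for an io_uring_sqe.

```c
#include <assert.h>
#include <stdint.h>

#define SK_ENTRIES 4               /* power of two, so masking works */
#define SK_MASK    (SK_ENTRIES - 1)

struct sk_sq {
    uint32_t head, tail;           /* free-running counters */
    uint32_t array[SK_ENTRIES];    /* ring of indices into sqes[] */
    int      sqes[SK_ENTRIES];     /* stand-in for io_uring_sqe entries */
};

/* Producer: publish sqe slot idx; returns 0, or -1 if the ring is full. */
static int sk_sq_push(struct sk_sq *sq, uint32_t idx)
{
    if (sq->tail - sq->head == SK_ENTRIES)
        return -1;
    sq->array[sq->tail & SK_MASK] = idx;
    sq->tail++;
    return 0;
}

/* Consumer: fetch the next submitted payload; returns -1 if empty. */
static int sk_sq_pop(struct sk_sq *sq, int *out)
{
    if (sq->head == sq->tail)
        return -1;
    /* The extra mask only keeps a bad index inside sqes[] in this model. */
    *out = sq->sqes[sq->array[sq->head & SK_MASK] & SK_MASK];
    sq->head++;
    return 0;
}
```

Because the ring stores indices, the producer can submit slot 3 and then slot 1 in that order, which is the "batch of IOs without them being contiguous" property the commit message describes.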