  1. 29 Jan, 2020 3 commits
    • io_uring: allow registering credentials · 071698e1
      Jens Axboe committed
      If an application wants to use a ring with different kinds of
      credentials, it can register them upfront. We don't look up credentials;
      the credentials of the task calling IORING_REGISTER_PERSONALITY are used.
      
      An 'id' is returned for the application to use in subsequent personality
      support.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add io-wq workqueue sharing · 24369c2e
      Pavel Begunkov committed
      If IORING_SETUP_ATTACH_WQ is set, wq_fd in io_uring_params is expected
      to be a valid io_uring fd, whose io-wq will be shared with the newly
      created io_uring instance. If the flag is set but the io-wq can't be
      shared, setup fails.
      
      This allows creation of "sibling" io_urings, where we prefer to keep the
      SQ/CQ private, but want to share the async backend to minimize the amount
      of overhead associated with having multiple rings that belong to the same
      backend.
      Reported-by: Jens Axboe <axboe@kernel.dk>
      Reported-by: Daurnimator <quae@daurnimator.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/io-wq: don't use static creds/mm assignments · cccf0ee8
      Jens Axboe committed
      We currently set up the io_wq with a static set of mm and creds. Even for
      a single-use io-wq per io_uring, this is suboptimal as we may have
      multiple enters of the ring. For sharing the io-wq backend, it doesn't
      work at all.
      
      Switch to passing in the creds and mm when the work item is set up. This
      means that async work is no longer deferred to the io_uring mm and creds,
      it is done with the current mm and creds.
      
      Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know
      they can rely on the current personality (mm and creds) being the same
      for direct issue and async issue.
      Reviewed-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 21 Jan, 2020 17 commits
    • io_uring: optimise sqe-to-req flags translation · 6b47ee6e
      Pavel Begunkov committed
      For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there
      is a repetitive pattern of their translation:
      e.g. if (sqe->flags & SQE_FLAG*) req->flags |= REQ_F_FLAG*
      
      Use the same numeric values/bits for both and copy them instead of
      handling each flag manually.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
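
The mask-and-copy idea in this commit can be sketched as follows. The flag names and bit values here are illustrative assumptions, not the kernel's definitions; the point is only that when both flag sets share bit positions, the per-flag branches collapse into a single masked copy.

```c
#include <assert.h>
#include <stdint.h>

/* Submission-side flag bits (hypothetical names and values for this sketch). */
#define SQE_FLAG_FIXED_FILE (1u << 0)
#define SQE_FLAG_IO_DRAIN   (1u << 1)
#define SQE_FLAG_IO_LINK    (1u << 2)
#define SQE_VALID_FLAGS \
    (SQE_FLAG_FIXED_FILE | SQE_FLAG_IO_DRAIN | SQE_FLAG_IO_LINK)

/* Request-side flags deliberately reuse the same bit positions. */
#define REQ_FLAG_FIXED_FILE SQE_FLAG_FIXED_FILE
#define REQ_FLAG_IO_DRAIN   SQE_FLAG_IO_DRAIN
#define REQ_FLAG_IO_LINK    SQE_FLAG_IO_LINK

static_assert(REQ_FLAG_IO_LINK == SQE_FLAG_IO_LINK, "bits must line up");

/* Repetitive pattern: one branch per flag. */
uint32_t translate_branchy(uint8_t sqe_flags)
{
    uint32_t req_flags = 0;

    if (sqe_flags & SQE_FLAG_FIXED_FILE)
        req_flags |= REQ_FLAG_FIXED_FILE;
    if (sqe_flags & SQE_FLAG_IO_DRAIN)
        req_flags |= REQ_FLAG_IO_DRAIN;
    if (sqe_flags & SQE_FLAG_IO_LINK)
        req_flags |= REQ_FLAG_IO_LINK;
    return req_flags;
}

/* Shared bit values: mask off unknown bits and copy the rest. */
uint32_t translate_copy(uint8_t sqe_flags)
{
    return sqe_flags & SQE_VALID_FLAGS;
}
```

Both functions return the same value for every possible 8-bit input, which is what makes the substitution safe.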
    • io_uring: add support for probing opcodes · 66f4af93
      Jens Axboe committed
      The application currently has no way of knowing if a given opcode is
      supported or not without having to try and issue one and see if we get
      -EINVAL or not. And even this approach is fraught with peril, as maybe
      we're getting -EINVAL due to some fields being missing, or maybe it's
      just not that easy to issue that particular command without doing some
      other leg work in terms of setup first.
      
      This adds IORING_REGISTER_PROBE, which fills in a structure with info
      on what is supported or not. This will work even with sparse opcode
      fields, which may happen in the future or even today if someone
      backports specific features to older kernels.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
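
As a rough illustration of how an application might consult a probe result, here is a minimal sketch. The struct layout, the `sketch_` names, and the flag value are assumptions made for the example and are not the kernel's actual definitions; the sketch only shows that a lookup can treat an opcode that is out of range or absent from a sparse table as unsupported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_OP_SUPPORTED (1u << 0) /* hypothetical "supported" bit */

struct sketch_probe_op {
    uint8_t op;       /* opcode this entry describes */
    uint16_t flags;   /* SKETCH_OP_SUPPORTED if usable */
};

struct sketch_probe {
    uint8_t last_op;                   /* highest opcode known */
    uint8_t ops_len;                   /* number of entries in ops */
    const struct sketch_probe_op *ops; /* possibly sparse table */
};

/* Returns true only if the opcode has an entry marked as supported. */
bool sketch_opcode_supported(const struct sketch_probe *p, uint8_t opcode)
{
    assert(p != NULL);
    if (opcode > p->last_op)
        return false;
    for (size_t i = 0; i < p->ops_len; i++) {
        if (p->ops[i].op == opcode)
            return (p->ops[i].flags & SKETCH_OP_SUPPORTED) != 0;
    }
    return false; /* opcode missing from a sparse table */
}
```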
    • io_uring: add support for IORING_OP_OPENAT2 · cebdb986
      Jens Axboe committed
      Add support for the new openat2(2) system call. It's trivial to do, as
      we can have openat(2) just be wrapped around it.
      Suggested-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: enable option to only trigger eventfd for async completions · f2842ab5
      Jens Axboe committed
      If an application is using eventfd notifications with poll to know when
      new SQEs can be issued, it's expecting the following read/writes to
      complete inline. And with that, it knows that there are events available,
      and doesn't want spurious wakeups on the eventfd for those requests.
      
      This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like
      IORING_REGISTER_EVENTFD, except it only triggers notifications for events
      that happen from async completions (IRQ, or io-wq worker completions).
      Any completions inline from the submission itself will not trigger
      notifications.
      Suggested-by: Mark Papadakis <markuspapadakis@icloud.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for send(2) and recv(2) · fddaface
      Jens Axboe committed
      This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for
      recv(2) support.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_SETUP_CLAMP · 8110c1a6
      Jens Axboe committed
      Some applications like to start small in terms of ring size, and then
      ramp up as needed. This is a bit tricky to do currently, since we don't
      advertise the max ring size.
      
      This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring
      size exceed what we support, then clamp them at the max values instead
      of returning -EINVAL. Since we return the chosen ring sizes after setup,
      no further changes are needed on the application side. io_uring already
      changes the ring sizes if the application doesn't ask for power-of-two
      sizes, for example.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
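
The clamping behaviour described in this commit can be modelled with a small function. This is a sketch under stated assumptions: the maximum of 32768 entries and all `sketch_` names are chosen for illustration, and the real setup path validates more than this.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_ENTRIES 32768u /* assumed upper limit for this sketch */

/* Smallest power of two that is >= n (n is at most SKETCH_MAX_ENTRIES). */
uint32_t sketch_roundup_pow2(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Picks the ring size for a request. Without clamping, an oversized request
 * fails with -EINVAL; with clamping, it is reduced to the maximum instead.
 * The chosen size is reported back through *chosen.
 */
int sketch_pick_entries(uint32_t requested, bool clamp, uint32_t *chosen)
{
    assert(chosen);
    if (requested == 0)
        return -EINVAL;
    if (requested > SKETCH_MAX_ENTRIES) {
        if (!clamp)
            return -EINVAL;
        requested = SKETCH_MAX_ENTRIES;
    }
    *chosen = sketch_roundup_pow2(requested);
    return 0;
}
```

Because the chosen size is written back either way, a caller that already reads the returned sizes needs no extra logic to cope with clamping.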
    • io_uring: add IORING_OP_MADVISE · c1ca757b
      Jens Axboe committed
      This adds support for doing madvise(2) through io_uring. We assume that
      any operation can block, and hence punt everything async. This could be
      improved, but it's hard to make bulletproof. The async punt ensures it's
      safe.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IORING_OP_FADVISE · 4840e418
      Jens Axboe committed
      This adds support for doing fadvise through io_uring. We assume that
      WILLNEED doesn't block, but that DONTNEED may block.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow use of offset == -1 to mean file position · ba04291e
      Jens Axboe committed
      This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
      update) the current file position. This obviously comes with the caveat
      that if the application has multiple read/writes in flight, then the
      end result will not be as expected. This is similar to threads sharing
      a file descriptor and doing IO using the current file position.
      
      Since this feature isn't easily detectable by doing a read or write,
      add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to
      detect presence of this feature.
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
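
The offset rule above can be sketched in a few lines. The `sketch_` names and the simplified file model are assumptions for illustration only; the sketch shows that an explicit offset leaves the file position alone, while an offset of -1 both uses and advances it.

```c
#include <assert.h>
#include <stdint.h>

/* offset == -1, seen as an unsigned 64-bit value */
#define SKETCH_USE_FILE_POS UINT64_MAX

/* Hypothetical stand-in for an open file with a current position. */
struct sketch_file {
    int64_t pos;
};

/*
 * Returns the offset at which an I/O of nbytes starts. The file position is
 * advanced only when the submitted offset is -1.
 */
int64_t sketch_start_offset(struct sketch_file *f, uint64_t sqe_off,
                            int64_t nbytes)
{
    assert(f && nbytes >= 0);
    if (sqe_off == SKETCH_USE_FILE_POS) {
        int64_t start = f->pos;

        f->pos += nbytes;
        return start;
    }
    return (int64_t)sqe_off;
}
```

With several requests in flight, each one reads and bumps the shared position in whatever order it happens to run, which is the caveat the commit message warns about.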
    • io_uring: add non-vectored read/write commands · 3a6820f2
      Jens Axboe committed
      For use cases that don't already naturally have an iovec, it's easier
      (or more convenient) to just use a buffer address + length. This is
      particularly true if the use case is from languages that want to create
      a memory safe abstraction on top of io_uring, and where introducing
      the need for the iovec may impose an ownership issue. For those cases,
      they currently need an indirection buffer, which means allocating data
      just for this purpose.
      
      Add basic read/write that don't require the iovec.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add IOSQE_ASYNC · ce35a47a
      Jens Axboe committed
      io_uring defaults to always doing inline submissions, if at all
      possible. But for larger copies, even if the data is fully cached, that
      can take a long time. Add an IOSQE_ASYNC flag that the application can
      set on the SQE - if set, it'll ensure that we always go async for those
      kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
      get the concurrency we desire for this case.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_STATX · eddc7ef5
      Jens Axboe committed
      This provides support for async statx(2) through io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: avoid ring quiesce for fixed file set unregister and update · 05f3fb3c
      Jens Axboe committed
      We currently fully quiesce the ring before an unregister or update of
      the fixed fileset. This is very expensive, and we can be a bit smarter
      about this.
      
      Add a percpu refcount for the file tables as a whole. Grab a percpu ref
      when we use a registered file, and put it on completion. This is cheap
      to do. Upon removal of a file from a set, switch the ref count to atomic
      mode. When we hit zero ref on the completion side, then we know we can
      drop the previously registered files. When the old files have been
      dropped, switch the ref back to percpu mode for normal operation.
      
      Since there's a period between doing the update and the kernel being
      done with it, add an IORING_OP_FILES_UPDATE opcode that can perform the
      same action. The application knows the update has completed when it gets
      the CQE for it. Between doing the update and receiving this completion,
      the application must continue to use the unregistered fd if submitting
      IO on this particular file.
      
      This takes the runtime of test/file-register from liburing from 14s to
      about 0.7s.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_CLOSE · b5dba59e
      Jens Axboe committed
      This works just like close(2), unsurprisingly. We remove the file
      descriptor and post the completion inline, then offload the actual
      (potential) last file put to async context.
      
      Mark the async part of this work as uncancellable, as we really must
      guarantee that the latter part of the close is run.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_OPENAT · 15b71abe
      Jens Axboe committed
      This works just like openat(2), except it can be performed async. For
      the normal case of a non-blocking path lookup this will complete
      inline. If we have to do IO to perform the open, it'll be done from
      async context.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for fallocate() · d63d1b5e
      Jens Axboe committed
      This exposes fallocate(2) through io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix compat for IORING_REGISTER_FILES_UPDATE · 1292e972
      Eugene Syromiatnikov committed
      The fds field of struct io_uring_files_update is problematic with regard
      to compat user space, as pointer size is different in 32-bit, 32-on-64-bit,
      and 64-bit user space.  In order to avoid custom handling of compat in
      the syscall implementation, make fds __u64 and use u64_to_user_ptr in
      order to retrieve it.  Also, align the field naturally and check that
      no garbage is passed there.
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
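
A minimal sketch of the layout idea in this fix: carry the user pointer as a 64-bit integer so the structure has the same size and field offsets for 32-bit and 64-bit callers. The struct and function names are hypothetical and only mirror the description above; the kernel side would use u64_to_user_ptr rather than the plain cast shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fixed-up layout described above. */
struct sketch_files_update {
    uint32_t offset;
    uint32_t resv; /* padding that must be zero */
    uint64_t fds;  /* user pointer to an int32_t array, carried as integer */
};

static_assert(sizeof(struct sketch_files_update) == 16,
              "same size for 32-bit and 64-bit user space");
static_assert(offsetof(struct sketch_files_update, fds) == 8,
              "fds is naturally aligned");

/* User-space side: pack the pointer into the 64-bit field. */
struct sketch_files_update sketch_make_update(uint32_t offset,
                                              const int32_t *fds)
{
    struct sketch_files_update up = {
        .offset = offset,
        .resv = 0,
        .fds = (uint64_t)(uintptr_t)fds,
    };
    return up;
}

/* Receiving side: reject garbage in the reserved field. */
int sketch_validate_update(const struct sketch_files_update *up)
{
    return up->resv ? -EINVAL : 0;
}

/* Receiving side: turn the integer back into a pointer. */
const int32_t *sketch_update_fds(const struct sketch_files_update *up)
{
    return (const int32_t *)(uintptr_t)up->fds;
}
```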
  3. 12 Dec, 2019 1 commit
    • io_uring: ensure we return -EINVAL on unknown opcode · 9e3aa61a
      Jens Axboe committed
      If we submit an unknown opcode and have fd == -1, io_op_needs_file()
      will return true as we default to needing a file. Then when we go and
      assign the file, we find the 'fd' invalid and return -EBADF. We really
      should be returning -EINVAL for that case, as we normally do for
      unsupported opcodes.
      
      Change io_op_needs_file() to have the following return values:
      
      0   - does not need a file
      1   - does need a file
      < 0 - error value
      
      and use this to pass back the right value for this invalid case.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
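
The three-way return convention can be sketched as below. The opcode enum and the `sketch_` functions are hypothetical stand-ins, not the kernel code; the sketch shows why checking the opcode before the fd yields -EINVAL instead of -EBADF for an unknown opcode submitted with fd == -1.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical opcode set for this sketch. */
enum sketch_opcode {
    SKETCH_OP_NOP,
    SKETCH_OP_READV,
    SKETCH_OP_TIMEOUT,
};

/* 0: does not need a file, 1: needs a file, < 0: error value. */
int sketch_op_needs_file(int opcode)
{
    switch (opcode) {
    case SKETCH_OP_NOP:
    case SKETCH_OP_TIMEOUT:
        return 0;
    case SKETCH_OP_READV:
        return 1;
    default:
        return -EINVAL; /* unknown opcode */
    }
}

/* Resolves the file for a request, propagating the opcode error first. */
int sketch_assign_file(int opcode, int fd)
{
    int ret = sketch_op_needs_file(opcode);

    if (ret <= 0)
        return ret;     /* no file needed, or unknown opcode */
    if (fd < 0)
        return -EBADF;  /* known opcode, bad descriptor */
    return 0;
}
```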
  4. 11 Dec, 2019 1 commit
    • io_uring: allow unbreakable links · 4e88d6e7
      Jens Axboe committed
      Some commands will invariably end in a failure in the sense that the
      completion result will be less than zero. One such example is timeouts
      that don't have a completion count set; they will always complete with
      -ETIME unless cancelled.
      
      For linked commands, we sever links and fail the rest of the chain if
      the result is less than zero. Since we have commands where we know that
      will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever
      regardless of the completion result. Note that the link will still sever
      if we fail submitting the parent request; hard links are only resilient
      in the presence of completion results for requests that did submit
      correctly.
      
      Cc: stable@vger.kernel.org # v5.4
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
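
The difference between a normal link and a hard link can be modelled with a toy chain runner. Everything here is a simplified assumption (one chain, every request submits correctly, hypothetical `sketch_` names); it only illustrates that a negative result severs a normal link and cancels the rest of the chain, while a hard link lets the next request run.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_link {
    SKETCH_LINK_SOFT, /* normal link: severed by a negative result */
    SKETCH_LINK_HARD, /* hard link: survives a negative result */
};

struct sketch_step {
    int res;               /* result the request produces if it runs */
    enum sketch_link link; /* how the NEXT step is linked to this one */
};

/* Runs one chain and writes each request's completion result to out[]. */
void sketch_run_chain(const struct sketch_step *steps, size_t n, int *out)
{
    bool severed = false;

    assert(steps && out);
    for (size_t i = 0; i < n; i++) {
        if (severed) {
            out[i] = -ECANCELED; /* rest of the chain fails */
            continue;
        }
        out[i] = steps[i].res;
        if (out[i] < 0 && steps[i].link == SKETCH_LINK_SOFT)
            severed = true;
    }
}
```

A timeout that completes with -ETIME followed by a write shows the difference: with a normal link the write is cancelled, with a hard link it still runs.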
  5. 03 Dec, 2019 1 commit
  6. 26 Nov, 2019 1 commit
    • io_uring: add support for IORING_OP_CONNECT · f8e85cf2
      Jens Axboe committed
      This allows an application to call connect() in an async fashion. Like
      other opcodes, we first try a non-blocking connect, then punt to async
      context if we have to.
      
      Note that we can still return -EINPROGRESS, and in that case the caller
      should use IORING_OP_POLL_ADD to do an async wait for completion of the
      connect request (just like for regular connect(2), except we can do it
      async here too).
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 10 Nov, 2019 1 commit
    • io_uring: add support for backlogged CQ ring · 1d7bb1d5
      Jens Axboe committed
      Currently we drop completion events if the CQ ring is full. That's fine
      for requests with bounded completion times, but it may make it harder or
      impossible to use io_uring with networked IO where request completion
      times are generally unbounded. Or with POLL, for example, which is also
      unbounded.
      
      After this patch, we never overflow the ring, we simply store requests
      in a backlog for later flushing. This flushing is done automatically by
      the kernel. To prevent the backlog from growing indefinitely, if the
      backlog is non-empty, we apply back pressure on IO submissions. Any
      attempt to submit new IO with a non-empty backlog will get an -EBUSY
      return from the kernel. This is a signal to the application that it has
      backlogged CQ events, and that it must reap those before being allowed
      to submit more IO.
      
      Note that if we do return -EBUSY, we will have filled whatever
      backlogged events into the CQ ring first, if there's room. This means
      the application can safely reap events WITHOUT entering the kernel and
      waiting for them, they are already available in the CQ ring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
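
The back-pressure scheme can be illustrated with a counter-based model. This is a sketch under assumptions: the `sketch_` names are hypothetical, events are reduced to counts, and submission is taken to succeed once a flush empties the backlog, which is one reading of the description above.

```c
#include <assert.h>
#include <errno.h>

/* Toy completion queue: a bounded ring plus an unbounded backlog. */
struct sketch_cq {
    unsigned cap;     /* ring capacity */
    unsigned in_ring; /* events visible to the application */
    unsigned backlog; /* events waiting for ring space */
};

/* Moves backlogged events into the ring while there is room. */
void sketch_cq_flush(struct sketch_cq *cq)
{
    while (cq->backlog > 0 && cq->in_ring < cq->cap) {
        cq->backlog--;
        cq->in_ring++;
    }
}

/* Posts one completion; nothing is dropped when the ring is full. */
void sketch_cq_post(struct sketch_cq *cq)
{
    /* Once a backlog exists, new events queue behind it to keep order. */
    if (cq->backlog == 0 && cq->in_ring < cq->cap)
        cq->in_ring++;
    else
        cq->backlog++;
}

/* Submission attempt: flush first, then refuse while a backlog remains. */
int sketch_submit(struct sketch_cq *cq)
{
    sketch_cq_flush(cq);
    if (cq->backlog > 0)
        return -EBUSY;
    return 0;
}

/* Application reaps up to max events from the ring. */
unsigned sketch_reap(struct sketch_cq *cq, unsigned max)
{
    unsigned n = cq->in_ring < max ? cq->in_ring : max;

    cq->in_ring -= n;
    return n;
}
```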
  8. 08 Nov, 2019 1 commit
    • io_uring: add support for linked SQE timeouts · 2665abfd
      Jens Axboe committed
      While we have support for generic timeouts, we don't have a way to tie
      a timeout to a specific SQE. The generic timeouts simply trigger wakeups
      on the CQ ring.
      
      This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
      as a link to a previous command. The timeout specified can be either
      relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
      the timeout triggers before the dependent command completes, it will
      attempt to cancel that command. Likewise, if the dependent command
      completes before the timeout triggers, it will cancel the timeout.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 01 Nov, 2019 1 commit
    • io_uring: support for generic async request cancel · 62755e35
      Jens Axboe committed
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
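
The result codes listed above can be summarised in a small lookup model. The request table and `sketch_` names are assumptions for illustration; the model covers only the three outcomes the commit message names and ignores the case of socket I/O that can be cancelled while running.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_state {
    SKETCH_QUEUED,  /* punted to async context, not started yet */
    SKETCH_RUNNING, /* already executing */
};

struct sketch_req {
    uint64_t user_data;
    enum sketch_state state;
    int res; /* completion result, valid once cancelled here */
};

/*
 * Looks up the request whose user_data matches target (the value the cancel
 * SQE carries in ->addr) and returns the cancel request's own result.
 */
int sketch_async_cancel(struct sketch_req *reqs, size_t n, uint64_t target)
{
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].user_data != target)
            continue;
        if (reqs[i].state == SKETCH_RUNNING)
            return -EALREADY;     /* too late to cancel cleanly */
        reqs[i].res = -ECANCELED; /* original completes as cancelled */
        return 0;
    }
    return -ENOENT;               /* no such request */
}
```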
  10. 30 Oct, 2019 5 commits
    • io_uring: add support for IORING_OP_ACCEPT · 17f2fe35
      Jens Axboe committed
      This allows an application to call accept4() in an async fashion. Like
      other opcodes, we first try a non-blocking accept, then punt to async
      context if we have to.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for canceling timeout requests · 11365043
      Jens Axboe committed
      We might have cases where the need for a specific timeout is gone. Add
      support for canceling an existing timeout operation. This works like the
      POLL_REMOVE command, where the application passes in the user_data of
      the timeout it wishes to cancel in the sqe->addr field.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for absolute timeouts · a41525ab
      Jens Axboe committed
      This is a pretty trivial addition on top of the relative timeouts
      we have now, but it's handy for ensuring tighter timing for those
      that are building scheduling primitives on top of io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: allow application controlled CQ ring size · 33a107f0
      Jens Axboe committed
      We currently size the CQ ring as twice the SQ ring, to allow some
      flexibility in not overflowing the CQ ring. This is done because the
      SQE lifetime is different from that of the IO request itself: the SQE
      is consumed as soon as the kernel has seen the entry.
      
      Certain applications don't need a huge SQ ring size, since they just
      submit IO in batches. But they may have a lot of requests pending, and
      hence need a big CQ ring to hold them all. By allowing the application
      to control the CQ ring size multiplier, we can cater to those
      applications more efficiently.
      
      If an application wants to define its own CQ ring size, it must set
      IORING_SETUP_CQSIZE in the setup flags, and fill out
      io_uring_params->cq_entries. The value must be a power of two.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_REGISTER_FILES_UPDATE · c3a31e60
      Jens Axboe committed
      Allows the application to remove/replace/add files to/from a file set.
      Passes in a struct:
      
      struct io_uring_files_update {
      	__u32 offset;
      	__s32 *fds;
      };
      
      that holds an array of fds, size of array passed in through the usual
      nr_args part of the io_uring_register() system call. The logic is as
      follows:
      
      1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
         the set.
      2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
         replaced with ->fds[i].
      
      For case #2, if the existing file is currently empty (fd == -1), the
      new fd is simply added to the array.
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
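
The update rules above can be sketched against a plain array standing in for the registered file set. The function name, the bounds check, and the return value (number of slots processed) are assumptions for this example rather than the kernel implementation.

```c
#include <assert.h>
#include <errno.h>

/*
 * Applies nr_args updates starting at table[offset]. A value of -1 empties
 * the slot; any other value replaces the slot, or fills it if it was empty.
 */
int sketch_files_update(int *table, unsigned table_len, unsigned offset,
                        const int *fds, unsigned nr_args)
{
    assert(table && fds);
    if (offset > table_len || nr_args > table_len - offset)
        return -EINVAL; /* update would run past the end of the set */

    for (unsigned i = 0; i < nr_args; i++) {
        int *slot = &table[offset + i];

        if (fds[i] == -1)
            *slot = -1;     /* case 1: remove the existing file */
        else
            *slot = fds[i]; /* case 2: replace, or add to an empty slot */
    }
    return (int)nr_args;
}
```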
  11. 19 Sep, 2019 1 commit
    • io_uring: IORING_OP_TIMEOUT support · 5262f567
      Jens Axboe committed
      There have been a few requests for functionality similar to io_getevents()
      and epoll_wait(), where the user can specify a timeout for waiting on
      events. I deliberately did not add support for this through the system
      call initially to avoid overloading the args, but I can see that the use
      cases for this are valid.
      
      This adds support for IORING_OP_TIMEOUT. If a user wants to get woken
      when waiting for events, simply submit one of these timeout commands
      with your wait call (or before). This ensures that the application
      sleeping on the CQ ring waiting for events will get woken. The timeout
      command is passed in as a pointer to a struct timespec. Timeouts are
      relative. The timeout command also includes a way to auto-cancel after
      N events have passed.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 07 Sep, 2019 1 commit
    • io_uring: expose single mmap capability · ac90f249
      Jens Axboe committed
      After commit 75b28aff we can get by with just a single mmap to
      map both the sq and cq ring. However, userspace doesn't know that.
      
      Add a features variable to io_uring_params, and notify userspace
      that the kernel has this ability. This can then be used in liburing
      (or in applications directly) to avoid the second mmap.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 10 Jul, 2019 2 commits
  14. 24 Jun, 2019 1 commit
    • io_uring: add support for sqe links · 9e645e11
      Jens Axboe committed
      With SQE links, we can create chains of dependent SQEs. One example
      would be queueing an SQE that's a read from one file descriptor, with
      the linked SQE being a write to another with the same set of buffers.
      
      An SQE link will not stall the pipeline, it'll just ensure that
      dependent SQEs aren't issued before the previous link has completed.
      
      Any error at submission or completion time will break the chain of SQEs.
      For completions, this also includes short reads or writes, as the next
      SQE could depend on the previous one being fully completed.
      
      Any SQE in a chain that gets canceled due to any of the above errors,
      will get a CQE filled with -ECANCELED as the error value.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 03 May, 2019 3 commits
    • io_uring: add support for eventfd notifications · 9b402849
      Jens Axboe committed
      Allow registration of an eventfd, which will trigger an event every
      time a completion event happens for this io_uring instance.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for IORING_OP_SYNC_FILE_RANGE · 5d17b4a4
      Jens Axboe committed
      This behaves just like sync_file_range(2) does.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: add support for marking commands as draining · de0617e4
      Jens Axboe committed
      There are no ordering constraints between the submission and completion
      side of io_uring. But sometimes that would be useful to have. One common
      example is doing an fsync, for instance, and having it ordered with
      previous writes. Without support for that, the application must do this
      tracking itself.
      
      This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked
      with this flag, then it will not be issued before previous commands have
      completed, and subsequent commands submitted after the drain will not be
      issued before the drain is started. If there are no pending commands,
      setting this flag will not change the behavior of the issue of the
      command.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>