  1. 31 May 2022, 1 commit
    • io_uring: let IORING_OP_FILES_UPDATE support choosing fixed file slots · a7c41b46
      Committed by Xiaoguang Wang
      One big issue with the file registration feature is that it needs user
      space apps to maintain free slot info about io_uring's fixed file table,
      which really is a burden for development. io_uring now supports choosing
      a free file slot for user space apps by using the IORING_FILE_INDEX_ALLOC
      flag in accept, open, and socket operations, but these require the app to
      use direct accept or direct open, which not all apps are prepared to use yet.
      
      To make the registration feature easier to use for apps that still need
      real fds, let IORING_OP_FILES_UPDATE support choosing fixed file slots:
      the picked slots are stored back in the fd array, and the cqe returns the
      number of slots allocated.
      Suggested-by: Hao Xu <howeyxu@tencent.com>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      [axboe: move flag to uapi io_uring header, change goto to break, init]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
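
      A minimal sketch of the new flow, assuming liburing exposes the
      io_uring_prep_files_update() helper and the IORING_FILE_INDEX_ALLOC
      constant, and that a (sparse) fixed file table was registered earlier;
      error handling is trimmed:

      #include <liburing.h>

      /* Ask the kernel to pick free fixed-file slots for 'nr' regular fds. */
      static int update_alloc_slots(struct io_uring *ring, int *fds, unsigned nr)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              struct io_uring_cqe *cqe;
              int ret;

              /* offset == IORING_FILE_INDEX_ALLOC: the kernel chooses the slots
               * and writes the picked slot indices back into the fds array. */
              io_uring_prep_files_update(sqe, fds, nr, IORING_FILE_INDEX_ALLOC);
              io_uring_submit(ring);
              io_uring_wait_cqe(ring, &cqe);
              ret = cqe->res;         /* number of slots allocated, or -errno */
              io_uring_cqe_seen(ring, cqe);
              return ret;
      }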
  2. 18 May 2022, 1 commit
    • io_uring: add support for ring mapped supplied buffers · c7fb1942
      Committed by Jens Axboe
      Provided buffers allow an application to supply io_uring with buffers
      that can then be grabbed for a read/receive request, when the data
      source is ready to deliver data. The existing scheme relies on using
      IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
      in real world applications. It's pretty efficient if the application
      is able to supply back batches of provided buffers when they have been
      consumed and the application is ready to recycle them, but if
      fragmentation occurs in the buffer space, it can become difficult to
      supply enough buffers at a time. This hurts efficiency.
      
      Add a register op, IORING_REGISTER_PBUF_RING, which allows an application
      to set up a shared queue for each buffer group of provided buffers. The
      application can then supply buffers simply by adding them to this ring,
      and the kernel can consume them just as easily. The ring shares the head
      with the application; the tail remains private in the kernel.
      
      Provided buffers set up with IORING_REGISTER_PBUF_RING cannot use
      IORING_OP_{PROVIDE,REMOVE}_BUFFERS to add or remove entries; they must
      use the mapped ring. Mapped provided buffer rings can co-exist with
      normal provided buffers, just not within the same group ID.
      
      To gauge the overhead of the existing scheme and evaluate the mapped ring
      approach, a simple NOP benchmark was written. It uses a ring of 128
      entries, and submits/completes 32 at a time. 'Replenish' is how
      many buffers are provided back at a time after they have been
      consumed:
      
      Test			Replenish			NOPs/sec
      ================================================================
      No provided buffers	NA				~30M
      Provided buffers	32				~16M
      Provided buffers	 1				~10M
      Ring buffers		32				~27M
      Ring buffers		 1				~27M
      
      The ring mapped buffers perform almost as well as not using provided
      buffers at all, and they don't care whether you provide 1 or more back at
      the same time. This means applications can just replenish as they go,
      rather than needing to batch and compact, further reducing overhead in the
      application. The NOP benchmark above doesn't need to do any compaction,
      so that overhead isn't even reflected in the above test.
      Co-developed-by: Dylan Yudaken <dylany@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
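
      A condensed sketch of registering one such ring and recycling a consumed
      buffer, assuming the liburing wrappers io_uring_register_buf_ring() and
      io_uring_buf_ring_init()/add()/advance() are available; group id 0 and
      128 entries are arbitrary choices:

      #include <liburing.h>
      #include <stdlib.h>

      #define BR_ENTRIES      128
      #define BR_BGID         0

      /* Allocate and register the shared buffer ring for group BR_BGID. */
      static struct io_uring_buf_ring *setup_pbuf_ring(struct io_uring *ring)
      {
              struct io_uring_buf_reg reg = { 0 };
              struct io_uring_buf_ring *br;

              if (posix_memalign((void **) &br, 4096,
                                 BR_ENTRIES * sizeof(struct io_uring_buf)))
                      return NULL;

              reg.ring_addr = (unsigned long) br;
              reg.ring_entries = BR_ENTRIES;
              reg.bgid = BR_BGID;
              if (io_uring_register_buf_ring(ring, &reg, 0))
                      return NULL;

              io_uring_buf_ring_init(br);
              return br;
      }

      /* Hand a single consumed buffer back: no batching or compaction needed. */
      static void replenish_one(struct io_uring_buf_ring *br, void *buf,
                                unsigned int len, unsigned short bid)
      {
              io_uring_buf_ring_add(br, buf, len, bid,
                                    io_uring_buf_ring_mask(BR_ENTRIES), 0);
              io_uring_buf_ring_advance(br, 1);  /* make it visible to the kernel */
      }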
  3. 14 May 2022, 1 commit
  4. 13 May 2022, 2 commits
    • io_uring: add flag for allocating a fully sparse direct descriptor space · a8da73a3
      Committed by Jens Axboe
      Currently, to set up a fully sparse descriptor space upfront, the app needs
      to allocate an array of the full size, memset it to -1, and then pass
      that in. Make this a bit easier by allowing a flag that simply does
      this internally rather than needing to copy each slot separately.
      
      This works with IORING_REGISTER_FILES2 as the flag is set in struct
      io_uring_rsrc_register, and is only allowed when the type is
      IORING_RSRC_FILE as this doesn't make sense for registered buffers.
      Reviewed-by: Hao Xu <howeyxu@tencent.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
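
      A short sketch, assuming liburing's io_uring_register_files_sparse()
      wrapper, which fills in struct io_uring_rsrc_register with the
      IORING_RSRC_REGISTER_SPARSE flag; the table size of 1024 is arbitrary:

      #include <liburing.h>

      /* Reserve a fully sparse direct-descriptor table up front, with no
       * -1-filled fd array built and copied from userspace. */
      static int setup_sparse_files(struct io_uring *ring)
      {
              return io_uring_register_files_sparse(ring, 1024);
      }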
    • io_uring: allow allocated fixed files for openat/openat2 · 1339f24b
      Committed by Jens Axboe
      If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot,
      then that's a hint to allocate a fixed file descriptor rather than have
      one be passed in directly.
      
      This can be useful for having io_uring manage the direct descriptor space.
      
      Normal open direct requests will complete with 0 for success, and < 0
      in case of error. If io_uring is asked to allocate the direct descriptor,
      then the direct descriptor is returned in case of success.
      Reviewed-by: Hao Xu <howeyxu@tencent.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
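
      A sketch of an allocated direct open, assuming a liburing version whose
      io_uring_prep_openat_direct() accepts IORING_FILE_INDEX_ALLOC as its
      file_index argument:

      #include <fcntl.h>
      #include <liburing.h>

      static int open_direct_alloc(struct io_uring *ring, const char *path)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              struct io_uring_cqe *cqe;
              int slot;

              io_uring_prep_openat_direct(sqe, AT_FDCWD, path, O_RDONLY, 0,
                                          IORING_FILE_INDEX_ALLOC);
              io_uring_submit(ring);
              io_uring_wait_cqe(ring, &cqe);
              slot = cqe->res;  /* allocated fixed slot on success, -errno on error */
              io_uring_cqe_seen(ring, cqe);
              return slot;
      }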
  5. 11 May 2022, 1 commit
  6. 09 May 2022, 2 commits
  7. 06 May 2022, 1 commit
  8. 30 April 2022, 3 commits
  9. 26 April 2022, 1 commit
  10. 25 April 2022, 6 commits
  11. 11 April 2022, 1 commit
    • io_uring: flag the fact that linked file assignment is sane · c4212f3e
      Committed by Jens Axboe
      Give applications a way to tell if the kernel supports sane linked files,
      as in files being assigned at the right time to be able to reliably
      do <open file direct into slot X><read file from slot X> while using
      IOSQE_IO_LINK to order them.
      
      Not really a bug fix, but flag it as such so that it gets pulled in with
      backports of the deferred file assignment.
      
      Fixes: 6bf9c47a ("io_uring: defer file assignment")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
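
      A sketch of probing for this at startup, assuming the feature bit is the
      IORING_FEAT_LINKED_FILE flag in the uapi header:

      #include <liburing.h>

      static int supports_linked_file(void)
      {
              struct io_uring_params p = { 0 };
              struct io_uring ring;

              if (io_uring_queue_init_params(8, &ring, &p) < 0)
                      return 0;
              io_uring_queue_exit(&ring);
              return !!(p.features & IORING_FEAT_LINKED_FILE);
      }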
  12. 24 March 2022, 1 commit
    • io_uring: remove IORING_CQE_F_MSG · 7ef66d18
      Committed by Jens Axboe
      This was introduced with the message ring opcode, but isn't strictly
      required for the request itself. The sender can encode what is needed
      in user_data, which is passed to the receiver. It's unclear if having
      a separate flag that essentially says "This CQE did not originate from
      an SQE on this ring" provides any real utility to applications. While
      we can always re-introduce a flag to provide this information, we cannot
      take it away at a later point in time.
      
      Remove the flag while we still can, before it's in a released kernel.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 11 March 2022, 2 commits
    • io_uring: allow submissions to continue on error · bcbb7bf6
      Committed by Jens Axboe
      By default, io_uring will stop submitting a batch of requests if we run
      into an error submitting a request. This isn't strictly necessary, as
      the error result is passed out-of-band via a CQE anyway. And it can be
      a bit confusing for some applications.
      
      Provide a way to set up a ring that will continue submitting on error,
      after the error CQE has been posted.
      
      There's still one case that will break out of submission. If we fail
      allocating a request, then we'll still return -ENOMEM. We could in theory
      post a CQE for that condition too even if we never got a request. Leave
      that for a potential followup.
      Reported-by: Dylan Yudaken <dylany@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
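
      A sketch of opting in at setup time, assuming the new flag is exposed as
      IORING_SETUP_SUBMIT_ALL:

      #include <liburing.h>

      /* Ring that keeps submitting the rest of a batch even if one request
       * fails at submit time; the failure still shows up as a CQE. */
      static int init_submit_all_ring(struct io_uring *ring)
      {
              return io_uring_queue_init(64, ring, IORING_SETUP_SUBMIT_ALL);
      }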
    • io_uring: add support for IORING_OP_MSG_RING command · 4f57f06c
      Committed by Jens Axboe
      This adds support for IORING_OP_MSG_RING, which allows an SQE to signal
      another ring. That allows either waking up someone waiting on the ring,
      or even passing a 64-bit value via the user_data field in the CQE.
      
      sqe->fd must contain the fd of a ring that should receive the CQE.
      sqe->off will be propagated to the cqe->user_data on the target ring,
      and sqe->len will be propagated to cqe->res. The resulting CQE will have
      IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated
      from a messaging request rather than an SQE issued locally on that ring.
      This effectively allows passing a 64-bit and a 32-bit quantity between
      the two rings.
      
      This request type has the following request specific error cases:
      
      - -EBADFD. Set if the sqe->fd doesn't point to a file descriptor that is
        of the io_uring type.
      - -EOVERFLOW. Set if we were not able to deliver a request to the target
        ring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
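
      A sketch of signalling another ring, assuming liburing's
      io_uring_prep_msg_ring() helper; the len and data values are placeholders:

      #include <liburing.h>

      static int signal_other_ring(struct io_uring *ring, int target_ring_fd)
      {
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              /* len -> cqe->res and data -> cqe->user_data on the target ring */
              io_uring_prep_msg_ring(sqe, target_ring_fd, 0x1234, 0xcafef00d, 0);
              return io_uring_submit(ring);
      }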
  14. 10 March 2022, 1 commit
    • io_uring: add support for registering ring file descriptors · e7a6c00d
      Committed by Jens Axboe
      Lots of workloads use multiple threads, in which case the file table is
      shared between them. This makes getting and putting the ring file
      descriptor for each io_uring_enter(2) system call more expensive, as it
      involves an atomic get and put for each call.
      
      Similarly to how we allow registering normal file descriptors to avoid
      this overhead, add support for an io_uring_register(2) API that allows
      registering the ring fds themselves:
      
      1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update
         structs, and registers them with the task.
      2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_rsrc_update
         structs, and unregisters them.
      
      When a ring fd is registered, it is internally represented by an offset.
      This offset is returned to the application, and the application then
      uses this offset and sets IORING_ENTER_REGISTERED_RING for the
      io_uring_enter(2) system call. This works just like using a registered
      file descriptor, rather than a real one, in an SQE, where
      IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal
      offset/descriptor rather than a real file descriptor.
      
      In initial testing, this provides a nice bump in performance for
      threaded applications in real world cases where the batch count (eg
      number of requests submitted per io_uring_enter(2) invocation) is low.
      In a microbenchmark, submitting NOP requests, we see the following
      increases in performance:
      
      Requests per syscall	Baseline	Registered	Increase
      ----------------------------------------------------------------
      1			 ~7030K		 ~8080K		+15%
      2			~13120K		~14800K		+13%
      4			~22740K		~25300K		+11%
      Co-developed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
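
      A sketch of per-thread usage, assuming liburing's io_uring_register_ring_fd()
      wrapper, which registers the fd and makes subsequent liburing calls pass
      IORING_ENTER_REGISTERED_RING automatically:

      #include <liburing.h>

      static void use_registered_ring(struct io_uring *ring)
      {
              /* Returns 1 on success; on older kernels the plain fd path keeps working. */
              if (io_uring_register_ring_fd(ring) != 1)
                      return;
              /* io_uring_submit()/io_uring_wait_cqe() now use the registered
               * offset instead of fdget/fdput on the real ring fd. */
      }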
  15. 25 November 2021, 1 commit
    • io_uring: add option to skip CQE posting · 04c76b41
      Committed by Pavel Begunkov
      Emitting a CQE is expensive from the kernel's perspective. Often, it's
      also not convenient for userspace, which spends cycles on processing it,
      and it just complicates the logic. A similar problem exists for linked
      requests, where we post a CQE for each request in the link.
      
      Introduce a new flag, IOSQE_CQE_SKIP_SUCCESS, to help with this.
      When set and a request completes successfully, it won't generate a CQE.
      When it fails, it produces a CQE, but all following linked requests will
      be CQE-less, regardless of whether they have IOSQE_CQE_SKIP_SUCCESS set.
      The notion of "fail" is the same as for link failing-cancellation, where
      it's opcode dependent, and _usually_ result >= 0 is a success, but not
      always.
      
      Linked timeouts are a bit special. When the request it's linked to was
      never attempted, e.g. because of failed linked requests, it follows
      the description above. Otherwise, whether a linked timeout will post a
      completion or not solely depends on IOSQE_CQE_SKIP_SUCCESS of that
      linked timeout request. Linked timeouts never "fail" during execution, so
      for them it's unconditional. It's expected that users don't really care
      about the result of it but rely solely on the result of the master
      request. Another reason for such treatment is that it's racy, and the
      timeout callback may be running while the master request posts its
      completion.
      
      use case 1:
      If one doesn't care about the results of some requests, e.g. normal
      timeouts, just set IOSQE_CQE_SKIP_SUCCESS. An error result will still be
      posted and needs to be handled.
      
      use case 2:
      Set IOSQE_CQE_SKIP_SUCCESS for all requests of a link but the last,
      and it'll post a completion only for the last one if everything goes
      right; otherwise there will be only one CQE, for the first failed
      request.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0220fbe06f7cf99e6fc71b4297bb1cb6c0e89c2c.1636559119.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
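
      A sketch of use case 2, a two-request link where only the tail posts a
      CQE on success; fd, buf and len are placeholders:

      #include <liburing.h>

      static void write_then_fsync_quiet(struct io_uring *ring, int fd,
                                         const void *buf, unsigned len)
      {
              struct io_uring_sqe *sqe;

              sqe = io_uring_get_sqe(ring);
              io_uring_prep_write(sqe, fd, buf, len, 0);
              /* silent on success; an error still posts a CQE and breaks the link */
              sqe->flags |= IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS;

              sqe = io_uring_get_sqe(ring);
              io_uring_prep_fsync(sqe, fd, 0);  /* only this one completes normally */

              io_uring_submit(ring);
      }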
  16. 19 October 2021, 1 commit
  17. 14 September 2021, 1 commit
  18. 30 August 2021, 1 commit
  19. 29 August 2021, 2 commits
    • io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts · 50c1df2b
      Committed by Jens Axboe
      Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than
      the default CLOCK_MONOTONIC.
      
      Add IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flags that
      allow timeouts and linked timeouts to use the selected clock source.
      
      Only one clock source may be selected, and we -EINVAL the request if more
      than one is given. If neither BOOTTIME nor REALTIME is selected, the
      previous default of MONOTONIC is used.
      
      Link: https://github.com/axboe/liburing/issues/369
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
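
      A sketch of arming a 5-second timeout against CLOCK_REALTIME instead of
      the default clock:

      #include <liburing.h>

      static void arm_realtime_timeout(struct io_uring *ring)
      {
              static struct __kernel_timespec ts = { .tv_sec = 5 };
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

              io_uring_prep_timeout(sqe, &ts, 0, IORING_TIMEOUT_REALTIME);
              io_uring_submit(ring);
      }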
    • io-wq: provide a way to limit max number of workers · 2e480058
      Committed by Jens Axboe
      io-wq divides work into two categories:
      
      1) Work that completes in a bounded time, like reading from a regular file
         or a block device. This type of work is limited based on the size of
         the SQ ring.
      
      2) Work that may never complete; we call this unbounded work. The number
         of workers here is limited only by RLIMIT_NPROC.
      
      For various use cases, it's handy to have the kernel limit the maximum
      number of pending workers for both categories. Provide a way to do this
      with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.
      
      IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
      the max worker count to what is being passed in for each category. The
      old values are returned into that same array. If 0 is being passed in for
      either category, it simply returns the current value.
      
      The value is capped at RLIMIT_NPROC. This actually isn't that important,
      as it's more of a hint; if we exceed the value, our attempt to fork a new
      worker will simply fail. This happens naturally already if more than one
      node is in the system, as these values are per-node internally for io-wq.
      Reported-by: Johannes Lundberg <johalun0@gmail.com>
      Link: https://github.com/axboe/liburing/issues/420
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
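
      A sketch, assuming liburing's io_uring_register_iowq_max_workers()
      wrapper; the limits of 8 bounded and 4 unbounded workers are arbitrary:

      #include <liburing.h>

      static int limit_iowq_workers(struct io_uring *ring)
      {
              unsigned int vals[2] = { 8, 4 };  /* [0] = bounded, [1] = unbounded */

              /* The previous limits are written back into vals[] by the kernel. */
              return io_uring_register_iowq_max_workers(ring, vals);
      }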
  20. 25 August 2021, 1 commit
  21. 24 August 2021, 3 commits
  22. 01 July 2021, 1 commit
  23. 18 June 2021, 1 commit
    • io_uring: allow user configurable IO thread CPU affinity · fe76421d
      Committed by Jens Axboe
      io-wq defaults to per-node masks for IO workers. This works fine by
      default, but isn't particularly handy for workloads that prefer more
      specific affinities, for either performance or isolation reasons.
      
      This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU
      mask that is then applied to IO thread workers, and an
      IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the
      default of per-node.
      
      Note that no care is given to existing IO threads; they will need to go
      through a reschedule before the new affinity takes effect if they are
      already running or sleeping.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
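
      A sketch of pinning the IO workers of this ring to CPUs 0-1, assuming
      liburing's io_uring_register_iowq_aff() wrapper:

      #define _GNU_SOURCE
      #include <sched.h>
      #include <liburing.h>

      static int pin_iowq_workers(struct io_uring *ring)
      {
              cpu_set_t mask;

              CPU_ZERO(&mask);
              CPU_SET(0, &mask);
              CPU_SET(1, &mask);
              return io_uring_register_iowq_aff(ring, sizeof(mask), &mask);
      }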
  24. 11 June 2021, 2 commits
  25. 26 April 2021, 2 commits