1. 21 Jan, 2020 19 commits
    • P
      io_uring: batch getting pcpu references · 2b85edfc
      Pavel Begunkov committed
      percpu_ref_tryget() has its own overhead. Instead of getting a reference
      for each request, grab a bunch once per io_submit_sqes().
      
      ~5% throughput boost for a "submit and wait 128 nops" benchmark.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      
      __io_req_free_empty() -> __io_req_do_free()
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
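The batching idea in this commit can be sketched in userspace, with a plain C11 atomic counter standing in for the percpu ref. The names `demo_ref`, `demo_ref_tryget_many`, `demo_ref_put`, and `demo_submit` are hypothetical and are not the kernel's percpu_ref API. The sketch only shows one bulk acquisition replacing one acquisition per request.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference counter; not the kernel's percpu_ref. */
struct demo_ref {
    atomic_long count;   /* 0 means the object is already dead */
};

/* Take nr references in one atomic step; fails if the ref is already dead. */
static bool demo_ref_tryget_many(struct demo_ref *ref, long nr)
{
    long old = atomic_load(&ref->count);

    assert(nr > 0);
    while (old > 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&ref->count, &old, old + nr))
            return true;
    }
    return false;
}

/* Drop a single reference, as each request does when it completes. */
static void demo_ref_put(struct demo_ref *ref)
{
    atomic_fetch_sub(&ref->count, 1);
}

/* Submit nr requests with one bulk tryget instead of nr individual ones. */
static int demo_submit(struct demo_ref *ref, int nr)
{
    if (!demo_ref_tryget_many(ref, nr))
        return -1;
    for (int i = 0; i < nr; i++)
        demo_ref_put(ref);   /* request i "completes" immediately in this sketch */
    return nr;
}
```

Calling `demo_submit(&ref, 128)` touches the counter once on the acquire side, however many requests are in the batch.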
    • J
      io_uring: add IORING_OP_MADVISE · c1ca757b
      Jens Axboe committed
      This adds support for doing madvise(2) through io_uring. We assume that
      any operation can block, and hence punt everything async. This could be
      improved, but hard to make bullet proof. The async punt ensures it's
      safe.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: add IORING_OP_FADVISE · 4840e418
      Jens Axboe committed
      This adds support for doing fadvise through io_uring. We assume that
      WILLNEED doesn't block, but that DONTNEED may block.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: allow use of offset == -1 to mean file position · ba04291e
      Jens Axboe committed
      This behaves like preadv2/pwritev2 with offset == -1, it'll use (and
      update) the current file position. This obviously comes with the caveat
      that if the application has multiple read/writes in flight, then the
      end result will not be as expected. This is similar to threads sharing
      a file descriptor and doing IO using the current file position.
      
      Since this feature isn't easily detectable by doing a read or write,
      add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to
      detect presence of this feature.
      Reported-by: 李通洲 <carter.li@eoitek.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
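The offset == -1 semantics described in this commit can be illustrated with a userspace analogy built from read(2) and pread(2). This is a sketch, not io_uring code: `demo_read` and `demo_run` are hypothetical helpers, and the temporary file exists only for the demonstration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Userspace analogy only: off == -1 reads at the current file position and
 * advances it (like read(2)); any other offset is positional (like pread(2))
 * and leaves the file position untouched.
 */
static ssize_t demo_read(int fd, void *buf, size_t len, off_t off)
{
    assert(buf != NULL);
    if (off == -1)
        return read(fd, buf, len);
    return pread(fd, buf, len, off);
}

/* Runs both modes against a temporary file; returns 0 if the semantics hold. */
static int demo_run(void)
{
    char buf[4];
    FILE *f = tmpfile();
    int fd, ok;

    if (!f)
        return -1;
    fd = fileno(f);
    if (write(fd, "abcdefgh", 8) != 8 || lseek(fd, 0, SEEK_SET) != 0) {
        fclose(f);
        return -1;
    }
    ok = demo_read(fd, buf, 4, 4) == 4 && memcmp(buf, "efgh", 4) == 0
        && lseek(fd, 0, SEEK_CUR) == 0     /* positional read: position unchanged */
        && demo_read(fd, buf, 4, -1) == 4 && memcmp(buf, "abcd", 4) == 0
        && lseek(fd, 0, SEEK_CUR) == 4;    /* offset -1: position advanced */
    fclose(f);
    return ok ? 0 : -1;
}
```

The caveat from the commit message applies to the analogy as well: two in-flight `demo_read(..., -1)` calls on the same descriptor would race on the shared file position.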
    • J
      io_uring: add non-vectored read/write commands · 3a6820f2
      Jens Axboe committed
      For use cases that don't already naturally have an iovec, it's easier
      (or more convenient) to just use a buffer address + length. This is
      particularly true if the use case is from languages that want to create
      a memory safe abstraction on top of io_uring, and where introducing
      the need for the iovec may impose an ownership issue. For those cases,
      they currently need an indirection buffer, which means allocating data
      just for this purpose.
      
      Add basic read/write that don't require the iovec.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: improve poll completion performance · e94f141b
      Jens Axboe committed
      For busy IORING_OP_POLL_ADD workloads, we can have enough contention
      on the completion lock that we fail the inline completion path quite
      often as we fail the trylock on that lock. Add a list for deferred
      completions that we can use in that case. This helps reduce the number
      of async offloads we have to do, as if we get multiple completions in
      a row, we'll piggy back on to the poll_llist instead of having to queue
      our own offload.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: split overflow state into SQ and CQ side · ad3eb2c8
      Jens Axboe committed
      We currently check ->cq_overflow_list from both SQ and CQ context, which
      causes some bouncing of that cache line. Add separate bits of state for
      this instead, so that the SQ side can check using its own state, and
      likewise for the CQ side.
      
      This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow
      with the CQ state. If we hit an overflow condition, both of these bits
      are set. Likewise for overflow flush clear, we clear both bits. For the
      fast path of just checking if there's an overflow condition on either
      the SQ or CQ side, we can use our own private bit for this.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
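The layout idea behind the split overflow state can be sketched as follows. The struct and helpers are hypothetical, the 64-byte cache line size is an assumption, and the sketch ignores the locking and memory ordering that real concurrent code would need. It only shows each side reading a private flag on its own cache line, with both flags set and cleared together.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_CACHELINE 64   /* assumed cache line size */

/* Hypothetical ring context: SQ-side and CQ-side flags sit on separate lines. */
struct demo_ctx {
    alignas(DEMO_CACHELINE) bool sq_check_overflow;   /* read by the submission side */
    alignas(DEMO_CACHELINE) bool cq_check_overflow;   /* read by the completion side */
};

static_assert(offsetof(struct demo_ctx, cq_check_overflow) -
              offsetof(struct demo_ctx, sq_check_overflow) >= DEMO_CACHELINE,
              "the two flags must not share a cache line");

/* Hitting an overflow condition sets both bits. */
static void demo_mark_overflow(struct demo_ctx *ctx)
{
    ctx->sq_check_overflow = true;
    ctx->cq_check_overflow = true;
}

/* Flushing the overflow list clears both bits. */
static void demo_clear_overflow(struct demo_ctx *ctx)
{
    ctx->sq_check_overflow = false;
    ctx->cq_check_overflow = false;
}

/* Fast-path checks: each side reads only its own flag. */
static bool demo_sq_sees_overflow(const struct demo_ctx *ctx)
{
    return ctx->sq_check_overflow;
}

static bool demo_cq_sees_overflow(const struct demo_ctx *ctx)
{
    return ctx->cq_check_overflow;
}
```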
    • J
      io_uring: add lookup table for various opcode needs · d3656344
      Jens Axboe committed
      We currently have various switch statements that check if an opcode needs
      a file, mm, etc. These are hard to keep in sync as opcodes are added. Add
      a struct io_op_def that holds all of this information, so we have just
      one spot to update when opcodes are added.
      
      This also enables us to NOT allocate req->io if a deferred command
      doesn't need it, and corrects some mistakes we had in terms of what
      commands need mm context.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
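A per-opcode lookup table of the kind this commit describes might look roughly like the sketch below. The opcodes, the field set, and the values in the table are illustrative assumptions, not the kernel's actual `struct io_op_def` contents. What matters is that adding an opcode means adding one table row rather than touching several switch statements.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcodes for illustration; not the kernel's IORING_OP_* values. */
enum demo_op {
    DEMO_OP_NOP,
    DEMO_OP_READ,
    DEMO_OP_TIMEOUT,
    DEMO_OP_LAST,
};

/* One row of per-opcode requirements (illustrative fields). */
struct demo_op_def {
    unsigned needs_file : 1;   /* opcode operates on a file */
    unsigned needs_mm   : 1;   /* opcode needs the submitter's mm context */
    unsigned async_ctx  : 1;   /* opcode needs extra state when deferred */
};

/* The single spot to update when an opcode is added. */
static const struct demo_op_def demo_op_defs[DEMO_OP_LAST] = {
    [DEMO_OP_NOP]     = { 0 },
    [DEMO_OP_READ]    = { .needs_file = 1, .needs_mm = 1, .async_ctx = 1 },
    [DEMO_OP_TIMEOUT] = { .async_ctx = 1 },
};

static bool demo_op_needs_file(enum demo_op op)
{
    assert(op < DEMO_OP_LAST);
    return demo_op_defs[op].needs_file;
}

/* Lets the caller skip allocating deferred state for opcodes that do not use it. */
static bool demo_op_needs_async_ctx(enum demo_op op)
{
    assert(op < DEMO_OP_LAST);
    return demo_op_defs[op].async_ctx;
}
```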
    • J
      io_uring: remove two unnecessary function declarations · add7b6b8
      Jens Axboe committed
      __io_free_req() and io_double_put_req() aren't used before they are
      defined, so we can kill these two forwards.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: move *queue_link_head() from common path · 32fe525b
      Pavel Begunkov committed
      Move io_queue_link_head() to links handling code in io_submit_sqe(),
      so it wouldn't need extra checks and would have better data locality.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • P
      io_uring: rename prev to head · 9d76377f
      Pavel Begunkov committed
      Calling "prev" a head of a link is a bit misleading. Rename it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: add IOSQE_ASYNC · ce35a47a
      Jens Axboe committed
      io_uring defaults to always doing inline submissions, if at all
      possible. But for larger copies, even if the data is fully cached, that
      can take a long time. Add an IOSQE_ASYNC flag that the application can
      set on the SQE - if set, it'll ensure that we always go async for those
      kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we
      get the concurrency we desire for this case.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: add support for IORING_OP_STATX · eddc7ef5
      Jens Axboe committed
      This provides support for async statx(2) through io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: avoid ring quiesce for fixed file set unregister and update · 05f3fb3c
      Jens Axboe committed
      We currently fully quiesce the ring before an unregister or update of
      the fixed fileset. This is very expensive, and we can be a bit smarter
      about this.
      
      Add a percpu refcount for the file tables as a whole. Grab a percpu ref
      when we use a registered file, and put it on completion. This is cheap
      to do. Upon removal of a file from a set, switch the ref count to atomic
      mode. When we hit zero ref on the completion side, then we know we can
      drop the previously registered files. When the old files have been
      dropped, switch the ref back to percpu mode for normal operation.
      
      Since there's a period between doing the update and the kernel being
      done with it, add an IORING_OP_FILES_UPDATE opcode that can perform the
      same action. The application knows the update has completed when it gets
      the CQE for it. Between doing the update and receiving this completion,
      the application must continue to use the unregistered fd if submitting
      IO on this particular file.
      
      This takes the runtime of test/file-register from liburing from 14s to
      about 0.7s.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: add support for IORING_OP_CLOSE · b5dba59e
      Jens Axboe committed
      This works just like close(2), unsurprisingly. We remove the file
      descriptor and post the completion inline, then offload the actual
      (potential) last file put to async context.
      
      Mark the async part of this work as uncancellable, as we really must
      guarantee that the latter part of the close is run.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io-wq: add support for uncancellable work · 0c9d5ccd
      Jens Axboe committed
      Not all work can be cancelled; for some of it we may need to guarantee
      that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL
      on work that must not be cancelled. Note that the caller work function
      must also check for IO_WQ_WORK_NO_CANCEL on work that is marked
      IO_WQ_WORK_CANCEL.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
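The check that the work function is asked to perform can be sketched like this. The flag values, the `demo_work` struct, and `demo_work_fn` are hypothetical and do not reproduce the io-wq definitions. The sketch shows only the rule that a cancel request is honoured unless the work is also marked as not cancellable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's IO_WQ_WORK_* values may differ. */
#define DEMO_WORK_CANCEL     (1u << 0)   /* someone asked for this work to be cancelled */
#define DEMO_WORK_NO_CANCEL  (1u << 1)   /* this work must run to completion anyway */

struct demo_work {
    unsigned flags;
    bool ran;        /* set once the work body has executed */
};

/* Returns true if the work body ran, false if the work was cancelled. */
static bool demo_work_fn(struct demo_work *work)
{
    assert(work != NULL);

    /* Honour CANCEL only when NO_CANCEL is not set. */
    if ((work->flags & DEMO_WORK_CANCEL) && !(work->flags & DEMO_WORK_NO_CANCEL))
        return false;

    work->ran = true;   /* stand-in for the real work, e.g. the final file put */
    return true;
}
```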
    • J
      io_uring: add support for IORING_OP_OPENAT · 15b71abe
      Jens Axboe committed
      This works just like openat(2), except it can be performed async. For
      the normal case of a non-blocking path lookup this will complete
      inline. If we have to do IO to perform the open, it'll be done from
      async context.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: add support for fallocate() · d63d1b5e
      Jens Axboe committed
      This exposes fallocate(2) through io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • E
      io_uring: fix compat for IORING_REGISTER_FILES_UPDATE · 1292e972
      Eugene Syromiatnikov committed
      The fds field of struct io_uring_files_update is problematic with regard
      to compat user space, as pointer size is different in 32-bit, 32-on-64-bit,
      and 64-bit user space.  In order to avoid custom handling of compat in
      the syscall implementation, make fds __u64 and use u64_to_user_ptr in
      order to retrieve it.  Also, align the field naturally and check that
      no garbage is passed there.
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
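The compat-safe layout described in this commit can be sketched in portable C. The struct below is a hypothetical mirror of the idea (a pointer carried in a naturally aligned 64-bit integer, plus a reserved field that must be zero), and the helper names are invented for illustration. It is not a copy of the uapi header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the described layout; same size for 32-bit and 64-bit callers. */
struct demo_files_update {
    uint32_t offset;
    uint32_t resv;   /* padding that keeps fds naturally aligned; must be zero */
    uint64_t fds;    /* carries an int32_t * as a 64-bit integer */
};

static_assert(sizeof(struct demo_files_update) == 16, "layout must not depend on pointer size");
static_assert(offsetof(struct demo_files_update, fds) == 8, "fds must be naturally aligned");

/* Userspace side: widen a pointer into the 64-bit field. */
static uint64_t demo_ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}

/* Receiving side: narrow the 64-bit field back into a pointer. */
static void *demo_u64_to_ptr(uint64_t v)
{
    return (void *)(uintptr_t)v;
}

/* Rejects garbage in the reserved field, then reads the first fd. */
static int demo_first_fd(const struct demo_files_update *up, int32_t *out)
{
    const int32_t *fds;

    if (up->resv)
        return -1;
    fds = demo_u64_to_ptr(up->fds);
    *out = fds[0];
    return 0;
}
```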
  2. 17 Jan, 2020 1 commit
  3. 16 Jan, 2020 2 commits
  4. 15 Jan, 2020 1 commit
  5. 14 Jan, 2020 1 commit
  6. 08 Jan, 2020 1 commit
    • J
      io_uring: remove punt of short reads to async context · eacc6dfa
      Jens Axboe committed
      We currently punt any short read on a regular file to async context,
      but this fails if the short read is due to running into EOF. This is
      especially problematic since we only do the single prep for commands
      now, as we don't reset kiocb->ki_pos. This can result in a 4k read on
      a 1k file returning zero, as we detect the short read and then retry
      from async context. At the time of retry, the position is now 1k, and
      we end up reading nothing, and hence return 0.
      
      Instead of trying to patch around the fact that short reads can be
      legitimate and won't succeed in case of retry, remove the logic to punt
      a short read to async context. Simply return it.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
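The failure mode in this commit (a 4k read on a 1k file ending up as 0) can be reproduced in spirit with standard C stdio. This is only an analogy under stated assumptions: `demo_short_read` is a hypothetical helper, and the second `fread` stands in for the async retry that ran without the position being reset.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes file_len bytes to a temporary file, then issues two reads of `want`
 * bytes without resetting the position in between. Returns 0 on success and
 * stores the byte counts of the first read and of the retry.
 */
static int demo_short_read(size_t file_len, size_t want, size_t *first, size_t *retry)
{
    static char data[4096], buf[4096];
    FILE *f;

    assert(file_len <= sizeof(data) && want <= sizeof(buf));
    f = tmpfile();
    if (!f)
        return -1;
    memset(data, 'x', file_len);
    if (fwrite(data, 1, file_len, f) != file_len) {
        fclose(f);
        return -1;
    }
    rewind(f);
    *first = fread(buf, 1, want, f);   /* short read: stops at EOF */
    *retry = fread(buf, 1, want, f);   /* position is already at EOF */
    fclose(f);
    return 0;
}
```

With a 1024-byte file and a 4096-byte request, the first read is legitimately short and the retry has nothing left to read, which is why returning the short read directly is the simpler fix.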
  7. 21 Dec, 2019 6 commits
    • J
      io_uring: pass in 'sqe' to the prep handlers · 3529d8c2
      Jens Axboe committed
      This moves the prep handlers outside of the opcode handlers, and allows
      us to pass in the sqe directly. If the sqe is non-NULL, it means that
      the request should be prepared for the first time.
      
      With the opcode handlers not having access to the sqe at all, we are
      guaranteed that the prep handler has set up the request fully by the
      time we get there. As before, for opcodes that need to copy in more
      data than the io_kiocb allows for, the io_async_ctx holds that info. If
      a prep handler is invoked with req->io set, it must use that to retain
      information for later.
      
      Finally, we can remove io_kiocb->sqe as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: standardize the prep methods · 06b76d44
      Jens Axboe committed
      We currently have a mix of use cases. Most of the newer ones are pretty
      uniform, but we have some older ones that use different calling
      conventions. This is confusing.
      
      For the opcodes that currently rely on the req->io->sqe copy saving
      them from reuse, add a request type struct in the io_kiocb command
      union to store the data they need.
      
      Prepare for all opcodes having a standard prep method, so we can call
      it in a uniform fashion and outside of the opcode handler. This is in
      preparation for passing in the 'sqe' pointer, rather than storing it
      in the io_kiocb. Once we have uniform prep handlers, we can leave all
      the prep work to that part, and not even pass in the sqe to the opcode
      handler. This ensures that we don't reuse sqe data inadvertently.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler · 26a61679
      Jens Axboe committed
      Add the count field to struct io_timeout, and ensure the prep handler
      has read it. Timeout also needs an async context always, set it up
      in the prep handler if we don't have one.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler · e47293fd
      Jens Axboe committed
      Add struct io_sr_msg in our io_kiocb per-command union, and ensure that
      the send/recvmsg prep handlers have grabbed what they need from the SQE
      by the time prep is done.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: move all prep state for IORING_OP_CONNECT to prep handler · 3fbb51c1
      Jens Axboe committed
      Add struct io_connect in our io_kiocb per-command union, and ensure
      that io_connect_prep() has grabbed what it needs from the SQE.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      io_uring: add and use struct io_rw for read/writes · 9adbd45d
      Jens Axboe committed
      Put the kiocb in struct io_rw, and add the addr/len for the request as
      well. Use the kiocb->private field for the buffer index for fixed reads
      and writes.
      
      Any use of kiocb->ki_filp is flipped to req->file. It's the same thing,
      and less confusing.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 20 Dec, 2019 1 commit
  9. 19 Dec, 2019 2 commits
    • J
      io_uring: io_wq_submit_work() should not touch req->rw · fd6c2e4c
      Jens Axboe committed
      I've been chasing a weird and obscure crash that was userspace stack
      corruption, and finally narrowed it down to a bit flip that made a
      stack address invalid. io_wq_submit_work() unconditionally flips
      the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work
      handler, this isn't valid. Normal read/write operations own that
      part of the request, on other types it could be something else.
      
      Move the IOCB_NOWAIT clear to the read/write handlers where it belongs.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
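The class of bug described here, a generic handler writing through one member of a per-command union that a different request type owns, can be sketched as follows. All names and the flag value are hypothetical, and the structs are reduced to a single field each so the overlap is easy to see.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NOWAIT (1u << 3)   /* hypothetical bit, not the kernel's IOCB_NOWAIT value */

struct demo_rw    { uint32_t ki_flags; };   /* owned by read/write requests */
struct demo_other { uint32_t cookie; };     /* owned by some other request type */

struct demo_req {
    int is_rw;
    union {                     /* both members share the same storage */
        struct demo_rw rw;
        struct demo_other other;
    };
};

/* Buggy generic handler: clears the bit no matter which member is live. */
static void demo_generic_clear_buggy(struct demo_req *req)
{
    req->rw.ki_flags &= ~DEMO_NOWAIT;
}

/* Fixed placement: only the read/write handler touches rw.ki_flags. */
static void demo_rw_handler(struct demo_req *req)
{
    assert(req->is_rw);
    req->rw.ki_flags &= ~DEMO_NOWAIT;
}
```

Running the buggy handler on a non-read/write request silently flips a bit in `other.cookie`, which is the kind of corruption the commit message traces back to the generic work handler.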
    • P
      io_uring: don't wait when under-submitting · 7c504e65
      Pavel Begunkov committed
      There is no reliable way to submit and wait in a single syscall, as
      io_submit_sqes() may under-consume sqes (in case of an early error).
      Then it will wait for not-yet-submitted requests, deadlocking the user
      in most cases.
      
      Don't wait/poll if we can't submit all sqes.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
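The control flow this commit argues for can be modelled with a small pure function. `demo_enter` and its parameters are hypothetical: `accepted` stands in for however many sqes the submit step actually consumed, and `waited` records whether the call would have gone on to block for completions.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_enter_result {
    int submitted;   /* number of sqes actually consumed */
    bool waited;     /* whether the call went on to wait for completions */
};

/*
 * Model of a combined submit-and-wait call. If fewer sqes were consumed than
 * requested, return the short count immediately instead of waiting for
 * completions of requests that were never submitted.
 */
static struct demo_enter_result demo_enter(int to_submit, int accepted, int min_complete)
{
    struct demo_enter_result res = { .submitted = accepted, .waited = false };

    assert(accepted >= 0 && accepted <= to_submit);
    if (accepted != to_submit)
        return res;            /* under-submitted: skip the wait entirely */
    if (min_complete > 0)
        res.waited = true;     /* stand-in for blocking until min_complete CQEs arrive */
    return res;
}
```

A caller that sees `submitted < to_submit` can inspect the error on the failing sqe and resubmit, rather than hanging in the wait.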
  10. 18 Dec, 2019 6 commits