1. 04 6月, 2020 4 次提交
    • J
      io_uring: support buffer selection for OP_READ and OP_RECV · 1d68f9f6
      Jens Axboe 提交于
      to #28170604
      
      commit bcda7baaa3f15c7a95db3c024bb046d6e298f76b upstream
      
      If a server process has tons of pending socket connections, generally
      it uses epoll to wait for activity. When the socket is ready for reading
      (or writing), the task can select a buffer and issue a recv/send on the
      given fd.
      
      Now that we have fast (non-async thread) support, a task can have tons
      of pending reads or writes pending. But that means they need buffers to
      back that data, and if the number of connections is high enough, having
      them preallocated for all possible connections is unfeasible.
      
      With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to
      use for any request. The request then sets IOSQE_BUFFER_SELECT in the
      sqe, and a given group ID in sqe->buf_group. When the fd becomes ready,
      a free buffer from the specified group is selected. If none are
      available, the request is terminated with -ENOBUFS. If successful, the
      CQE on completion will contain the buffer ID chosen in the cqe->flags
      member, encoded as:
      
      	(buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;
      
      Once a buffer has been consumed by a request, it is no longer available
      and must be registered again with IORING_OP_PROVIDE_BUFFERS.
      
      Requests need to support this feature. For now, IORING_OP_READ and
      IORING_OP_RECV support it. This is checked on SQE submission, a CQE with
      res == -EOPNOTSUPP will be posted if attempted on unsupported requests.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      1d68f9f6
    • J
      io_uring: add IORING_OP_PROVIDE_BUFFERS · 72e1286a
      Jens Axboe 提交于
      to #28170604
      
      commit ddf0322db79c5984dc1a1db890f946dd19b7d6d9 upstream
      
      IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
      support passing in an addr/len that is associated with a buffer ID and
      buffer group ID. The group ID is used to index and lookup the buffers,
      while the buffer ID can be used to notify the application which buffer
      in the group was used. The addr passed in is the starting buffer address,
      and length is each buffer length. A number of buffers to add with can be
      specified, in which case addr is incremented by length for each addition,
      and each buffer increments the buffer ID specified.
      
      No validation is done of the buffer ID. If the application provides
      buffers within the same group with identical buffer IDs, then it'll have
      a hard time telling which buffer ID was used. The only restriction is
      that the buffer ID can be a max of 16-bits in size, so USHRT_MAX is the
      maximum ID that can be used.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      
      Notes: use VERIFY_WRITE for access_ok()
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      72e1286a
    • J
      io_uring: use poll driven retry for files that support it · 73411260
      Jens Axboe 提交于
      to #28170604
      
      commit d7718a9d25a61442da8ee8aeeff6a0097f0ccfd6 upstream
      
      Currently io_uring tries any request in a non-blocking manner, if it can,
      and then retries from a worker thread if we get -EAGAIN. Now that we have
      a new and fancy poll based retry backend, use that to retry requests if
      the file supports it.
      
      This means that, for example, an IORING_OP_RECVMSG on a socket no longer
      requires an async thread to complete the IO. If we get -EAGAIN reading
      from the socket in a non-blocking manner, we arm a poll handler for
      notification on when the socket becomes readable. When it does, the
      pending read is executed directly by the task again, through the io_uring
      task work handlers. Not only is this faster and more efficient, it also
      means we're not generating potentially tons of async threads that just
      sit and block, waiting for the IO to complete.
      
      The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
      pollable IO is fast, and that poll<link>other_op is fast as well.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      73411260
    • P
      io_uring: add splice(2) support · a3e58e00
      Pavel Begunkov 提交于
      to #28170604
      
      commit 7d67af2c013402537385dae343a2d0f6a4cb3bfd upstream
      
      Add support for splice(2).
      
      - output file is specified as sqe->fd, so it's handled by generic code
      - hash_reg_file handled by generic code as well
      - len is 32bit, but should be fine
      - the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which
      is a splice flag (i.e. sqe->splice_flags).
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a3e58e00
  2. 28 5月, 2020 21 次提交
  3. 27 5月, 2020 5 次提交
  4. 18 3月, 2020 10 次提交
    • J
      io_uring: add support for backlogged CQ ring · b43c8385
      Jens Axboe 提交于
      commit 1d7bb1d50fb4dc141c7431cc21fdd24ffcc83c76 upstream.
      
      Currently we drop completion events, if the CQ ring is full. That's fine
      for requests with bounded completion times, but it may make it harder or
      impossible to use io_uring with networked IO where request completion
      times are generally unbounded. Or with POLL, for example, which is also
      unbounded.
      
      After this patch, we never overflow the ring, we simply store requests
      in a backlog for later flushing. This flushing is done automatically by
      the kernel. To prevent the backlog from growing indefinitely, if the
      backlog is non-empty, we apply back pressure on IO submissions. Any
      attempt to submit new IO with a non-empty backlog will get an -EBUSY
      return from the kernel. This is a signal to the application that it has
      backlogged CQ events, and that it must reap those before being allowed
      to submit more IO.
      
      Note that if we do return -EBUSY, we will have filled whatever
      backlogged events into the CQ ring first, if there's room. This means
      the application can safely reap events WITHOUT entering the kernel and
      waiting for them, they are already available in the CQ ring.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b43c8385
    • J
      io_uring: add support for linked SQE timeouts · d4be78c6
      Jens Axboe 提交于
      commit 2665abfd757fb35a241c6f0b1ebf620e3ffb36fb upstream.
      
      While we have support for generic timeouts, we don't have a way to tie
      a timeout to a specific SQE. The generic timeouts simply trigger wakeups
      on the CQ ring.
      
      This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
      as a link to a previous command. The timeout specific can be either
      relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
      the timeout triggers before the dependent command completes, it will
      attempt to cancel that command. Likewise, if the dependent command
      completes before the timeout triggers, it will cancel the timeout.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      d4be78c6
    • J
      io_uring: support for generic async request cancel · e025ee0c
      Jens Axboe 提交于
      commit 62755e35dfb2b113c52b81cd96d01c20971c8e02 upstream.
      
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e025ee0c
    • J
      io_uring: add support for IORING_OP_ACCEPT · bdc2063b
      Jens Axboe 提交于
      commit 17f2fe35d080d8f64e86a60cdcd3a97edcbc213b upstream.
      
      This allows an application to call accept4() in an async fashion. Like
      other opcodes, we first try a non-blocking accept, then punt to async
      context if we have to.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      bdc2063b
    • J
      io_uring: add support for canceling timeout requests · f39f9b94
      Jens Axboe 提交于
      commit 11365043e5271fea4c92189a976833da477a3a44 upstream.
      
      We might have cases where the need for a specific timeout is gone, add
      support for canceling an existing timeout operation. This works like the
      POLL_REMOVE command, where the application passes in the user_data of
      the timeout it wishes to cancel in the sqe->addr field.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f39f9b94
    • J
      io_uring: add support for absolute timeouts · 17da935f
      Jens Axboe 提交于
      commit a41525ab2e75987e809926352ebc6f1397da900e upstream.
      
      This is a pretty trivial addition on top of the relative timeouts
      we have now, but it's handy for ensuring tighter timing for those
      that are building scheduling primitives on top of io_uring.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      17da935f
    • J
      io_uring: allow application controlled CQ ring size · 29635426
      Jens Axboe 提交于
      commit 33a107f0a1b8df0ad925e39d8afc97bb78e0cec1 upstream.
      
      We currently size the CQ ring as twice the SQ ring, to allow some
      flexibility in not overflowing the CQ ring. This is done because the
      SQE life time is different than that of the IO request itself, the SQE
      is consumed as soon as the kernel has seen the entry.
      
      Certain application don't need a huge SQ ring size, since they just
      submit IO in batches. But they may have a lot of requests pending, and
      hence need a big CQ ring to hold them all. By allowing the application
      to control the CQ ring size multiplier, we can cater to those
      applications more efficiently.
      
      If an application wants to define its own CQ ring size, it must set
      IORING_SETUP_CQSIZE in the setup flags, and fill out
      io_uring_params->cq_entries. The value must be a power of two.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      29635426
    • J
      io_uring: add support for IORING_REGISTER_FILES_UPDATE · 54633bb3
      Jens Axboe 提交于
      commit c3a31e605620c279163c14068a60869ea3fda203 upstream.
      
      Allows the application to remove/replace/add files to/from a file set.
      Passes in a struct:
      
      struct io_uring_files_update {
      	__u32 offset;
      	__s32 *fds;
      };
      
      that holds an array of fds, size of array passed in through the usual
      nr_args part of the io_uring_register() system call. The logic is as
      follows:
      
      1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
         the set.
      2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
         replaced with ->fds[i].
      
      For case #2, is the existing file is currently empty (fd == -1), the
      new fd is simply added to the array.
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      54633bb3
    • J
      io_uring: IORING_OP_TIMEOUT support · 94601f2b
      Jens Axboe 提交于
      commit 5262f567987d3c30052b22e78c35c2313d07b230 upstream.
      
      There's been a few requests for functionality similar to io_getevents()
      and epoll_wait(), where the user can specify a timeout for waiting on
      events. I deliberately did not add support for this through the system
      call initially to avoid overloading the args, but I can see that the use
      cases for this are valid.
      
      This adds support for IORING_OP_TIMEOUT. If a user wants to get woken
      when waiting for events, simply submit one of these timeout commands
      with your wait call (or before). This ensures that the application
      sleeping on the CQ ring waiting for events will get woken. The timeout
      command is passed in as a pointer to a struct timespec. Timeouts are
      relative. The timeout command also includes a way to auto-cancel after
      N events has passed.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      94601f2b
    • J
      io_uring: expose single mmap capability · c15b5945
      Jens Axboe 提交于
      commit ac90f249e15cd2a850daa9e36e15f81ce1ff6550 upstream.
      
      After commit 75b28affdd6a we can get by with just a single mmap to
      map both the sq and cq ring. However, userspace doesn't know that.
      
      Add a features variable to io_uring_params, and notify userspace
      that the kernel has this ability. This can then be used in liburing
      (or in applications directly) to avoid the second mmap.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      c15b5945