1. 12 Jan, 2020 (24 commits)
    • block: add BIO_NO_PAGE_REF flag · ab780c9b
      Committed by Jens Axboe
      commit 399254aaf4892113c806816f7e64cf40c804d46d upstream.
      
      If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
      with NO_REF, then we don't need to add a page reference for the pages
      that we add.
      
      Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
      not to drop a reference to these pages.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
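      As a rough sketch (not the exact upstream hunk), the release path
      at IO completion can honor the flag before dropping page references:

      static void bio_release_pages(struct bio *bio)
      {
      	struct bio_vec *bvec;
      	int i;

      	/* pages of a BIO_NO_PAGE_REF bio were never referenced */
      	if (bio_flagged(bio, BIO_NO_PAGE_REF))
      		return;

      	bio_for_each_segment_all(bvec, bio, i)
      		put_page(bvec->bv_page);
      }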
    • iov_iter: add ITER_BVEC_FLAG_NO_REF flag · 7874dda9
      Committed by Jens Axboe
      commit 875f1d0769cdcfe1596ff0ca609b453359e42ec9 upstream.
      
      For ITER_BVEC, if we're holding on to kernel pages, the caller
      doesn't need to grab a reference to the bvec pages, and drop that
      same reference on IO completion. This is essentially safe for any
      ITER_BVEC, but some use cases end up reusing pages and unconditionally
      dropping a page reference on completion. An example of that is
      sendfile(2), that ends up being a splice_in + splice_out on the
      pipe pages.
      
      Add a flag that tells us it's fine to not grab a page reference
      to the bvec pages, since that caller knows not to drop a reference
      when it's done with the pages.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
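      A minimal sketch of the consumer-side check, assuming the flag is
      carried in the high bits of the iter type as described:

      static inline bool iov_iter_bvec_no_ref(const struct iov_iter *i)
      {
      	return (i->type & ITER_BVEC_FLAG_NO_REF) != 0;
      }

      bio_iov_iter_get_pages() can then skip the get_page()/put_page()
      pair whenever this returns true.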
    • io_uring: mark me as the maintainer · cc590c48
      Committed by Jens Axboe
      commit bf33a7699e992b12d4c7d39dc3f0b61f6b26c5c2 upstream.
      
      And mark io_uring as maintained in general.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: retry bulk slab allocs as single allocs · c7346c03
      Committed by Jens Axboe
      commit fd6fab2cb78d3b6023c26ec53e0aa6f0b477d2f7 upstream.
      
      I've seen cases where bulk alloc fails, since the bulk alloc API
      is all-or-nothing - either we get the number we ask for, or it
      returns 0 as number of entries.
      
      If we fail a batch bulk alloc, retry a "normal" kmem_cache_alloc()
      and just use that instead of failing with -EAGAIN.
      
      While in there, ensure we use GFP_KERNEL. That was an oversight in
      the original code, when we switched away from GFP_ATOMIC.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
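      In sketch form, with illustrative local names rather than the exact
      upstream code, the fallback looks like this:

      void *reqs[16];
      struct io_kiocb *req;
      int nr = 16, ret;

      ret = kmem_cache_alloc_bulk(req_cachep, GFP_KERNEL, nr, reqs);
      if (ret == 0) {
      	/* bulk alloc is all-or-nothing: retry as a single allocation */
      	req = kmem_cache_alloc(req_cachep, GFP_KERNEL);
      	if (!req)
      		return -EAGAIN;
      } else {
      	req = reqs[0];
      }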
    • io_uring: fix poll races · 28ace2a6
      Committed by Jens Axboe
      commit 8c838788775a593527803786d376393b7c28f589 upstream.
      
      This is a straight port of Al's fix for the aio poll implementation,
      since the io_uring version is heavily based on that. The below
      description is almost straight from that patch, just modified to
      fit the io_uring situation.
      
      io_poll() has to cope with several unpleasant problems:
      	* requests that might stay around indefinitely need to
      be made visible for io_cancel(2); that must not be done to
      a request already completed, though.
      	* in cases when ->poll() has placed us on a waitqueue,
      wakeup might have happened (and request completed) before ->poll()
      returns.
      	* worse, in some early wakeup cases request might end
      up re-added into the queue later - we can't treat "woken up and
      currently not in the queue" as "it's not going to stick around
      indefinitely"
      	* ... moreover, ->poll() might have decided not to
      put it on any queues to start with, and that needs to be distinguished
      from the previous case
      	* ->poll() might have tried to put us on more than one queue.
      Only the first will succeed for io poll, so we might end up missing
      wakeups.  OTOH, we might very well notice that only after the
      wakeup hits and request gets completed (all before ->poll() gets
      around to the second poll_wait()).  In that case it's too late to
      decide that we have an error.
      
      req->woken was an attempt to deal with that.  Unfortunately, it was
      broken.  What we need to keep track of is not that wakeup has happened -
      the thing might come back after that.  It's that async reference is
      already gone and won't come back, so we can't (and needn't) put the
      request on the list of cancellables.
      
      The easiest case is "request hadn't been put on any waitqueues"; we
      can tell by seeing NULL apt.head, and in that case there won't be
      anything async.  We should either complete the request ourselves
      (if vfs_poll() reports anything of interest) or return an error.
      
      In all other cases we get exclusion with wakeups by grabbing the
      queue lock.
      
      If request is currently on queue and we have something interesting
      from vfs_poll(), we can steal it and complete the request ourselves.
      
      If it's on queue and vfs_poll() has not reported anything interesting,
      we either put it on the cancellable list, or, if we know that it
      hadn't been put on all queues ->poll() wanted it on, we steal it and
      return an error.
      
      If it's _not_ on queue, it's either been already dealt with (in which
      case we do nothing), or there's io_poll_complete_work() about to be
      executed.  In that case we either put it on the cancellable list,
      or, if we know it hadn't been put on all queues ->poll() wanted it on,
      simulate what cancel would've done.
      
      Fixes: 221c5eb23382 ("io_uring: add support for IORING_OP_POLL")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
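      A pattern sketch only (the field and list names here are made up,
      and the real fix is more involved): the point is that the request's
      fate is decided while holding the waitqueue lock, so the decision
      cannot race with a wakeup.

      spin_lock(&head->lock);
      if (req->done) {
      	/* wakeup already completed it; nothing to track */
      } else if (mask) {
      	/* steal it from the queue and complete it ourselves */
      	list_del_init(&req->wait.entry);
      } else {
      	/* park it where cancellation can find it */
      	list_add_tail(&req->list, &ctx->cancel_list);
      }
      spin_unlock(&head->lock);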
    • io_uring: fix fget/fput handling · 9b0f9fca
      Committed by Jens Axboe
      commit 09bb839434bd845c01da3d159b0c126fe7fa90da upstream.
      
      This isn't a straight port of commit 84c4e1f89fef for aio.c, since
      io_uring doesn't use files in exactly the same way. But it's pretty
      close. See the commit message for that commit.
      
      This essentially fixes a use-after-free with the poll command
      handling, but it takes a cue from Linus's approach of just simplifying
      the file handling. We move the setup of the file into a higher level
      location, so the individual commands don't have to deal with it. And
      then we release the reference when we free the associated io_kiocb.
      
      Fixes: 221c5eb23382 ("io_uring: add support for IORING_OP_POLL")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
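      In outline (hypothetical helper names), the file is resolved once at
      submission time and released only when the request itself is freed:

      static int io_req_set_file(struct io_kiocb *req,
      			   const struct io_uring_sqe *sqe)
      {
      	req->file = fget(READ_ONCE(sqe->fd));
      	return req->file ? 0 : -EBADF;
      }

      static void io_free_req(struct io_kiocb *req)
      {
      	if (req->file)
      		fput(req->file);
      	kmem_cache_free(req_cachep, req);
      }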
    • io_uring: add prepped flag · b81024e3
      Committed by Jens Axboe
      commit d530a402a114efcf6d2b88d7f628856dade5b90b upstream.
      
      We currently use the fact that if ->ki_filp is already set, then we've
      done the prep. In preparation for moving the file assignment earlier,
      use a separate flag to tell whether the request has been prepped for
      IO or not.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
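      The shape of the change, as a sketch (helper name illustrative):

      if (!(req->flags & REQ_F_PREPPED)) {	/* was: if (!kiocb->ki_filp) */
      	ret = io_prep_rw(req, sqe, force_nonblock);
      	if (ret)
      		return ret;
      	req->flags |= REQ_F_PREPPED;
      }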
    • io_uring: make io_read/write return an integer · b645ef91
      Committed by Jens Axboe
      commit e0c5c576d5074b5bb7b1b4b59848c25ceb521331 upstream.
      
      The callers all convert to an integer, and we only return 0/-ERROR
      anyway.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: use regular request ref counts · d0cc10c5
      Committed by Jens Axboe
      commit e65ef56db4945fb18a0d522e056c02ddf939e644 upstream.
      
      Get rid of the special casing of "normal" requests not having
      any references to the io_kiocb. We initialize the ref count to 2,
      one for the submission side, and one for the completion side.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
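      Sketched out, every request now carries two references from birth:

      refcount_set(&req->refs, 2);	/* one for submit, one for complete */

      static void io_put_req(struct io_kiocb *req)
      {
      	if (refcount_dec_and_test(&req->refs))
      		io_free_req(req);
      }

      The submission path drops its reference once the sqe has been
      issued, and completion drops the other, freeing the request.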
    • io_uring: add a few test tools · 499dd0f6
      Committed by Jens Axboe
      commit 21b4aa5d20fd07207e73270cadffed5c63fb4343 upstream.
      
      This adds two test programs in tools/io_uring/ that demonstrate both
      the raw io_uring API (and all features) through a small benchmark
      app, io_uring-bench, and the liburing exposed API in a simplified
      cp(1) implementation through io_uring-cp.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: allow workqueue item to handle multiple buffered requests · 465c847c
      Committed by Jens Axboe
      commit 31b515106428b9717d2b6475b6f6182cf231b1e6 upstream.
      
      Right now we punt any buffered request that ends up triggering an
      -EAGAIN to an async workqueue. This works fine in terms of providing
      async execution of them, but it also can create quite a lot of work
      queue items. For sequentially buffered IO, it's advantageous to
      serialize the issue of them. For reads, the first one will trigger a
      read-ahead, and subsequent requests merely end up waiting on later pages
      to complete. For writes, devices usually respond better to streamed
      sequential writes.
      
      Add state to track the last buffered request we punted to a work queue,
      and if the next one is sequential to the previous, attempt to get the
      previous work item to handle it. We limit the number of sequential
      add-ons to a multiple (8) of the max read-ahead size of the file.
      This should be a good number for both reads and writes, as it defines the
      max IO size the device can do directly.
      
      This drastically cuts down on the number of context switches we need to
      handle buffered sequential IO, and a basic test case of copying a big
      file with io_uring sees a 5x speedup.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: add support for IORING_OP_POLL · 4a6205ae
      Committed by Jens Axboe
      commit 221c5eb2338232f7340386de1c43decc32682e58 upstream.
      
      This is basically a direct port of bfe4037e, which implements a
      one-shot poll command through aio. Description below is based on that
      commit as well. However, instead of adding a POLL command and relying
      on io_cancel(2) to remove it, we mimic the epoll(2) interface of
      having a command to add a poll notification, IORING_OP_POLL_ADD,
      and one to remove it again, IORING_OP_POLL_REMOVE.
      
      To poll for a file descriptor the application should submit an sqe of
      type IORING_OP_POLL_ADD. It will poll the fd for the events specified in the
      poll_events field.
      
      Unlike poll or epoll without EPOLLONESHOT this interface always works in
      one shot mode, that is once the sqe is completed, it will have to be
      resubmitted.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Based-on-code-from: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
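      A minimal userspace example using the liburing helpers that wrap
      this opcode (assumes liburing is installed; error handling on setup
      elided for brevity):

      #include <liburing.h>
      #include <poll.h>

      static int wait_readable(int fd)
      {
      	struct io_uring ring;
      	struct io_uring_sqe *sqe;
      	struct io_uring_cqe *cqe;
      	int ret;

      	io_uring_queue_init(8, &ring, 0);
      	sqe = io_uring_get_sqe(&ring);
      	io_uring_prep_poll_add(sqe, fd, POLLIN);
      	io_uring_submit(&ring);
      	ret = io_uring_wait_cqe(&ring, &cqe);
      	if (ret == 0) {
      		ret = cqe->res;		/* the triggered event mask */
      		io_uring_cqe_seen(&ring, cqe);
      	}
      	io_uring_queue_exit(&ring);
      	return ret;
      }

      Because the poll is one-shot, another IORING_OP_POLL_ADD must be
      submitted to arm it again.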
    • io_uring: add io_kiocb ref count · 688844fa
      Committed by Jens Axboe
      commit c16361c1d805b6ea50c3c1fc5c314e944c71a984 upstream.
      
      We'll use this for the POLL implementation. Regular requests will
      NOT be using references, so initialize it to 0. Any real use of
      the io_kiocb ref will initialize it to at least 2.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: add submission polling · 9b6956ca
      Committed by Jens Axboe
      commit 6c271ce2f1d572f7fa225700a13cfe7ced492434 upstream.
      
      This enables an application to do IO, without ever entering the kernel.
      By using the SQ ring to fill in new sqes and watching for completions
      on the CQ ring, we can submit and reap IOs without doing a single system
      call. The kernel side thread will poll for new submissions, and in case
      of HIPRI/polled IO, it'll also poll for completions.
      
      By default, we allow 1 second of active spinning. This can be changed
      by passing in a different grace period at io_uring_register(2) time.
      If the thread exceeds this idle time without having any work to do, it
      will set:
      
      sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
      
      The application will have to call io_uring_enter() to start things back
      up again. If IO is kept busy, that will never be needed. Basically an
      application that has this feature enabled will guard its
      io_uring_enter(2) call with:
      
      read_barrier();
      if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
      	io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
      
      instead of calling it unconditionally.
      
      It's mandatory to use fixed files with this feature. Failure to do so
      will result in the application getting an -EBADF CQ entry when
      submitting IO.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
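      A setup sketch using liburing; note that current liburing sets the
      idle grace period through io_uring_params at setup time:

      #include <liburing.h>
      #include <string.h>

      static int setup_sqpoll(struct io_uring *ring)
      {
      	struct io_uring_params p;

      	memset(&p, 0, sizeof(p));
      	p.flags = IORING_SETUP_SQPOLL;
      	p.sq_thread_idle = 2000;	/* ms of idle before NEED_WAKEUP */
      	return io_uring_queue_init_params(64, ring, &p);
      }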
    • io_uring: add file set registration · fae40a9c
      Committed by Jens Axboe
      commit 6b06314c47e141031be043539900d80d2c7ba10f upstream.
      
      We normally have to fget/fput for each IO we do on a file. Even with
      the batching we do, the cost of the atomic inc/dec of the file usage
      count adds up.
      
      This adds IORING_REGISTER_FILES and IORING_UNREGISTER_FILES opcodes
      for the io_uring_register(2) system call. The arguments passed in must
      be an array of __s32 holding file descriptors, and nr_args should hold
      the number of file descriptors the application wishes to pin for the
      duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
      called).
      
      When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
      member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
      to the index in the array passed in to IORING_REGISTER_FILES.
      
      Files are automatically unregistered when the io_uring instance is torn
      down. An application need only unregister if it wishes to register a new
      set of fds.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
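      Usage sketch with liburing (assumes an initialized ring and two
      open descriptors fd_a and fd_b; names illustrative):

      int fds[2] = { fd_a, fd_b };
      struct io_uring_sqe *sqe;
      char buf[4096];
      struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

      io_uring_register_files(&ring, fds, 2);

      sqe = io_uring_get_sqe(&ring);
      /* 0 here is the index into fds[], not a real file descriptor */
      io_uring_prep_readv(sqe, 0, &iov, 1, 0);
      sqe->flags |= IOSQE_FIXED_FILE;
      io_uring_submit(&ring);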
    • net: split out functions related to registering inflight socket files · 4cac9e97
      Committed by Jens Axboe
      commit f4e65870e5cede5ca1ec0006b6c9803994e5f7b8 upstream.
      
      We need this functionality for the io_uring file registration, but
      we cannot rely on it since CONFIG_UNIX can be modular. Move the helpers
      to a separate file that is always built into the kernel if CONFIG_UNIX
      is m/y.
      
      No functional changes in this patch, just moving code around.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: add support for pre-mapped user IO buffers · 89043c8b
      Committed by Jens Axboe
      commit edafccee56ff31678a091ddb7219aba9b28bc3cb upstream.
      
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
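      Usage sketch with liburing (ring already initialized; buf must not
      be file backed, per the restriction above):

      struct iovec iov = { .iov_base = buf, .iov_len = 65536 };
      struct io_uring_sqe *sqe;

      io_uring_register_buffers(&ring, &iov, 1);

      sqe = io_uring_get_sqe(&ring);
      /* the last argument selects the registered buffer by index */
      io_uring_prep_read_fixed(sqe, fd, buf, 4096, 0, 0);
      io_uring_submit(&ring);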
    • block: implement bio helper to add iter bvec pages to bio · df1d15a6
      Committed by Jens Axboe
      commit 6d0c48aede85e38316d0251564cab39cbc2422f6 upstream.
      
      For an ITER_BVEC, we can just iterate the iov and add the pages
      to the bio directly. For now, we grab a reference to those pages,
      and release them normally on IO completion. This isn't really needed
      for the normal case of O_DIRECT from/to a file, but some of the more
      esoteric use cases (like splice(2)) will unconditionally put the
      pipe buffer pages when the buffers are released. Until we can manage
      that case properly, ITER_BVEC pages are treated like normal pages
      in terms of reference counting.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
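      The helper is roughly shaped like this (a simplified sketch of the
      per-segment step, with an illustrative name, not the full function):

      static int bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
      {
      	const struct bio_vec *bv = iter->bvec;
      	size_t len = min_t(size_t, bv->bv_len - iter->iov_offset,
      			   iter->count);

      	get_page(bv->bv_page);	/* dropped again at IO completion */
      	if (bio_add_page(bio, bv->bv_page, len,
      			 bv->bv_offset + iter->iov_offset) != len)
      		return -EINVAL;
      	iov_iter_advance(iter, len);
      	return 0;
      }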
    • io_uring: batch io_kiocb allocation · 40159116
      Committed by Jens Axboe
      commit 2579f913d41a086563bb81762c519f3d62ddee37 upstream.
      
      Similarly to how we use the state->ios_left to know how many references
      to get to a file, we can use it to allocate the io_kiocb's we need in
      bulk.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • io_uring: use fget/fput_many() for file references · a4c44f2b
      Committed by Jens Axboe
      commit 9a56a2323dbbd8ed7f380a5af7ae3ff82caa55a6 upstream.
      
      Add a separate io_submit_state structure, to cache some of the things
      we need for IO submission.
      
      One such example is file reference batching. We get as many
      references as the number of sqes we are submitting, and drop unused
      ones if we end up switching files. The assumption here is that
      we're usually only dealing with one fd, and if there are multiple,
      hopefully they are at least somewhat ordered. Could trivially be extended
      to cover multiple fds, if needed.
      
      On the completion side we do the same thing, except this is trivially
      done just locally in io_iopoll_reap().
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • fs: add fget_many() and fput_many() · bbc4ceed
      Committed by Jens Axboe
      commit 091141a42e15fe47ada737f3996b317072afcefb upstream.
      
      Some use cases repeatedly get and put references to the same file, but
      the only exposed interface does these one at a time. As each of
      these entail an atomic inc or dec on a shared structure, that cost can
      add up.
      
      Add fget_many(), which works just like fget(), except it takes an
      argument for how many references to get on the file. Ditto fput_many(),
      which can drop an arbitrary number of references to a file.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
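      The new interfaces are counted variants of fget()/fput():

      struct file *fget_many(unsigned int fd, unsigned int refs);
      void fput_many(struct file *file, unsigned int refs);

      A batched caller can take references for a whole submission up
      front, then return the unused ones in a single atomic operation
      (sketch; nr_ios and unused are illustrative):

      struct file *file = fget_many(fd, nr_ios);

      /* issue requests against file, counting how many refs went unused */

      if (unused)
      	fput_many(file, unused);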
    • io_uring: support for IO polling · ccdb9931
      Committed by Jens Axboe
      commit def596e9557c91d9846fc4d84d26f2c564644416 upstream.
      
      Add support for a polled io_uring instance. When a read or write is
      submitted to a polled io_uring, the application must poll for
      completions on the CQ ring through io_uring_enter(2). Polled IO may not
      generate IRQ completions, hence they need to be actively found by the
      application itself.
      
      To use polling, io_uring_setup() must be used with the
      IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
      polled and non-polled IO on an io_uring.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
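      Setup sketch with liburing; the fd must be opened with O_DIRECT and
      buf suitably aligned for it, and the wait call below reaps
      completions by actively polling via io_uring_enter(2) rather than
      sleeping on an IRQ:

      struct io_uring ring;
      struct io_uring_sqe *sqe;
      struct io_uring_cqe *cqe;
      struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

      io_uring_queue_init(32, &ring, IORING_SETUP_IOPOLL);
      sqe = io_uring_get_sqe(&ring);
      io_uring_prep_readv(sqe, fd, &iov, 1, 0);
      io_uring_submit(&ring);
      io_uring_wait_cqe(&ring, &cqe);	/* actively polls for completion */
      io_uring_cqe_seen(&ring, cqe);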
    • io_uring: add fsync support · 8bcb6bcc
      Committed by Christoph Hellwig
      commit c992fe2925d776be066d9f6cc13f9ea11d78b657 upstream.
      
      Add a new fsync opcode, which either syncs a range if one is passed,
      or the whole file if the offset and length fields are both cleared
      to zero.  A flag is provided to use fdatasync semantics, that is, only
      force out the metadata required to retrieve the file data, but not
      other metadata such as timestamps.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
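      Usage sketch with liburing (ring and fd assumed set up):

      sqe = io_uring_get_sqe(&ring);
      io_uring_prep_fsync(sqe, fd, 0);	/* full fsync semantics */
      /* or, for fdatasync semantics: */
      /* io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC); */
      io_uring_submit(&ring);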
    • Add io_uring IO interface · 67beb508
      Committed by Jens Axboe
      commit 2b188cc1bb857a9d4701ae59aa7768b5124e262e upstream.
      
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
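      A minimal raw-syscall setup sketch (no liburing), mapping the three
      regions the returned descriptor exposes; __NR_io_uring_setup is
      available from <sys/syscall.h> on kernels and libcs that carry the
      syscall:

      #include <linux/io_uring.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      static int ring_setup(unsigned entries, struct io_uring_params *p,
      		      void **sq_ring, void **cq_ring, void **sqes)
      {
      	int fd;

      	memset(p, 0, sizeof(*p));
      	fd = syscall(__NR_io_uring_setup, entries, p);
      	if (fd < 0)
      		return fd;

      	/* SQ ring: indices into the separately mapped sqe array */
      	*sq_ring = mmap(NULL, p->sq_off.array + p->sq_entries * sizeof(__u32),
      			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
      			fd, IORING_OFF_SQ_RING);
      	/* CQ ring: the io_uring_cqe array lives at cq_off.cqes */
      	*cq_ring = mmap(NULL, p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe),
      			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
      			fd, IORING_OFF_CQ_RING);
      	*sqes = mmap(NULL, p->sq_entries * sizeof(struct io_uring_sqe),
      		     PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
      		     fd, IORING_OFF_SQES);
      	return fd;
      }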
  2. 09 Jan, 2020 (14 commits)
  3. 07 Jan, 2020 (2 commits)