1. 28 5月, 2020 5 次提交
    • J
      io_uring: avoid ring quiesce for fixed file set unregister and update · 43df7c78
      Jens Axboe 提交于
      to #26323588
      
      We currently fully quiesce the ring before an unregister or update of
      the fixed fileset. This is very expensive, and we can be a bit smarter
      about this.
      
      Add a percpu refcount for the file tables as a whole. Grab a percpu ref
      when we use a registered file, and put it on completion. This is cheap
      to do. Upon removal of a file from a set, switch the ref count to atomic
      mode. When we hit zero ref on the completion side, then we know we can
      drop the previously registered files. When the old files have been
      dropped, switch the ref back to percpu mode for normal operation.
      
      Since there's a period between doing the update and the kernel being
      done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the
      same action. The application knows the update has completed when it gets
      the CQE for it. Between doing the update and receiving this completion,
      the application must continue to use the unregistered fd if submitting
      IO on this particular file.
      
      This takes the runtime of test/file-register from liburing from 14s to
      about 0.7s.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      43df7c78
    • J
      io_uring: add support for IORING_OP_CLOSE · 7a194789
      Jens Axboe 提交于
      to #26323588
      
      commit b5dba59e0cf7e2cc4d3b3b1ac5fe81ddf21959eb upstream.
      
      This works just like close(2), unsurprisingly. We remove the file
      descriptor and post the completion inline, then offload the actual
      (potential) last file put to async context.
      
      Mark the async part of this work as uncancellable, as we really must
      guarantee that the latter part of the close is run.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      7a194789
    • J
      io_uring: add support for IORING_OP_OPENAT · dd7ff1b1
      Jens Axboe 提交于
      to #26323588
      
      commit 15b71abe7b52df214785dde0de9f581cc0216d17 upstream.
      
      This works just like openat(2), except it can be performed async. For
      the normal case of a non-blocking path lookup this will complete
      inline. If we have to do IO to perform the open, it'll be done from
      async context.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      dd7ff1b1
    • A
      open: introduce openat2(2) syscall · 5b9369e5
      Aleksa Sarai 提交于
      to #26323588
      
      commit fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179 upstream.
      
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags they affect all
          path components). The current set of flags are as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVsSuggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5b9369e5
    • J
      io_uring: add support for fallocate() · 41076b20
      Jens Axboe 提交于
      to #26323588
      
      commit d63d1b5edb7b832210bfde587ba9e7549fa064eb upstream.
      
      This exposes fallocate(2) through io_uring.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      41076b20
  2. 27 5月, 2020 5 次提交
  3. 30 4月, 2020 2 次提交
  4. 26 4月, 2020 1 次提交
  5. 13 4月, 2020 3 次提交
    • A
      virtio-balloon: add support for providing free page reports to host · 671b4f34
      Alexander Duyck 提交于
      to #26589565
      
      Add support for the page reporting feature provided by virtio-balloon.
      Reporting differs from the regular balloon functionality in that is is
      much less durable than a standard memory balloon.  Instead of creating a
      list of pages that cannot be accessed the pages are only inaccessible
      while they are being indicated to the virtio interface.  Once the
      interface has acknowledged them they are placed back into their respective
      free lists and are once again accessible by the guest system.
      
      Unlike a standard balloon we don't inflate and deflate the pages.  Instead
      we perform the reporting, and once the reporting is completed it is
      assumed that the page has been dropped from the guest and will be faulted
      back in the next time the page is accessed.
      
      Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      671b4f34
    • W
      virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON · 67595fa8
      Wei Wang 提交于
      to #26589565
      
      commit 2e991629bcf55a43681aec1ee096eeb03cf81709 upstream
      
      The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
      guest is using page poisoning. Guest writes to the poison_val config
      field to tell host about the page poisoning value that is in use.
      Suggested-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NWei Wang <wei.w.wang@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      67595fa8
    • W
      virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT · 0db17e11
      Wei Wang 提交于
      to #26589565
      
      commit 86a559787e6f5cf662c081363f64a20cad654195 upstream
      
      Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
      support of reporting hints of guest free pages to host via virtio-balloon.
      Currenlty, only free page blocks of MAX_ORDER - 1 are reported. They are
      obtained one by one from the mm free list via the regular allocation
      function.
      
      Host requests the guest to report free page hints by sending a new cmd id
      to the guest via the free_page_report_cmd_id configuration register. When
      the guest starts to report, it first sends a start cmd to host via the
      free page vq, which acks to host the cmd id received. When the guest
      finishes reporting free pages, a stop cmd is sent to host via the vq.
      Host may also send a stop cmd id to the guest to stop the reporting.
      
      VIRTIO_BALLOON_CMD_ID_STOP: Host sends this cmd to stop the guest reporting.
      VIRTIO_BALLOON_CMD_ID_DONE: Host sends this cmd to tell the guest that
      the reported pages are ready to be freed.
      
      Why does the guest free the reported pages when host tells it is ready to free?
      This is because freeing pages appears to be expensive for live migration.
      free_pages() dirties memory very quickly and makes the live migraion not
      converge in some cases. So it is good to delay the free_page operation
      when the migration is done, and host sends a command to guest about that.
      
      Why do we need the new VIRTIO_BALLOON_CMD_ID_DONE, instead of reusing
      VIRTIO_BALLOON_CMD_ID_STOP?
      This is because live migration is usually done in several rounds. At the
      end of each round, host needs to send a VIRTIO_BALLOON_CMD_ID_STOP cmd to
      the guest to stop (or say pause) the reporting. The guest resumes the
      reporting when it receives a new command id at the beginning of the next
      round. So we need a new cmd id to distinguish between "stop reporting" and
      "ready to free the reported pages".
      
      TODO:
      - Add a batch page allocation API to amortize the allocation overhead.
      Signed-off-by: NWei Wang <wei.w.wang@intel.com>
      Signed-off-by: NLiang Li <liang.z.li@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      0db17e11
  6. 18 3月, 2020 18 次提交
  7. 17 1月, 2020 6 次提交
    • J
      io_uring: add support for IORING_OP_POLL · 51de0e8f
      Jens Axboe 提交于
      commit 221c5eb2338232f7340386de1c43decc32682e58 upstream.
      
      This is basically a direct port of bfe4037e, which implements a
      one-shot poll command through aio. Description below is based on that
      commit as well. However, instead of adding a POLL command and relying
      on io_cancel(2) to remove it, we mimic the epoll(2) interface of
      having a command to add a poll notification, IORING_OP_POLL_ADD,
      and one to remove it again, IORING_OP_POLL_REMOVE.
      
      To poll for a file descriptor the application should submit an sqe of
      type IORING_OP_POLL. It will poll the fd for the events specified in the
      poll_events field.
      
      Unlike poll or epoll without EPOLLONESHOT this interface always works in
      one shot mode, that is once the sqe is completed, it will have to be
      resubmitted.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Based-on-code-from: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      51de0e8f
    • J
      io_uring: add submission polling · aa124ba8
      Jens Axboe 提交于
      commit 6c271ce2f1d572f7fa225700a13cfe7ced492434 upstream.
      
      This enables an application to do IO, without ever entering the kernel.
      By using the SQ ring to fill in new sqes and watching for completions
      on the CQ ring, we can submit and reap IOs without doing a single system
      call. The kernel side thread will poll for new submissions, and in case
      of HIPRI/polled IO, it'll also poll for completions.
      
      By default, we allow 1 second of active spinning. This can by changed
      by passing in a different grace period at io_uring_register(2) time.
      If the thread exceeds this idle time without having any work to do, it
      will set:
      
      sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
      
      The application will have to call io_uring_enter() to start things back
      up again. If IO is kept busy, that will never be needed. Basically an
      application that has this feature enabled will guard it's
      io_uring_enter(2) call with:
      
      read_barrier();
      if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
      	io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
      
      instead of calling it unconditionally.
      
      It's mandatory to use fixed files with this feature. Failure to do so
      will result in the application getting an -EBADF CQ entry when
      submitting IO.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      aa124ba8
    • J
      io_uring: add file set registration · 7bfbdad6
      Jens Axboe 提交于
      commit 6b06314c47e141031be043539900d80d2c7ba10f upstream.
      
      We normally have to fget/fput for each IO we do on a file. Even with
      the batching we do, the cost of the atomic inc/dec of the file usage
      count adds up.
      
      This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
      for the io_uring_register(2) system call. The arguments passed in must
      be an array of __s32 holding file descriptors, and nr_args should hold
      the number of file descriptors the application wishes to pin for the
      duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
      called).
      
      When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
      member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
      to the index in the array passed in to IORING_REGISTER_FILES.
      
      Files are automatically unregistered when the io_uring instance is torn
      down. An application need only unregister if it wishes to register a new
      set of fds.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      7bfbdad6
    • J
      io_uring: add support for pre-mapped user IO buffers · a078ed69
      Jens Axboe 提交于
      commit edafccee56ff31678a091ddb7219aba9b28bc3cb upstream.
      
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      a078ed69
    • J
      io_uring: support for IO polling · c3440f68
      Jens Axboe 提交于
      commit def596e9557c91d9846fc4d84d26f2c564644416 upstream.
      
      Add support for a polled io_uring instance. When a read or write is
      submitted to a polled io_uring, the application must poll for
      completions on the CQ ring through io_uring_enter(2). Polled IO may not
      generate IRQ completions, hence they need to be actively found by the
      application itself.
      
      To use polling, io_uring_setup() must be used with the
      IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
      polled and non-polled IO on an io_uring.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      c3440f68
    • C
      io_uring: add fsync support · cb0d3740
      Christoph Hellwig 提交于
      commit c992fe2925d776be066d9f6cc13f9ea11d78b657 upstream.
      
      Add a new fsync opcode, which either syncs a range if one is passed,
      or the whole file if the offset and length fields are both cleared
      to zero.  A flag is provided to use fdatasync semantics, that is only
      force out metadata which is required to retrieve the file data, but
      not others like metadata.
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      cb0d3740