1. May 28, 2020 (2 commits)
    • open: introduce openat2(2) syscall · 5b9369e5
      Authored by Aleksa Sarai
      to #26323588
      
      commit fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179 upstream.
      
      /* Background. */
      For a very long time, extending openat(2) with new features has been
      incredibly frustrating. This stems from the fact that openat(2) is
      possibly the most famous counter-example to the mantra "don't silently
      accept garbage from userspace" -- it doesn't check whether unknown flags
      are present[1].
      
      This means that (generally) the addition of new flags to openat(2) has
      been fraught with backwards-compatibility issues (O_TMPFILE has to be
      defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
      kernels gave errors, since it's insecure to silently ignore the
      flag[2]). All new security-related flags therefore have a tough road to
      being added to openat(2).
      
      Userspace also has a hard time figuring out whether a particular flag is
      supported on a particular kernel. While it is now possible with
      contemporary kernels (thanks to [3]), older kernels will expose unknown
      flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
      openat(2) time matches modern syscall designs and is far more
      fool-proof.
      
      In addition, the newly-added path resolution restriction LOOKUP flags
      (which we would like to expose to user-space) don't feel related to the
      pre-existing O_* flag set -- they affect all components of path lookup.
      We'd therefore like to add a new flag argument.
      
      Adding a new syscall allows us to finally fix the flag-ignoring problem,
      and we can make it extensible enough so that we will hopefully never
      need an openat3(2).
      
      /* Syscall Prototype. */
        /*
         * open_how is an extensible structure (similar in interface to
         * clone3(2) or sched_setattr(2)). The size parameter must be set to
         * sizeof(struct open_how), to allow for future extensions. All future
         * extensions will be appended to open_how, with their zero value
         * acting as a no-op default.
         */
        struct open_how { /* ... */ };
      
        int openat2(int dfd, const char *pathname,
                    struct open_how *how, size_t size);
      
      /* Description. */
      The initial version of 'struct open_how' contains the following fields:
      
        flags
          Used to specify openat(2)-style flags. However, any unknown flag
          bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
          will result in -EINVAL. In addition, this field is 64-bits wide to
          allow for more O_ flags than currently permitted with openat(2).
      
        mode
          The file mode for O_CREAT or O_TMPFILE.
      
          Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.
      
        resolve
          Restrict path resolution (in contrast to O_* flags, these affect all
          path components). The current set of flags is as follows (at the
          moment, all of the RESOLVE_ flags are implemented as just passing
          the corresponding LOOKUP_ flag).
      
          RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
          RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
          RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
          RESOLVE_BENEATH       => LOOKUP_BENEATH
          RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT
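
      For illustration (not part of the original submission), a caller without a
      libc wrapper can invoke the new syscall directly through syscall(2). The
      directory and path below are made up, and the sketch assumes
      <linux/openat2.h> provides struct open_how and the RESOLVE_* bits:

        /* Sketch only: open "etc/config" strictly beneath dirfd. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/openat2.h>

        #ifndef __NR_openat2
        #define __NR_openat2 437        /* generic syscall number */
        #endif

        int main(void)
        {
            struct open_how how;
            int dirfd = open("/srv/data", O_PATH | O_DIRECTORY);

            memset(&how, 0, sizeof(how));   /* unused fields must be zero */
            how.flags = O_RDONLY;
            how.resolve = RESOLVE_IN_ROOT;  /* treat dirfd as the resolution root */

            int fd = syscall(__NR_openat2, dirfd, "etc/config", &how, sizeof(how));
            if (fd < 0)
                perror("openat2");          /* e.g. EINVAL for unknown flags */
            else
                close(fd);
            return 0;
        }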
      
      open_how does not contain an embedded size field, because it is of
      little benefit (userspace can figure out the kernel open_how size at
      runtime fairly easily without it). It also only contains u64s (even
      though ->mode arguably should be a u16) to avoid having padding fields
      which are never used in the future.
      
      Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
      is no longer permitted for openat(2). As far as I can tell, this has
      always been a bug and appears to not be used by userspace (and I've not
      seen any problems on my machines by disallowing it). If it turns out
      this breaks something, we can special-case it and only permit it for
      openat(2) but not openat2(2).
      
      After input from Florian Weimer, the new open_how and flag definitions
      are inside a separate header from uapi/linux/fcntl.h, to avoid problems
      that glibc has with importing that header.
      
      /* Testing. */
      In a follow-up patch there are over 200 selftests which ensure that this
      syscall has the correct semantics and will correctly handle several
      attack scenarios.
      
      In addition, I've written a userspace library[4] which provides
      convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
      because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
      must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
      syscalls). During the development of this patch, I've run numerous
      verification tests using libpathrs (showing that the API is reasonably
      usable by userspace).
      
      /* Future Work. */
      Additional RESOLVE_ flags have been suggested during the review period.
      These can be easily implemented separately (such as blocking auto-mount
      during resolution).
      
      Furthermore, there are some other proposed changes to the openat(2)
      interface (the most obvious example is magic-link hardening[5]) which
      would be a good opportunity to add a way for userspace to restrict how
      O_PATH file descriptors can be re-opened.
      
      Another possible avenue of future work would be some kind of
      CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
      which openat2(2) flags and fields are supported by the current kernel
      (to avoid userspace having to go through several guesses to figure it
      out).
      
      [1]: https://lwn.net/Articles/588444/
      [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
      [3]: commit 629e014b ("fs: completely ignore unknown open flags")
      [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
      [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
      [6]: https://youtu.be/ggD-eb3yPVs
      Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5b9369e5
    • io_uring: add support for fallocate() · 41076b20
      Authored by Jens Axboe
      to #26323588
      
      commit d63d1b5edb7b832210bfde587ba9e7549fa064eb upstream.
      
      This exposes fallocate(2) through io_uring.
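
      As a rough sketch of what this enables (liburing's io_uring_prep_fallocate
      helper and the file name are assumed here, they are not part of this patch),
      an application could preallocate space asynchronously:

        #include <fcntl.h>
        #include <liburing.h>
        #include <stdio.h>

        int main(void)
        {
            struct io_uring ring;
            struct io_uring_cqe *cqe;
            int fd = open("testfile", O_RDWR | O_CREAT, 0644);

            io_uring_queue_init(8, &ring, 0);
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_fallocate(sqe, fd, 0 /* mode */, 0 /* offset */, 1 << 20);

            io_uring_submit(&ring);
            io_uring_wait_cqe(&ring, &cqe);
            printf("fallocate result: %d\n", cqe->res);  /* 0 on success */
            io_uring_cqe_seen(&ring, cqe);
            io_uring_queue_exit(&ring);
            return 0;
        }
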
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      41076b20
  2. May 27, 2020 (5 commits)
  3. Apr 30, 2020 (2 commits)
  4. Apr 26, 2020 (1 commit)
  5. Apr 13, 2020 (3 commits)
    • virtio-balloon: add support for providing free page reports to host · 671b4f34
      Authored by Alexander Duyck
      to #26589565
      
      Add support for the page reporting feature provided by virtio-balloon.
      Reporting differs from the regular balloon functionality in that it is
      much less durable than a standard memory balloon.  Instead of creating a
      list of pages that cannot be accessed, the pages are only inaccessible
      while they are being indicated to the virtio interface.  Once the
      interface has acknowledged them they are placed back into their respective
      free lists and are once again accessible by the guest system.
      
      Unlike a standard balloon we don't inflate and deflate the pages.  Instead
      we perform the reporting, and once the reporting is completed it is
      assumed that the page has been dropped from the guest and will be faulted
      back in the next time the page is accessed.
      
      Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomain
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      671b4f34
    • virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON · 67595fa8
      Authored by Wei Wang
      to #26589565
      
      commit 2e991629bcf55a43681aec1ee096eeb03cf81709 upstream
      
      The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
      guest is using page poisoning. Guest writes to the poison_val config
      field to tell host about the page poisoning value that is in use.
      Suggested-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      67595fa8
    • virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT · 0db17e11
      Authored by Wei Wang
      to #26589565
      
      commit 86a559787e6f5cf662c081363f64a20cad654195 upstream
      
      Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
      support of reporting hints of guest free pages to host via virtio-balloon.
      Currently, only free page blocks of MAX_ORDER - 1 are reported. They are
      obtained one by one from the mm free list via the regular allocation
      function.
      
      Host requests the guest to report free page hints by sending a new cmd id
      to the guest via the free_page_report_cmd_id configuration register. When
      the guest starts to report, it first sends a start cmd to host via the
      free page vq, which acks to host the cmd id received. When the guest
      finishes reporting free pages, a stop cmd is sent to host via the vq.
      Host may also send a stop cmd id to the guest to stop the reporting.
      
      VIRTIO_BALLOON_CMD_ID_STOP: Host sends this cmd to stop the guest reporting.
      VIRTIO_BALLOON_CMD_ID_DONE: Host sends this cmd to tell the guest that
      the reported pages are ready to be freed.
      
      Why does the guest free the reported pages when host tells it is ready to free?
      This is because freeing pages appears to be expensive for live migration.
      free_pages() dirties memory very quickly and makes the live migration not
      converge in some cases. So it is good to delay the free_page operation
      until the migration is done, and host sends a command to guest about that.
      
      Why do we need the new VIRTIO_BALLOON_CMD_ID_DONE, instead of reusing
      VIRTIO_BALLOON_CMD_ID_STOP?
      This is because live migration is usually done in several rounds. At the
      end of each round, host needs to send a VIRTIO_BALLOON_CMD_ID_STOP cmd to
      the guest to stop (or say pause) the reporting. The guest resumes the
      reporting when it receives a new command id at the beginning of the next
      round. So we need a new cmd id to distinguish between "stop reporting" and
      "ready to free the reported pages".
      
      TODO:
      - Add a batch page allocation API to amortize the allocation overhead.
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Signed-off-by: Liang Li <liang.z.li@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      0db17e11
  6. Mar 18, 2020 (20 commits)
    • mm: introduce MADV_PAGEOUT · 23757dcc
      Authored by Minchan Kim
      commit 1a4e58cce84ee88129d5d49c064bd2852b481357 upstream
      
      When a process expects no accesses to a certain memory range for a long
      time, it could hint kernel that the pages can be reclaimed instantly but
      data should be preserved for future use.  This could reduce workingset
      eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that kernel reclaims *any LRU*
      pages instantly.  The hint can help kernel in deciding which pages to
      evict proactively.
      
      A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
      intentionally because it's automatically bounded by PMD size.  If PMD
      size (e.g., 256) causes trouble, we could fix it later by limiting it to
      SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
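
      A minimal usage sketch (not from the original patch; the fallback define
      assumes the value later exposed in the uapi headers):

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        #ifndef MADV_PAGEOUT
        #define MADV_PAGEOUT 21
        #endif

        int main(void)
        {
            size_t len = 64UL << 20;        /* 64 MiB */
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            memset(buf, 0xab, len);                 /* touch the pages */
            if (madvise(buf, len, MADV_PAGEOUT))    /* reclaim now, keep contents */
                perror("madvise(MADV_PAGEOUT)");

            /* A later access faults the data back in (major fault), unchanged. */
            printf("first byte after pageout: 0x%02x\n", (unsigned char)buf[0]);
            munmap(buf, len);
            return 0;
        }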
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      23757dcc
    • mm: introduce MADV_COLD · 1af766e8
      Authored by Minchan Kim
      commit 9c276cc65a58faf98be8e56962745ec99ab87636 upstream
      
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidates for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed (we measured performance of swapping out vs.
      zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
      faster even though we use zram, which is much faster than real
      storage), so a kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduces two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for all LRU pages like MADV_[DONTNEED|FREE]. IOW, it moves
      
      	active file page -> inactive file LRU
      	active anon page -> inactive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to the head of
      the inactive file LRU, because MADV_COLD has slightly different semantics.
      MADV_FREE means it's okay to discard the pages under memory pressure
      because their contents are *garbage*, so freeing such pages is almost
      zero overhead (no swap-out is needed, and a later access causes only a
      minor fault).  Thus it makes sense to put those freeable pages on the
      inactive file LRU to compete with other used-once pages.  It also makes
      sense from an implementation point of view, because they are no longer
      swap-backed memory until re-dirtied, and as a bonus they can still be
      reclaimed on a swapless system.  MADV_COLD pages, however, are not
      garbage, so reclaiming them eventually requires swap-out and swap-in,
      which is a bigger cost.  Since VM LRU aging is designed around a cost
      model, anonymous cold pages are better placed on the inactive anon LRU
      list, not the file LRU.  This also helps avoid unnecessary scanning when
      the system doesn't have a swap device.  Let's start with the simpler
      approach without adding complexity at this moment.  Keep in mind the
      caveat, though, that workloads with a lot of page cache are likely to see
      MADV_COLD ignored on anonymous memory because we rarely age the anonymous
      LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
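
      A minimal sketch of issuing the hint (the fallback define assumes the uapi
      value; addr/len describe an existing mapping):

        #include <sys/mman.h>

        #ifndef MADV_COLD
        #define MADV_COLD 20
        #endif

        /* Deactivate a region so it is reclaimed early under memory pressure;
         * its contents are preserved. */
        static int mark_region_cold(void *addr, size_t len)
        {
            return madvise(addr, len, MADV_COLD);
        }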
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      1af766e8
    • io_uring: add support for backlogged CQ ring · b43c8385
      Authored by Jens Axboe
      commit 1d7bb1d50fb4dc141c7431cc21fdd24ffcc83c76 upstream.
      
      Currently we drop completion events, if the CQ ring is full. That's fine
      for requests with bounded completion times, but it may make it harder or
      impossible to use io_uring with networked IO where request completion
      times are generally unbounded. Or with POLL, for example, which is also
      unbounded.
      
      After this patch, we never overflow the ring, we simply store requests
      in a backlog for later flushing. This flushing is done automatically by
      the kernel. To prevent the backlog from growing indefinitely, if the
      backlog is non-empty, we apply back pressure on IO submissions. Any
      attempt to submit new IO with a non-empty backlog will get an -EBUSY
      return from the kernel. This is a signal to the application that it has
      backlogged CQ events, and that it must reap those before being allowed
      to submit more IO.
      
      Note that if we do return -EBUSY, we will have filled whatever
      backlogged events into the CQ ring first, if there's room. This means
      the application can safely reap events WITHOUT entering the kernel and
      waiting for them, they are already available in the CQ ring.
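
      A sketch of how an application might react to the new -EBUSY back pressure,
      assuming liburing-style helpers:

        #include <errno.h>
        #include <liburing.h>

        static int submit_with_backpressure(struct io_uring *ring)
        {
            int ret = io_uring_submit(ring);

            while (ret == -EBUSY) {
                struct io_uring_cqe *cqe;

                /* Reap what the kernel already flushed into the CQ ring. */
                while (io_uring_peek_cqe(ring, &cqe) == 0)
                    io_uring_cqe_seen(ring, cqe);

                ret = io_uring_submit(ring);    /* retry with room available */
            }
            return ret;
        }
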
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b43c8385
    • io_uring: add support for linked SQE timeouts · d4be78c6
      Authored by Jens Axboe
      commit 2665abfd757fb35a241c6f0b1ebf620e3ffb36fb upstream.
      
      While we have support for generic timeouts, we don't have a way to tie
      a timeout to a specific SQE. The generic timeouts simply trigger wakeups
      on the CQ ring.
      
      This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
      as a link to a previous command. The timeout specified can be either
      relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
      the timeout triggers before the dependent command completes, it will
      attempt to cancel that command. Likewise, if the dependent command
      completes before the timeout triggers, it will cancel the timeout.
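
      A sketch of the intended usage, assuming liburing-style helpers
      (io_uring_prep_link_timeout is the userspace wrapper, not part of this
      kernel patch):

        #include <liburing.h>

        static void queue_read_with_timeout(struct io_uring *ring, int fd,
                                            struct iovec *iov)
        {
            static struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
            struct io_uring_sqe *sqe;

            sqe = io_uring_get_sqe(ring);
            io_uring_prep_readv(sqe, fd, iov, 1, 0);
            io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK); /* next sqe depends on this */

            sqe = io_uring_get_sqe(ring);
            io_uring_prep_link_timeout(sqe, &ts, 0);    /* relative 1s timeout */

            io_uring_submit(ring);
        }
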
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      d4be78c6
    • io_uring: support for generic async request cancel · e025ee0c
      Authored by Jens Axboe
      commit 62755e35dfb2b113c52b81cd96d01c20971c8e02 upstream.
      
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
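
      A sketch that fills the sqe fields by hand as described above (the
      user_data values are illustrative):

        #include <liburing.h>
        #include <string.h>

        static void queue_cancel(struct io_uring *ring, __u64 target_user_data)
        {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode = IORING_OP_ASYNC_CANCEL;
            sqe->fd = -1;
            sqe->addr = target_user_data;   /* user_data of the request to cancel */
            sqe->user_data = 0xdead;        /* identifies the cancel request itself */

            io_uring_submit(ring);
            /* Completion: 0 = cancelled, -EALREADY = already running,
             * -ENOENT = no such request, as described above. */
        }
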
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e025ee0c
    • io_uring: add support for IORING_OP_ACCEPT · bdc2063b
      Authored by Jens Axboe
      commit 17f2fe35d080d8f64e86a60cdcd3a97edcbc213b upstream.
      
      This allows an application to call accept4() in an async fashion. Like
      other opcodes, we first try a non-blocking accept, then punt to async
      context if we have to.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      bdc2063b
    • io_uring: add support for canceling timeout requests · f39f9b94
      Authored by Jens Axboe
      commit 11365043e5271fea4c92189a976833da477a3a44 upstream.
      
      We might have cases where the need for a specific timeout is gone, add
      support for canceling an existing timeout operation. This works like the
      POLL_REMOVE command, where the application passes in the user_data of
      the timeout it wishes to cancel in the sqe->addr field.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f39f9b94
    • io_uring: add support for absolute timeouts · 17da935f
      Authored by Jens Axboe
      commit a41525ab2e75987e809926352ebc6f1397da900e upstream.
      
      This is a pretty trivial addition on top of the relative timeouts
      we have now, but it's handy for ensuring tighter timing for those
      that are building scheduling primitives on top of io_uring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      17da935f
    • io_uring: allow application controlled CQ ring size · 29635426
      Authored by Jens Axboe
      commit 33a107f0a1b8df0ad925e39d8afc97bb78e0cec1 upstream.
      
      We currently size the CQ ring as twice the SQ ring, to allow some
      flexibility in not overflowing the CQ ring. This is done because the
      SQE lifetime is different than that of the IO request itself; the SQE
      is consumed as soon as the kernel has seen the entry.
      
      Certain applications don't need a huge SQ ring size, since they just
      submit IO in batches. But they may have a lot of requests pending, and
      hence need a big CQ ring to hold them all. By allowing the application
      to control the CQ ring size multiplier, we can cater to those
      applications more efficiently.
      
      If an application wants to define its own CQ ring size, it must set
      IORING_SETUP_CQSIZE in the setup flags, and fill out
      io_uring_params->cq_entries. The value must be a power of two.
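
      A sketch of requesting a small SQ ring with a large CQ ring, assuming
      liburing's io_uring_queue_init_params wrapper:

        #include <liburing.h>
        #include <string.h>

        static int setup_ring(struct io_uring *ring)
        {
            struct io_uring_params p;

            memset(&p, 0, sizeof(p));
            p.flags = IORING_SETUP_CQSIZE;
            p.cq_entries = 4096;            /* must be a power of two */

            return io_uring_queue_init_params(64, ring, &p);   /* 64 SQ entries */
        }
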
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      29635426
    • io_uring: add support for IORING_REGISTER_FILES_UPDATE · 54633bb3
      Authored by Jens Axboe
      commit c3a31e605620c279163c14068a60869ea3fda203 upstream.
      
      Allows the application to remove/replace/add files to/from a file set.
      Passes in a struct:
      
      struct io_uring_files_update {
      	__u32 offset;
      	__s32 *fds;
      };
      
      that holds an array of fds, size of array passed in through the usual
      nr_args part of the io_uring_register() system call. The logic is as
      follows:
      
      1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
         the set.
      2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
         replaced with ->fds[i].
      
      For case #2, if the existing file is currently empty (fd == -1), the
      new fd is simply added to the array.
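
      A sketch using liburing's wrapper for this registration opcode (the wrapper
      itself is not part of this patch):

        #include <liburing.h>

        static int swap_and_remove(struct io_uring *ring, int new_fd)
        {
            int fds[2] = { new_fd, -1 };    /* replace one slot, remove the next */

            /* offset = 2: operate on indices 2 and 3 of the registered set */
            return io_uring_register_files_update(ring, 2, fds, 2);
        }
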
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      54633bb3
    • io_uring: IORING_OP_TIMEOUT support · 94601f2b
      Authored by Jens Axboe
      commit 5262f567987d3c30052b22e78c35c2313d07b230 upstream.
      
      There's been a few requests for functionality similar to io_getevents()
      and epoll_wait(), where the user can specify a timeout for waiting on
      events. I deliberately did not add support for this through the system
      call initially to avoid overloading the args, but I can see that the use
      cases for this are valid.
      
      This adds support for IORING_OP_TIMEOUT. If a user wants to get woken
      when waiting for events, simply submit one of these timeout commands
      with your wait call (or before). This ensures that the application
      sleeping on the CQ ring waiting for events will get woken. The timeout
      command is passed in as a pointer to a struct timespec. Timeouts are
      relative. The timeout command also includes a way to auto-cancel after
      N events have passed.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      94601f2b
    • io_uring: expose single mmap capability · c15b5945
      Authored by Jens Axboe
      commit ac90f249e15cd2a850daa9e36e15f81ce1ff6550 upstream.
      
      After commit 75b28affdd6a we can get by with just a single mmap to
      map both the sq and cq ring. However, userspace doesn't know that.
      
      Add a features variable to io_uring_params, and notify userspace
      that the kernel has this ability. This can then be used in liburing
      (or in applications directly) to avoid the second mmap.
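
      A sketch of how userspace might probe for the feature with the raw setup
      syscall (the fallback syscall number is the generic one):

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/io_uring.h>

        #ifndef __NR_io_uring_setup
        #define __NR_io_uring_setup 425
        #endif

        int main(void)
        {
            struct io_uring_params p;

            memset(&p, 0, sizeof(p));
            int fd = syscall(__NR_io_uring_setup, 8, &p);
            if (fd < 0)
                return 1;

            if (p.features & IORING_FEAT_SINGLE_MMAP)
                printf("sq and cq rings can be mapped with one mmap()\n");
            else
                printf("older kernel: map SQ and CQ rings separately\n");

            close(fd);
            return 0;
        }
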
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      c15b5945
    • io_uring: add support for recvmsg() · 3962b3d0
      Authored by Jens Axboe
      commit aa1fa28fc73ea6b740ee7b62bf3b07141883dbb8 upstream.
      
      This is done through IORING_OP_RECVMSG. This opcode uses the same
      sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
      msghdr struct in the sqe->addr field as well.
      
      We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3962b3d0
    • io_uring: add support for sendmsg() · 0cb8acf9
      Authored by Jens Axboe
      commit 0fa03c624d8fc9932d0f27c39a9deca6a37e0e17 upstream.
      
      This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
      for the flags argument, and the msghdr struct is passed in the
      sqe->addr field.
      
      We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0cb8acf9
    • io_uring: add support for sqe links · fda445b3
      Authored by Jens Axboe
      commit 9e645e1105ca60fbbc6bddf2fd5ef7e57ed3dca8 upstream.
      
      With SQE links, we can create chains of dependent SQEs. One example
      would be queueing an SQE that's a read from one file descriptor, with
      the linked SQE being a write to another with the same set of buffers.
      
      An SQE link will not stall the pipeline, it'll just ensure that
      dependent SQEs aren't issued before the previous link has completed.
      
      Any error at submission or completion time will break the chain of SQEs.
      For completions, this also includes short reads or writes, as the next
      SQE could depend on the previous one being fully completed.
      
      Any SQE in a chain that gets canceled due to any of the above errors,
      will get a CQE filled with -ECANCELED as the error value.
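
      A sketch of the read-then-write chain described above, assuming
      liburing-style helpers:

        #include <liburing.h>

        static void queue_copy(struct io_uring *ring, int src_fd, int dst_fd,
                               struct iovec *iov)
        {
            struct io_uring_sqe *sqe;

            sqe = io_uring_get_sqe(ring);
            io_uring_prep_readv(sqe, src_fd, iov, 1, 0);
            io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK); /* chain to the next sqe */

            sqe = io_uring_get_sqe(ring);
            io_uring_prep_writev(sqe, dst_fd, iov, 1, 0);

            io_uring_submit(ring);
            /* If the read fails or is short, the write completes with -ECANCELED. */
        }
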
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      fda445b3
    • io_uring: add support for eventfd notifications · 3707251b
      Authored by Jens Axboe
      commit 9b402849e80c85eee10bbd341aab3f1a0f942d4f upstream.
      
      Allow registration of an eventfd, which will trigger an event every
      time a completion event happens for this io_uring instance.
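
      A sketch of wiring this up, assuming liburing's io_uring_register_eventfd
      wrapper:

        #include <liburing.h>
        #include <stdint.h>
        #include <sys/eventfd.h>
        #include <unistd.h>

        static int watch_completions(struct io_uring *ring)
        {
            int efd = eventfd(0, 0);

            if (efd < 0 || io_uring_register_eventfd(ring, efd) < 0)
                return -1;

            /* The eventfd can now be added to poll/epoll loops; this blocking
             * read waits for the next completion and drains the counter. */
            uint64_t n;
            read(efd, &n, sizeof(n));
            return efd;
        }
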
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3707251b
    • io_uring: add support for IORING_OP_SYNC_FILE_RANGE · 2c12f33e
      Authored by Jens Axboe
      commit 5d17b4a4b7fa172b205be8a05051ae705d1dc3bb upstream.
      
      This behaves just like sync_file_range(2) does.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2c12f33e
    • io_uring: add support for marking commands as draining · b52f2397
      Authored by Jens Axboe
      commit de0617e467171ba44c73efd1ba63f101b164a035 upstream.
      
      There are no ordering constraints between the submission and completion
      side of io_uring. But sometimes that would be useful to have. One common
      example is doing an fsync, for instance, and have it ordered with
      previous writes. Without support for that, the application must do this
      tracking itself.
      
      This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked
      with this flag, then it will not be issued before previous commands have
      completed, and subsequent commands submitted after the drain will not be
      issued before the drain is started. If there are no pending commands,
      setting this flag will not change the behavior of issuing the
      command.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b52f2397
    • tcp: Add snd_wnd to TCP_INFO · ecee8235
      Authored by Thomas Higdon
      commit 8f7baad7f03543451af27f5380fc816b008aa1f2 upstream
      
      Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
      performance problems --
      > (1) Usually when we're diagnosing TCP performance problems, we do so
      > from the sender, since the sender makes most of the
      > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
      > From the sender-side the thing that would be most useful is to see
      > tp->snd_wnd, the receive window that the receiver has advertised to
      > the sender.
      
      This serves the purpose of adding an additional __u32 to avoid the
      would-be hole caused by the addition of the tcpi_rcv_ooopack field.
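
      A sketch of reading the new field from the sender side (assumes uapi
      headers that already carry tcpi_snd_wnd):

        #include <netinet/in.h>
        #include <stdio.h>
        #include <sys/socket.h>
        #include <linux/tcp.h>

        static void print_snd_wnd(int sock)
        {
            struct tcp_info ti;
            socklen_t len = sizeof(ti);

            if (getsockopt(sock, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
                printf("peer-advertised window: %u bytes\n", ti.tcpi_snd_wnd);
        }
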
      Signed-off-by: Thomas Higdon <tph@fb.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Acked-by: Dust Li <dust.li@linux.alibaba.com>
      ecee8235
    • tcp: Add TCP_INFO counter for packets received out-of-order · 0107d737
      Authored by Thomas Higdon
      commit f9af2dbbfe01def62765a58af7fbc488351893c3 upstream
      
      For receive-heavy cases on the server-side, we want to track the
      connection quality for individual client IPs. This counter, similar to
      the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
      tracks out-of-order packet reception. By providing this counter in
      TCP_INFO, it will allow understanding to what degree receive-heavy
      sockets are experiencing out-of-order delivery and packet drops
      indicating congestion.
      
      Please note that this is similar to the counter in NetBSD TCP_INFO, and
      has the same name.
      
      Also note that we avoid increasing the size of the tcp_sock struct by
      taking advantage of a hole.
      Signed-off-by: Thomas Higdon <tph@fb.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Acked-by: Dust Li <dust.li@linux.alibaba.com>
      0107d737
  7. Jan 17, 2020 (7 commits)
    • io_uring: add support for IORING_OP_POLL · 51de0e8f
      Authored by Jens Axboe
      commit 221c5eb2338232f7340386de1c43decc32682e58 upstream.
      
      This is basically a direct port of bfe4037e, which implements a
      one-shot poll command through aio. Description below is based on that
      commit as well. However, instead of adding a POLL command and relying
      on io_cancel(2) to remove it, we mimic the epoll(2) interface of
      having a command to add a poll notification, IORING_OP_POLL_ADD,
      and one to remove it again, IORING_OP_POLL_REMOVE.
      
      To poll for a file descriptor the application should submit an sqe of
      type IORING_OP_POLL_ADD. It will poll the fd for the events specified in the
      poll_events field.
      
      Unlike poll or epoll without EPOLLONESHOT, this interface always works in
      one-shot mode; that is, once the sqe is completed, it will have to be
      resubmitted.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Based-on-code-from: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      51de0e8f
    • io_uring: add submission polling · aa124ba8
      Authored by Jens Axboe
      commit 6c271ce2f1d572f7fa225700a13cfe7ced492434 upstream.
      
      This enables an application to do IO, without ever entering the kernel.
      By using the SQ ring to fill in new sqes and watching for completions
      on the CQ ring, we can submit and reap IOs without doing a single system
      call. The kernel side thread will poll for new submissions, and in case
      of HIPRI/polled IO, it'll also poll for completions.
      
      By default, we allow 1 second of active spinning. This can be changed
      by passing in a different grace period at io_uring_register(2) time.
      If the thread exceeds this idle time without having any work to do, it
      will set:
      
      sq_ring->flags |= IORING_SQ_NEED_WAKEUP.
      
      The application will have to call io_uring_enter() to start things back
      up again. If IO is kept busy, that will never be needed. Basically an
      application that has this feature enabled will guard its
      io_uring_enter(2) call with:
      
      read_barrier();
      if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
      	io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
      
      instead of calling it unconditionally.
      
      It's mandatory to use fixed files with this feature. Failure to do so
      will result in the application getting an -EBADF CQ entry when
      submitting IO.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      aa124ba8
    • io_uring: add file set registration · 7bfbdad6
      Authored by Jens Axboe
      commit 6b06314c47e141031be043539900d80d2c7ba10f upstream.
      
      We normally have to fget/fput for each IO we do on a file. Even with
      the batching we do, the cost of the atomic inc/dec of the file usage
      count adds up.
      
      This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
      for the io_uring_register(2) system call. The arguments passed in must
      be an array of __s32 holding file descriptors, and nr_args should hold
      the number of file descriptors the application wishes to pin for the
      duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
      called).
      
      When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
      member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
      to the index in the array passed in to IORING_REGISTER_FILES.
      
      Files are automatically unregistered when the io_uring instance is torn
      down. An application need only unregister if it wishes to register a new
      set of fds.
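
      A sketch of registering a one-entry file set and issuing IO against index 0
      of that set, assuming liburing-style wrappers:

        #include <liburing.h>

        static void read_from_fixed_file(struct io_uring *ring, int fd,
                                         struct iovec *iov)
        {
            int fds[1] = { fd };
            struct io_uring_sqe *sqe;

            io_uring_register_files(ring, fds, 1);      /* IORING_REGISTER_FILES */

            sqe = io_uring_get_sqe(ring);
            io_uring_prep_readv(sqe, 0 /* index into the set */, iov, 1, 0);
            io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE);

            io_uring_submit(ring);
        }
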
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      7bfbdad6
    • io_uring: add support for pre-mapped user IO buffers · a078ed69
      Authored by Jens Axboe
      commit edafccee56ff31678a091ddb7219aba9b28bc3cb upstream.
      
      If we have fixed user buffers, we can map them into the kernel when we
      setup the io_uring. That avoids the need to do get_user_pages() for
      each and every IO.
      
      To utilize this feature, the application must call io_uring_register()
      after having setup an io_uring instance, passing in
      IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
      an iovec array, and the nr_args should contain how many iovecs the
      application wishes to map.
      
      If successful, these buffers are now mapped into the kernel, eligible
      for IO. To use these fixed buffers, the application must use the
      IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
      set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
      must point to somewhere inside the indexed buffer.
      
      The application may register buffers throughout the lifetime of the
      io_uring instance. It can call io_uring_register() with
      IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
      buffers, and then register a new set. The application need not
      unregister buffers explicitly before shutting down the io_uring
      instance.
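
      A sketch of registering a single buffer and reading into it with
      IORING_OP_READ_FIXED, assuming liburing-style wrappers:

        #include <liburing.h>

        static void read_into_fixed_buffer(struct io_uring *ring, int fd)
        {
            static char buf[65536];
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
            struct io_uring_sqe *sqe;

            io_uring_register_buffers(ring, &iov, 1);   /* IORING_REGISTER_BUFFERS */

            sqe = io_uring_get_sqe(ring);
            io_uring_prep_read_fixed(sqe, fd, buf, sizeof(buf), 0 /* offset */,
                                     0 /* buffer index */);
            io_uring_submit(ring);
        }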
      
      It's perfectly valid to setup a larger buffer, and then sometimes only
      use parts of it for an IO. As long as the range is within the originally
      mapped region, it will work just fine.
      
      For now, buffers must not be file backed. If file backed buffers are
      passed in, the registration will fail with -1/EOPNOTSUPP. This
      restriction may be relaxed in the future.
      
      RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
      arbitrary 1G per buffer size is also imposed.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      a078ed69
    • io_uring: support for IO polling · c3440f68
      Authored by Jens Axboe
      commit def596e9557c91d9846fc4d84d26f2c564644416 upstream.
      
      Add support for a polled io_uring instance. When a read or write is
      submitted to a polled io_uring, the application must poll for
      completions on the CQ ring through io_uring_enter(2). Polled IO may not
      generate IRQ completions, hence they need to be actively found by the
      application itself.
      
      To use polling, io_uring_setup() must be used with the
      IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
      polled and non-polled IO on an io_uring.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      c3440f68
    • io_uring: add fsync support · cb0d3740
      Authored by Christoph Hellwig
      commit c992fe2925d776be066d9f6cc13f9ea11d78b657 upstream.
      
      Add a new fsync opcode, which either syncs a range if one is passed,
      or the whole file if the offset and length fields are both cleared
      to zero.  A flag is provided to use fdatasync semantics, that is, only
      force out the metadata which is required to retrieve the file data, but
      not other metadata.
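
      A sketch of queueing an fdatasync-style flush of the whole file, assuming
      liburing's io_uring_prep_fsync helper:

        #include <liburing.h>

        static void queue_fdatasync(struct io_uring *ring, int fd)
        {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            /* offset/len left at zero => sync the whole file */
            io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
            io_uring_submit(ring);
        }
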
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      cb0d3740
    • Add io_uring IO interface · 209d771f
      Authored by Jens Axboe
      commit 2b188cc1bb857a9d4701ae59aa7768b5124e262e upstream.
      
      The submission queue (SQ) and completion queue (CQ) rings are shared
      between the application and the kernel. This eliminates the need to
      copy data back and forth to submit and complete IO.
      
      IO submissions use the io_uring_sqe data structure, and completions
      are generated in the form of io_uring_cqe data structures. The SQ
      ring is an index into the io_uring_sqe array, which makes it possible
      to submit a batch of IOs without them being contiguous in the ring.
      The CQ ring is always contiguous, as completion events are inherently
      unordered, and hence any io_uring_cqe entry can point back to an
      arbitrary submission.
      
      Two new system calls are added for this:
      
      io_uring_setup(entries, params)
      	Sets up an io_uring instance for doing async IO. On success,
      	returns a file descriptor that the application can mmap to
      	gain access to the SQ ring, CQ ring, and io_uring_sqes.
      
      io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
      	Initiates IO against the rings mapped to this fd, or waits for
      	them to complete, or both. The behavior is controlled by the
      	parameters passed in. If 'to_submit' is non-zero, then we'll
      	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
      	kernel will wait for 'min_complete' events, if they aren't
      	already available. It's valid to set IORING_ENTER_GETEVENTS
      	and 'min_complete' == 0 at the same time, this allows the
      	kernel to return already completed events without waiting
      	for them. This is useful only for polling, as for IRQ
      	driven IO, the application can just check the CQ ring
      	without entering the kernel.
      
      With this setup, it's possible to do async IO with a single system
      call. Future developments will enable polled IO with this interface,
      and polled submission as well. The latter will enable an application
      to do IO without doing ANY system calls at all.
      
      For IRQ driven IO, an application only needs to enter the kernel for
      completions if it wants to wait for them to occur.
      
      Each io_uring is backed by a workqueue, to support buffered async IO
      as well. We will only punt to an async context if the command would
      need to wait for IO on the device side. Any data that can be accessed
      directly in the page cache is done inline. This avoids the slowness
      issue of usual threadpools, since cached data is accessed as quickly
      as a sync interface.
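
      A sketch of the minimal flow over these two syscalls, using liburing for
      the setup and ring bookkeeping (the file name is illustrative):

        #include <fcntl.h>
        #include <liburing.h>
        #include <stdio.h>

        int main(void)
        {
            struct io_uring ring;
            struct io_uring_cqe *cqe;
            char buf[4096];
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
            int fd = open("/etc/hostname", O_RDONLY);

            io_uring_queue_init(8, &ring, 0);       /* io_uring_setup + mmap */

            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_readv(sqe, fd, &iov, 1, 0);

            io_uring_submit(&ring);                 /* io_uring_enter(to_submit=1) */
            io_uring_wait_cqe(&ring, &cqe);         /* IORING_ENTER_GETEVENTS */
            printf("read returned %d\n", cqe->res);

            io_uring_cqe_seen(&ring, cqe);
            io_uring_queue_exit(&ring);
            return 0;
        }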
      
      Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      209d771f