1. 28 5月, 2020 21 次提交
  2. 27 5月, 2020 5 次提交
  3. 30 4月, 2020 2 次提交
  4. 26 4月, 2020 1 次提交
  5. 13 4月, 2020 3 次提交
    • A
      virtio-balloon: add support for providing free page reports to host · 671b4f34
      Alexander Duyck 提交于
      to #26589565
      
      Add support for the page reporting feature provided by virtio-balloon.
      Reporting differs from the regular balloon functionality in that is is
      much less durable than a standard memory balloon.  Instead of creating a
      list of pages that cannot be accessed the pages are only inaccessible
      while they are being indicated to the virtio interface.  Once the
      interface has acknowledged them they are placed back into their respective
      free lists and are once again accessible by the guest system.
      
      Unlike a standard balloon we don't inflate and deflate the pages.  Instead
      we perform the reporting, and once the reporting is completed it is
      assumed that the page has been dropped from the guest and will be faulted
      back in the next time the page is accessed.
      
      Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      671b4f34
    • W
      virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON · 67595fa8
      Wei Wang 提交于
      to #26589565
      
      commit 2e991629bcf55a43681aec1ee096eeb03cf81709 upstream
      
      The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate if the
      guest is using page poisoning. Guest writes to the poison_val config
      field to tell host about the page poisoning value that is in use.
      Suggested-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NWei Wang <wei.w.wang@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      67595fa8
    • W
      virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT · 0db17e11
      Wei Wang 提交于
      to #26589565
      
      commit 86a559787e6f5cf662c081363f64a20cad654195 upstream
      
      Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
      support of reporting hints of guest free pages to host via virtio-balloon.
      Currenlty, only free page blocks of MAX_ORDER - 1 are reported. They are
      obtained one by one from the mm free list via the regular allocation
      function.
      
      Host requests the guest to report free page hints by sending a new cmd id
      to the guest via the free_page_report_cmd_id configuration register. When
      the guest starts to report, it first sends a start cmd to host via the
      free page vq, which acks to host the cmd id received. When the guest
      finishes reporting free pages, a stop cmd is sent to host via the vq.
      Host may also send a stop cmd id to the guest to stop the reporting.
      
      VIRTIO_BALLOON_CMD_ID_STOP: Host sends this cmd to stop the guest reporting.
      VIRTIO_BALLOON_CMD_ID_DONE: Host sends this cmd to tell the guest that
      the reported pages are ready to be freed.
      
      Why does the guest free the reported pages when host tells it is ready to free?
      This is because freeing pages appears to be expensive for live migration.
      free_pages() dirties memory very quickly and makes the live migraion not
      converge in some cases. So it is good to delay the free_page operation
      when the migration is done, and host sends a command to guest about that.
      
      Why do we need the new VIRTIO_BALLOON_CMD_ID_DONE, instead of reusing
      VIRTIO_BALLOON_CMD_ID_STOP?
      This is because live migration is usually done in several rounds. At the
      end of each round, host needs to send a VIRTIO_BALLOON_CMD_ID_STOP cmd to
      the guest to stop (or say pause) the reporting. The guest resumes the
      reporting when it receives a new command id at the beginning of the next
      round. So we need a new cmd id to distinguish between "stop reporting" and
      "ready to free the reported pages".
      
      TODO:
      - Add a batch page allocation API to amortize the allocation overhead.
      Signed-off-by: NWei Wang <wei.w.wang@intel.com>
      Signed-off-by: NLiang Li <liang.z.li@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      0db17e11
  6. 18 3月, 2020 8 次提交
    • J
      io_uring: add support for backlogged CQ ring · b43c8385
      Jens Axboe 提交于
      commit 1d7bb1d50fb4dc141c7431cc21fdd24ffcc83c76 upstream.
      
      Currently we drop completion events, if the CQ ring is full. That's fine
      for requests with bounded completion times, but it may make it harder or
      impossible to use io_uring with networked IO where request completion
      times are generally unbounded. Or with POLL, for example, which is also
      unbounded.
      
      After this patch, we never overflow the ring, we simply store requests
      in a backlog for later flushing. This flushing is done automatically by
      the kernel. To prevent the backlog from growing indefinitely, if the
      backlog is non-empty, we apply back pressure on IO submissions. Any
      attempt to submit new IO with a non-empty backlog will get an -EBUSY
      return from the kernel. This is a signal to the application that it has
      backlogged CQ events, and that it must reap those before being allowed
      to submit more IO.
      
      Note that if we do return -EBUSY, we will have filled whatever
      backlogged events into the CQ ring first, if there's room. This means
      the application can safely reap events WITHOUT entering the kernel and
      waiting for them, they are already available in the CQ ring.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      b43c8385
    • J
      io_uring: add support for linked SQE timeouts · d4be78c6
      Jens Axboe 提交于
      commit 2665abfd757fb35a241c6f0b1ebf620e3ffb36fb upstream.
      
      While we have support for generic timeouts, we don't have a way to tie
      a timeout to a specific SQE. The generic timeouts simply trigger wakeups
      on the CQ ring.
      
      This adds support for IORING_OP_LINK_TIMEOUT. This command is only valid
      as a link to a previous command. The timeout specific can be either
      relative or absolute, following the same rules as IORING_OP_TIMEOUT. If
      the timeout triggers before the dependent command completes, it will
      attempt to cancel that command. Likewise, if the dependent command
      completes before the timeout triggers, it will cancel the timeout.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      d4be78c6
    • J
      io_uring: support for generic async request cancel · e025ee0c
      Jens Axboe 提交于
      commit 62755e35dfb2b113c52b81cd96d01c20971c8e02 upstream.
      
      This adds support for IORING_OP_ASYNC_CANCEL, which will attempt to
      cancel requests that have been punted to async context and are now
      in-flight. This works for regular read/write requests to files, as
      long as they haven't been started yet. For socket based IO (or things
      like accept4(2)), we can cancel work that is already running as well.
      
      To cancel a request, the sqe must have ->addr set to the user_data of
      the request it wishes to cancel. If the request is cancelled
      successfully, the original request is completed with -ECANCELED
      and the cancel request is completed with a result of 0. If the
      request was already running, the original may or may not complete
      in error. The cancel request will complete with -EALREADY for that
      case. And finally, if the request to cancel wasn't found, the cancel
      request is completed with -ENOENT.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e025ee0c
    • J
      io_uring: add support for IORING_OP_ACCEPT · bdc2063b
      Jens Axboe 提交于
      commit 17f2fe35d080d8f64e86a60cdcd3a97edcbc213b upstream.
      
      This allows an application to call accept4() in an async fashion. Like
      other opcodes, we first try a non-blocking accept, then punt to async
      context if we have to.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      bdc2063b
    • J
      io_uring: add support for canceling timeout requests · f39f9b94
      Jens Axboe 提交于
      commit 11365043e5271fea4c92189a976833da477a3a44 upstream.
      
      We might have cases where the need for a specific timeout is gone, add
      support for canceling an existing timeout operation. This works like the
      POLL_REMOVE command, where the application passes in the user_data of
      the timeout it wishes to cancel in the sqe->addr field.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f39f9b94
    • J
      io_uring: add support for absolute timeouts · 17da935f
      Jens Axboe 提交于
      commit a41525ab2e75987e809926352ebc6f1397da900e upstream.
      
      This is a pretty trivial addition on top of the relative timeouts
      we have now, but it's handy for ensuring tighter timing for those
      that are building scheduling primitives on top of io_uring.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      17da935f
    • J
      io_uring: allow application controlled CQ ring size · 29635426
      Jens Axboe 提交于
      commit 33a107f0a1b8df0ad925e39d8afc97bb78e0cec1 upstream.
      
      We currently size the CQ ring as twice the SQ ring, to allow some
      flexibility in not overflowing the CQ ring. This is done because the
      SQE life time is different than that of the IO request itself, the SQE
      is consumed as soon as the kernel has seen the entry.
      
      Certain application don't need a huge SQ ring size, since they just
      submit IO in batches. But they may have a lot of requests pending, and
      hence need a big CQ ring to hold them all. By allowing the application
      to control the CQ ring size multiplier, we can cater to those
      applications more efficiently.
      
      If an application wants to define its own CQ ring size, it must set
      IORING_SETUP_CQSIZE in the setup flags, and fill out
      io_uring_params->cq_entries. The value must be a power of two.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      29635426
    • J
      io_uring: add support for IORING_REGISTER_FILES_UPDATE · 54633bb3
      Jens Axboe 提交于
      commit c3a31e605620c279163c14068a60869ea3fda203 upstream.
      
      Allows the application to remove/replace/add files to/from a file set.
      Passes in a struct:
      
      struct io_uring_files_update {
      	__u32 offset;
      	__s32 *fds;
      };
      
      that holds an array of fds, size of array passed in through the usual
      nr_args part of the io_uring_register() system call. The logic is as
      follows:
      
      1) If ->fds[i] is -1, the existing file at i + ->offset is removed from
         the set.
      2) If ->fds[i] is a valid fd, the existing file at i + ->offset is
         replaced with ->fds[i].
      
      For case #2, is the existing file is currently empty (fd == -1), the
      new fd is simply added to the array.
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      54633bb3