1. 15 November 2020 (6 commits)
    • KVM: X86: Implement ring-based dirty memory tracking · fb04a1ed
      Committed by Peter Xu
      This patch is heavily based on previous work from Lei Cao
      <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
      
      KVM currently uses large bitmaps to track dirty memory.  These bitmaps
      are copied to userspace when userspace queries KVM for its dirty page
      information.  The use of bitmaps is mostly sufficient for live
      migration, as large parts of memory are dirtied from one log-dirty
      pass to another.  However, in a checkpointing system, the number of
      dirty pages is small and in fact it is often bounded---the VM is
      paused when it has dirtied a pre-defined number of pages. Traversing a
      large, sparsely populated bitmap to find set bits is time-consuming,
      as is copying the bitmap to user-space.
      
      A similar issue arises for live migration when the guest memory is
      huge while only a small portion of the pages is dirtied.  In that
      case, each dirty sync still has to pull the whole dirty bitmap to
      userspace and analyse every bit even though it is mostly zeros.
      
      The preferred data structure for the above scenarios is a dense list of
      guest frame numbers (GFN).  This patch series stores the dirty list in
      kernel memory that can be memory mapped into userspace to allow speedy
      harvesting.
      
      This patch enables the dirty ring for X86 only.  However, it should
      be easy to extend to other architectures as well.
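
      As a rough sketch of how userspace consumes such a ring (the entry
      layout mirrors the kvm_dirty_gfn UAPI this series adds;
      collect_dirty_page() is a hypothetical callback, and memory-ordering
      details are elided):

            struct kvm_dirty_gfn {          /* one ring entry */
                    __u32 flags;            /* KVM_DIRTY_GFN_F_DIRTY / _F_RESET */
                    __u32 slot;             /* (as_id << 16) | slot_id */
                    __u64 offset;           /* page offset within the memslot */
            };

            /* Harvest entries from a ring mmap()ed from the vCPU fd. */
            static void harvest_ring(struct kvm_dirty_gfn *ring,
                                     __u32 *next, __u32 size)
            {
                    while (ring[*next % size].flags & KVM_DIRTY_GFN_F_DIRTY) {
                            struct kvm_dirty_gfn *e = &ring[*next % size];

                            collect_dirty_page(e->slot, e->offset); /* hypothetical */
                            e->flags |= KVM_DIRTY_GFN_F_RESET;  /* hand back to KVM */
                            (*next)++;
                    }
                    /* KVM recycles reset entries on the next
                     * KVM_RESET_DIRTY_RINGS ioctl. */
            }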
      
      [1] https://patchwork.kernel.org/patch/10471409/
      Signed-off-by: Lei Cao <lei.cao@stratus.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012222.5767-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Pass in kvm pointer into mark_page_dirty_in_slot() · 28bd726a
      Committed by Peter Xu
      The context will be needed to implement the kvm dirty ring.
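
      The change itself is mechanical; a sketch of the assumed
      before/after shape:

            /* Before (assumed shape): */
            static void mark_page_dirty_in_slot(struct kvm_memory_slot *memslot,
                                                gfn_t gfn);

            /* After: callers thread their struct kvm down, so the dirty
             * ring can later reach per-VM state from this helper. */
            static void mark_page_dirty_in_slot(struct kvm *kvm,
                                                struct kvm_memory_slot *memslot,
                                                gfn_t gfn);
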
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012044.5151-5-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: remove kvm_clear_guest_page · 2f541442
      Committed by Paolo Bonzini
      kvm_clear_guest_page is no longer used after "KVM: X86: Don't track
      dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]", except from
      kvm_clear_guest, so we can simply inline it into its sole user.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: allow KVM_GET_SUPPORTED_HV_CPUID as a system ioctl · c21d54f0
      Committed by Vitaly Kuznetsov
      KVM_GET_SUPPORTED_HV_CPUID is a vCPU ioctl, but its output is now
      independent of the vCPU, and in some cases VMMs may want to use it
      as a system ioctl instead.  In particular, QEMU does CPU feature
      expansion before any vCPU gets created, so KVM_GET_SUPPORTED_HV_CPUID
      can't be used at that point.
      
      Convert KVM_GET_SUPPORTED_HV_CPUID to a 'dual' system/vCPU ioctl
      with the same meaning.
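
      A userspace sketch of the resulting 'dual' usage, assuming the
      usual kvm_cpuid2 calling convention (vcpu_fd is a hypothetical,
      already-created vCPU descriptor):

            struct {
                    struct kvm_cpuid2 cpuid;
                    struct kvm_cpuid_entry2 entries[64];
            } buf = { .cpuid = { .nent = 64 } };

            int kvm_fd = open("/dev/kvm", O_RDWR);

            /* System form: works before any vCPU exists, e.g. for
             * QEMU's early CPU feature expansion. */
            ioctl(kvm_fd, KVM_GET_SUPPORTED_HV_CPUID, &buf.cpuid);

            /* vCPU form: unchanged, same output. */
            ioctl(vcpu_fd, KVM_GET_SUPPORTED_HV_CPUID, &buf.cpuid);
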
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200929150944.1235688-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • eventfd: Export eventfd_ctx_do_read() · 28f13267
      Committed by David Woodhouse
      Where events are consumed in the kernel, for example by KVM's
      irqfd_wakeup() and VFIO's virqfd_wakeup(), they currently lack a
      mechanism to drain the eventfd's counter.
      
      Since the wait queue is already locked while the wakeup functions are
      invoked, all they really need to do is call eventfd_ctx_do_read().
      
      Add a check for the lock, and export it for them.
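
      In a wakeup callback, the usage then reduces to something like this
      sketch (irqfd field names assumed from KVM's eventfd code):

            static int irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
                                    int sync, void *key)
            {
                    struct kvm_kernel_irqfd *irqfd =
                            container_of(wait, struct kvm_kernel_irqfd, wait);
                    __u64 cnt;

                    if (key_to_poll(key) & EPOLLIN) {
                            /* The queue lock is already held here, so
                             * draining the counter directly is safe. */
                            eventfd_ctx_do_read(irqfd->eventfd, &cnt);
                            /* ... inject the interrupt ... */
                    }
                    return 0;
            }
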
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20201027135523.646811-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • sched/wait: Add add_wait_queue_priority() · c4d51a52
      Committed by David Woodhouse
      This allows an exclusive wait_queue_entry to be added at the head of the
      queue, instead of the tail as normal. Thus, it gets to consume events
      first without allowing non-exclusive waiters to be woken at all.
      
      The (first) intended use is for KVM IRQFD, which currently has
      inconsistent behaviour depending on whether posted interrupts are
      available or not. If they are, KVM will bypass the eventfd completely
      and deliver interrupts directly to the appropriate vCPU. If not, events
      are delivered through the eventfd and userspace will receive them when
      polling on the eventfd.
      
      By using add_wait_queue_priority(), KVM will be able to consistently
      consume events within the kernel without accidentally exposing them
      to userspace when they're supposed to be bypassed. This, in turn, means
      that userspace doesn't have to jump through hoops to avoid listening
      on the erroneously noisy eventfd and injecting duplicate interrupts.
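
      Registration of such a priority waiter can be sketched as:

            /* add_wait_queue_priority() marks the entry exclusive and
             * priority (WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY) and links
             * it at the head of the queue, so __wake_up_common() runs it
             * first and can stop there. */
            init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
            add_wait_queue_priority(wqh, &irqfd->wait);
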
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20201027143944.648769-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 11 November 2020 (3 commits)
  3. 07 November 2020 (7 commits)
  4. 06 November 2020 (1 commit)
  5. 05 November 2020 (2 commits)
    • io_uring: properly handle SQPOLL request cancelations · fdaf083c
      Committed by Jens Axboe
      Track if a given task io_uring context contains SQPOLL instances, so we
      can iterate those for cancelation (and request counts). This ensures that
      we properly wait on SQPOLL contexts, and find everything that needs
      canceling.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • iomap: support partial page discard on writeback block mapping failure · 763e4cdc
      Committed by Brian Foster
      iomap writeback mapping failure only calls into ->discard_page() if
      the current page has not been added to the ioend. Accordingly, the
      XFS callback assumes a full page discard and invalidation. This is
      problematic for sub-page block size filesystems where some portion
      of a page might have been mapped successfully before a failure to
      map a delalloc block occurs. ->discard_page() is not called in that
      error scenario and the bio is explicitly failed by iomap via the
      error return from ->prepare_ioend(). As a result, the filesystem
      leaks delalloc blocks and corrupts the filesystem block counters.
      
      Since XFS is the only user of ->discard_page(), tweak the semantics
      to invoke the callback unconditionally on mapping errors and provide
      the file offset that failed to map. Update xfs_discard_page() to
      discard the corresponding portion of the file and pass the range
      along to iomap_invalidatepage(). The latter already properly handles
      both full and sub-page scenarios by not changing any iomap or page
      state on sub-page invalidations.
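
      The adjusted hook can be sketched as follows (shape inferred from
      the description, not the verbatim diff):

            struct iomap_writeback_ops {
                    /* ... */
                    /* Now invoked unconditionally on mapping failure,
                     * with the file offset of the block that failed to
                     * map; XFS punches out only the delalloc range from
                     * fileoff to the end of the page. */
                    void (*discard_page)(struct page *page, loff_t fileoff);
            };
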
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  6. 04 November 2020 (1 commit)
    • can: can_create_echo_skb(): fix echo skb generation: always use skb_clone() · 286228d3
      Committed by Oleksij Rempel
      All user space generated SKBs are owned by a socket (unless injected
      into the kernel via AF_PACKET).  If a socket is closed, all
      associated skbs will be cleaned up.
      
      This leads to a problem when a CAN driver calls can_put_echo_skb()
      on an unshared SKB.  If the socket is closed prior to the TX
      complete handler, can_get_echo_skb(), and the subsequent delivery
      of the echo SKB to all registered callbacks, an SKB with a refcount
      of 0 is delivered.
      
      To avoid the problem, in can_create_echo_skb() the original SKB is
      now always cloned, regardless of whether it is shared or not.  If
      the process exits, it can now safely discard its SKBs without
      disturbing the delivery of the echo SKB.
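
      The resulting helper can be sketched as follows (simplified from
      the description above, not the verbatim patch):

            struct sk_buff *can_create_echo_skb(struct sk_buff *skb)
            {
                    struct sk_buff *nskb;

                    /* Always hand the driver a clone it owns; the
                     * original stays with the sending socket. */
                    nskb = skb_clone(skb, GFP_ATOMIC);
                    if (unlikely(!nskb)) {
                            kfree_skb(skb);
                            return NULL;
                    }

                    can_skb_set_owner(nskb, skb->sk);
                    consume_skb(skb);   /* drop the caller's reference */
                    return nskb;
            }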
      
      The problem shows up in the j1939 stack when it clones the incoming
      skb: the clone operation detects the already-zero refcount.
      
      We can easily reproduce this with the following example:
      
      testj1939 -B -r can0: &
      cansend can0 1823ff40#0123
      
      WARNING: CPU: 0 PID: 293 at lib/refcount.c:25 refcount_warn_saturate+0x108/0x174
      refcount_t: addition on 0; use-after-free.
      Modules linked in: coda_vpu imx_vdoa videobuf2_vmalloc dw_hdmi_ahb_audio vcan
      CPU: 0 PID: 293 Comm: cansend Not tainted 5.5.0-rc6-00376-g9e20dcb7040d #1
      Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
      Backtrace:
      [<c010f570>] (dump_backtrace) from [<c010f90c>] (show_stack+0x20/0x24)
      [<c010f8ec>] (show_stack) from [<c0c3e1a4>] (dump_stack+0x8c/0xa0)
      [<c0c3e118>] (dump_stack) from [<c0127fec>] (__warn+0xe0/0x108)
      [<c0127f0c>] (__warn) from [<c01283c8>] (warn_slowpath_fmt+0xa8/0xcc)
      [<c0128324>] (warn_slowpath_fmt) from [<c0539c0c>] (refcount_warn_saturate+0x108/0x174)
      [<c0539b04>] (refcount_warn_saturate) from [<c0ad2cac>] (j1939_can_recv+0x20c/0x210)
      [<c0ad2aa0>] (j1939_can_recv) from [<c0ac9dc8>] (can_rcv_filter+0xb4/0x268)
      [<c0ac9d14>] (can_rcv_filter) from [<c0aca2cc>] (can_receive+0xb0/0xe4)
      [<c0aca21c>] (can_receive) from [<c0aca348>] (can_rcv+0x48/0x98)
      [<c0aca300>] (can_rcv) from [<c09b1fdc>] (__netif_receive_skb_one_core+0x64/0x88)
      [<c09b1f78>] (__netif_receive_skb_one_core) from [<c09b2070>] (__netif_receive_skb+0x38/0x94)
      [<c09b2038>] (__netif_receive_skb) from [<c09b2130>] (netif_receive_skb_internal+0x64/0xf8)
      [<c09b20cc>] (netif_receive_skb_internal) from [<c09b21f8>] (netif_receive_skb+0x34/0x19c)
      [<c09b21c4>] (netif_receive_skb) from [<c0791278>] (can_rx_offload_napi_poll+0x58/0xb4)
      
      Fixes: 0ae89beb ("can: add destructor for self generated skbs")
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Link: http://lore.kernel.org/r/20200124132656.22156-1-o.rempel@pengutronix.de
      Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
  7. 03 November 2020 (3 commits)
    • mm: always have io_remap_pfn_range() set pgprot_decrypted() · f8f6ae5d
      Committed by Jason Gunthorpe
      The purpose of io_remap_pfn_range() is to map IO memory, such as a
      memory mapped IO exposed through a PCI BAR.  IO devices do not
      understand encryption, so this memory must always be decrypted.
      Automatically call pgprot_decrypted() as part of the generic
      implementation.
      
      This fixes a bug where enabling AMD SME breaks subsystems, such as
      RDMA, that use io_remap_pfn_range() to expose BAR pages to user
      space: the CPU encrypts accesses to those BAR pages instead of
      passing unencrypted IO directly to the device.
      
      Places not mapping IO should use remap_pfn_range().
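
      The generic definition then becomes a thin wrapper; a sketch of the
      assumed shape:

            #ifndef io_remap_pfn_range
            /* Generic fallback: force a decrypted mapping for IO. */
            static inline int io_remap_pfn_range(struct vm_area_struct *vma,
                                                 unsigned long addr,
                                                 unsigned long pfn,
                                                 unsigned long size,
                                                 pgprot_t prot)
            {
                    return remap_pfn_range(vma, addr, pfn, size,
                                           pgprot_decrypted(prot));
            }
            #endif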
      
      Fixes: aca20d54 ("x86/mm: Add support to make use of Secure Memory Encryption")
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Dave Young" <dyoung@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Toshimitsu Kani <toshi.kani@hpe.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/0-v1-025d64bdf6c4+e-amd_sme_fix_jgg@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • PM: runtime: Drop pm_runtime_clean_up_links() · d6e36668
      Committed by Rafael J. Wysocki
      After commit d12544fb ("PM: runtime: Remove link state checks in
      rpm_get/put_supplier()") nothing prevents the consumer device's
      runtime PM from acquiring additional references to the supplier
      device after pm_runtime_clean_up_links() has run (or even while it
      is running), so calling this function from __device_release_driver()
      may be pointless (or even harmful).
      
      Moreover, it ignores stateless device links, which makes the
      runtime PM handling of managed and stateless device links
      inconsistent, so it is better to get rid of it entirely.
      
      Fixes: d12544fb ("PM: runtime: Remove link state checks in rpm_get/put_supplier()")
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
      Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • PM: runtime: Drop runtime PM references to supplier on link removal · e0e398e2
      Committed by Rafael J. Wysocki
      While removing a device link, drop the supplier device's runtime PM
      usage counter as many times as needed to drop all of the runtime PM
      references to it from the consumer in addition to dropping the
      consumer's link count.
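
      In effect, the teardown loops the reference drop; a sketch under
      assumed field names (not the verbatim patch):

            /* When tearing down a DL_FLAG_PM_RUNTIME link, return every
             * runtime PM reference the consumer still holds on the
             * supplier; rpm_active counts them, biased by one. */
            while (refcount_dec_not_one(&link->rpm_active))
                    pm_runtime_put(link->supplier);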
      
      Fixes: baa8809f ("PM / runtime: Optimize the use of device links")
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
      Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. 02 November 2020 (1 commit)
  9. 01 November 2020 (1 commit)
  10. 31 October 2020 (1 commit)
  11. 30 October 2020 (11 commits)
  12. 29 October 2020 (3 commits)
    • xsk: Fix possible memory leak at socket close · e5e1a4bc
      Committed by Magnus Karlsson
      Fix a possible memory leak at xsk socket close that is caused by the
      refcounting of the umem object being wrong. The reference count of the
      umem was decremented only after the pool had been freed. Note that if
      the buffer pool is destroyed, it is important that the umem is
      destroyed after the pool, otherwise the umem would disappear while the
      driver is still running. And as the buffer pool needs to be destroyed
      in a work queue, the umem is also (if its refcount reaches zero)
      destroyed after the buffer pool in that same work queue.
      
      What was missing is that the refcount also needs to be decremented
      when the pool is not freed and when the pool has not even been
      created. The first case happens when the refcount of the pool is
      higher than 1, i.e. it is still being used by some other socket using
      the same device and queue id. In this case, it is safe to decrement
      the refcount of the umem outside of the work queue as the umem will
      never be freed because the refcount of the umem is always greater than
      or equal to the refcount of the buffer pool. The second case is if the
      buffer pool has not been created yet, i.e. the socket was closed
      before it was bound but after the umem was created. In this case, it
      is safe to destroy the umem outside of the work queue, since there is
      no pool that can use it by definition.
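
      The resulting teardown logic can be sketched as follows (function
      names assumed from the xsk code; not the verbatim patch):

            static void xsk_destruct(struct sock *sk)
            {
                    struct xdp_sock *xs = xdp_sk(sk);

                    /* Assume xp_put_pool() returns true only when it
                     * dropped the last pool reference and deferred the
                     * pool+umem teardown to the work queue. */
                    if (!xp_put_pool(xs->pool))
                            /* Pool absent or still shared: the umem
                             * cannot go away under a running driver,
                             * so drop our reference right here. */
                            xdp_put_umem(xs->umem);
            }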
      
      Fixes: 1c1efc2a ("xsk: Create and free buffer pool independently from umem")
      Reported-by: syzbot+eb71df123dc2be2c1456@syzkaller.appspotmail.com
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Link: https://lore.kernel.org/bpf/1603801921-2712-1-git-send-email-magnus.karlsson@gmail.com
    • afs: Fix afs_invalidatepage to adjust the dirty region · f86726a6
      Committed by David Howells
      Fix afs_invalidatepage() to adjust the dirty region recorded in
      page->private when truncating a page.  If the dirty region is entirely
      removed, then the private data is cleared and the page dirty state is
      cleared.
      
      Without this, if the page is truncated and then expanded again by
      truncate, zeros from the expanded but no-longer-dirty region may get
      written back to the server if the page gets laundered due to a
      conflicting third-party write.
      
      It mustn't, however, shorten the dirty region of the page if that page is
      still mmapped and has been marked dirty by afs_page_mkwrite(), so a flag is
      stored in page->private to record this.
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Wrap page->private manipulations in inline functions · 185f0c70
      Committed by David Howells
      The afs filesystem uses page->private to store the dirty range within a
      page such that in the event of a conflicting 3rd-party write to the server,
      we write back just the bits that got changed locally.
      
      However, there are a couple of problems with this:
      
       (1) I need a bit to note if the page might be mapped so that partial
           invalidation doesn't shrink the range.
      
       (2) There aren't necessarily sufficient bits to store the entire range of
           data altered (say it's a 32-bit system with 64KiB pages or transparent
           huge pages are in use).
      
      So wrap the accesses in inline functions so that future commits can change
      how this works.
      
      Also move them out of the tracing header into the in-directory header.
      There's not really any need for them to be in the tracing header.
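
      What such accessors can look like, as a sketch that assumes the
      from/to byte bounds are simply packed into the two halves of
      page->private (the real encoding may differ, and hiding it is
      exactly what these helpers are for):

            #define __AFS_PAGE_PRIV_SHIFT   (BITS_PER_LONG / 2)
            #define __AFS_PAGE_PRIV_MASK    ((1UL << __AFS_PAGE_PRIV_SHIFT) - 1)

            static inline size_t afs_page_dirty_from(unsigned long priv)
            {
                    return priv & __AFS_PAGE_PRIV_MASK;     /* low half */
            }

            static inline size_t afs_page_dirty_to(unsigned long priv)
            {
                    return (priv >> __AFS_PAGE_PRIV_SHIFT) &
                            __AFS_PAGE_PRIV_MASK;           /* high half */
            }

            static inline unsigned long afs_page_dirty(size_t from, size_t to)
            {
                    return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from;
            }
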
      Signed-off-by: David Howells <dhowells@redhat.com>