1. March 8, 2022 (24 commits)
    • KVM: x86/mmu: Zap defunct roots via asynchronous worker · efd995da
      Paolo Bonzini committed
      Zap defunct roots, a.k.a. roots that have been invalidated after their
      last reference was initially dropped, asynchronously via the existing work
      queue instead of forcing the work upon the unfortunate task that happened
      to drop the last reference.
      
      If a vCPU task drops the last reference, the vCPU is effectively blocked
      by the host for the entire duration of the zap.  If the root being zapped
      happens to be fully populated with 4KiB leaf SPTEs, e.g. due to dirty logging
      being active, the zap can take several hundred seconds.  Unsurprisingly,
      most guests are unhappy if a vCPU disappears for hundreds of seconds.
      
      E.g. running a synthetic selftest that triggers a vCPU root zap with
      ~64TiB of guest memory and 4KiB SPTEs blocks the vCPU for 900+ seconds.
      Offloading the zap to a worker drops the block time to <100ms.
      
      There is an important nuance to this change.  If the same work item
      was queued twice before the work function has run, it would only
      execute once and one reference would be leaked.  Therefore, now that
      queueing and flushing items is no longer protected by kvm->slots_lock,
      kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
      skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
      must return only after those skipped roots have been zapped as well.
      These two requirements can be satisfied only if _all_ places that
      change invalid to true now schedule the worker before releasing the
      mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
      kvm_tdp_mmu_invalidate_all_roots().
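      
      A minimal model of that invariant, with stand-in types (in the real
      code, the worker is queued on the VM-wide zap workqueue while
      mmu_lock is still held):
      
        #include <stdbool.h>
      
        struct root {
                bool invalid;
                int refcount;
        };
      
        static void queue_zap_worker(struct root *root)
        {
                /* queue_work() on the VM-wide zap workqueue in the real code */
        }
      
        static void invalidate_root(struct root *root)
        {
                /*
                 * Skip roots that are already invalid: their worker is already
                 * queued, and queueing the same work item a second time would
                 * make it run only once, leaking one reference.
                 */
                if (root->invalid)
                        return;
      
                root->invalid = true;
                queue_zap_worker(root); /* before mmu_lock is released */
        }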
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-23-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls · 1b6043e8
      Sean Christopherson committed
      When zapping a TDP MMU root, perform the zap in two passes to avoid
      zapping an entire top-level SPTE while holding RCU, which can induce RCU
      stalls.  In the first pass, zap SPTEs at PG_LEVEL_1G, and then
      zap top-level entries in the second pass.
      
      With 4-level paging, zapping a PGD that is fully populated with 4KiB leaf
      SPTEs takes up to ~7 seconds (time varies based on kernel config,
      number of (v)CPUs, etc...).  With 5-level paging, that time can balloon
      well into hundreds of seconds.
      
      Before remote TLB flushes were omitted, the problem was even worse as
      waiting for all active vCPUs to respond to the IPI introduced significant
      overhead for VMs with large numbers of vCPUs.
      
      By zapping 1GiB SPTEs (both shadow pages and hugepages) in the first pass,
      the amount of work that is done without dropping RCU protection is
      strictly bounded, with the worst case latency for a single operation
      being less than 100ms.
      
      Zapping at 1GiB in the first pass is not arbitrary.  First and foremost,
      KVM relies on being able to zap 1GiB shadow pages in a single shot when
      replacing a shadow page with a hugepage.  Zapping a 1GiB shadow page
      that is fully populated with 4KiB dirty SPTEs also triggers the worst
      case latency due to writing back the struct page accessed/dirty bits for
      each 4KiB page, i.e. the two-pass approach is guaranteed to work so long
      as KVM can cleanly zap a 1GiB shadow page.
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu:     52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
                                                softirq=15759/15759 fqs=5058
         (t=21016 jiffies g=66453 q=238577)
        NMI backtrace for cpu 52
        Call Trace:
         ...
         mark_page_accessed+0x266/0x2f0
         kvm_set_pfn_accessed+0x31/0x40
         handle_removed_tdp_mmu_page+0x259/0x2e0
         __handle_changed_spte+0x223/0x2c0
         handle_removed_tdp_mmu_page+0x1c1/0x2e0
         __handle_changed_spte+0x223/0x2c0
         handle_removed_tdp_mmu_page+0x1c1/0x2e0
         __handle_changed_spte+0x223/0x2c0
         zap_gfn_range+0x141/0x3b0
         kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
         kvm_mmu_zap_all_fast+0x121/0x190
         kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
         kvm_page_track_flush_slot+0x5c/0x80
         kvm_arch_flush_shadow_memslot+0xe/0x10
         kvm_set_memslot+0x172/0x4e0
         __kvm_set_memory_region+0x337/0x590
         kvm_vm_ioctl+0x49c/0xf80
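      
      A sketch of the two-pass shape described above, with stand-in types so
      the fragment is self-contained (the level constant and role field
      mirror the kernel's):
      
        #include <stdbool.h>
      
        struct kvm;                                     /* opaque stand-in */
        struct kvm_mmu_page { struct { int level; } role; };
        #define PG_LEVEL_1G 3
      
        /* Walk the root, zapping SPTEs at and below zap_level (elided). */
        static void zap_root_at_level(struct kvm *kvm, struct kvm_mmu_page *root,
                                      bool shared, int zap_level)
        {
        }
      
        static void zap_root_two_pass(struct kvm *kvm, struct kvm_mmu_page *root,
                                      bool shared)
        {
                /* Pass 1: free at most 1GiB of structure per RCU critical
                 * section, bounding the worst case at <100ms. */
                zap_root_at_level(kvm, root, shared, PG_LEVEL_1G);
      
                /* Pass 2: every top-level entry is now cheap to zap. */
                zap_root_at_level(kvm, root, shared, root->role.level);
        }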
      Reported-by: David Matlack <dmatlack@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-22-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root · 8351779c
      Paolo Bonzini committed
      Allow yielding when zapping SPTEs after the last reference to a valid
      root is put.  Because KVM must drop all SPTEs in response to relevant
      mmu_notifier events, mark defunct roots invalid and reset their refcount
      prior to zapping the root.  Keeping the refcount elevated while the zap
      is in-progress ensures the root is reachable via mmu_notifier until the
      zap completes and the last reference to the invalid, defunct root is put.
      
      Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the
      root being put has a massive paging structure, e.g. zapping a root
      that is backed entirely by 4KiB pages for a guest with 32TiB of memory
      can take hundreds of seconds to complete.
      
        watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
        RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
         __handle_changed_spte+0x1b2/0x2f0 [kvm]
         handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
         __handle_changed_spte+0x1f4/0x2f0 [kvm]
         handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
         __handle_changed_spte+0x1f4/0x2f0 [kvm]
         tdp_mmu_zap_root+0x307/0x4d0 [kvm]
         kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
         kvm_mmu_free_roots+0x22d/0x350 [kvm]
         kvm_mmu_reset_context+0x20/0x60 [kvm]
         kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
         kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
         __se_sys_ioctl+0x77/0xc0
         __x64_sys_ioctl+0x1d/0x20
         do_syscall_64+0x44/0xa0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      KVM currently doesn't put a root from a non-preemptible context, so other
      than the mmu_notifier wrinkle, yielding when putting a root is safe.
      
      Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
      take a reference to each root (it requires mmu_lock be held for the
      entire duration of the walk).
      
      tdp_mmu_next_root() is used only by the yield-friendly iterator.
      
      tdp_mmu_zap_root_work() is explicitly yield friendly.
      
      kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
      but is still yield-friendly in all call sites, as all callers can be
      traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
      kvm_create_vm().
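      
      A single-threaded model of the put path described above (the real
      kvm_tdp_mmu_put_root() manipulates a refcount_t under mmu_lock; names
      and locking are simplified here):
      
        #include <stdbool.h>
      
        struct root {
                bool invalid;
                int refcount;
        };
      
        static void zap_root_yielding(struct root *root)
        {
                /* may drop and reacquire mmu_lock while zapping */
        }
      
        static void put_root(struct root *root)
        {
                if (--root->refcount > 0)
                        return;
      
                /*
                 * Mark the root invalid so new readers skip it, but hold a
                 * temporary reference so mmu_notifier events can still find
                 * it while the yielding zap is in flight.
                 */
                root->invalid = true;
                root->refcount = 1;
      
                zap_root_yielding(root);
      
                if (--root->refcount == 0)
                        ;       /* now the root can actually be freed */
        }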
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-21-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap invalidated roots via asynchronous worker · 22b94c4b
      Paolo Bonzini committed
      Use the system worker threads to zap the roots invalidated
      by the TDP MMU's "fast zap" mechanism, implemented by
      kvm_tdp_mmu_invalidate_all_roots().
      
      At this point, apart from allowing some parallelism in the zapping of
      roots, the workqueue is a glorified linked list: work items are added and
      flushed entirely within a single kvm->slots_lock critical section.  However,
      the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
      assumes that it owns a reference to all invalid roots; therefore, no
      one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
      invalidated roots on a linked list... erm, on a workqueue ensures that
      tdp_mmu_zap_root_work() only puts back those extra references that
      kvm_mmu_zap_all_invalidated_roots() had gifted to it.
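      
      The resulting flow, sketched with stand-in helpers (in the real code
      the queue is the workqueue this patch adds to the VM, and the wait is
      flush_workqueue()):
      
        struct kvm;
      
        static void mmu_write_lock(struct kvm *kvm) {}
        static void mmu_write_unlock(struct kvm *kvm) {}
        /* Mark every valid root invalid, queueing one work item per root. */
        static void invalidate_all_roots(struct kvm *kvm) {}
        /* Wait for every queued zap worker to finish. */
        static void flush_zap_workqueue(struct kvm *kvm) {}
      
        /* Both steps run inside a single kvm->slots_lock critical section. */
        static void zap_all_fast(struct kvm *kvm)
        {
                mmu_write_lock(kvm);
                invalidate_all_roots(kvm);
                mmu_write_unlock(kvm);
      
                flush_zap_workqueue(kvm);
        }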
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages · bb95dfb9
      Sean Christopherson committed
      Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
      of immediately flushing.  Because the shadow pages are freed in an RCU
      callback, so long as at least one CPU holds RCU, all CPUs are protected.
      For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
      needs to ensure the caller services the pending TLB flush before dropping
      its RCU protections.  I.e. use the caller's RCU as a proxy for all vCPUs
      running in the guest.
      
      Deferring the flushes allows batching flushes, e.g. when installing a
      1GiB hugepage and zapping a pile of SPs.  And when zapping an entire root,
      deferring flushes allows skipping the flush entirely (because flushes are
      not needed in that case).
      
      Avoiding flushes when zapping an entire root is especially important as
      synchronizing with other CPUs via IPI after zapping every shadow page can
      cause significant performance issues for large VMs.  The issue is
      exacerbated by KVM zapping entire top-level entries without dropping
      RCU protection, which can lead to RCU stalls even when zapping roots
      backing relatively "small" amounts of guest memory, e.g. 2TiB.  Removing
      the IPI bottleneck largely mitigates the RCU issues, though it's likely
      still a problem for 5-level paging.  A future patch will further address
      the problem by zapping roots in multiple passes to avoid holding RCU for
      an extended duration.
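      
      The caller-side contract, as a self-contained sketch with stand-in
      helpers (the real code pairs kvm_flush_remote_tlbs() with
      rcu_read_unlock()):
      
        #include <stdbool.h>
      
        struct kvm;
      
        static bool zap_range(struct kvm *kvm) { return true; /* flush owed */ }
        static void flush_remote_tlbs(struct kvm *kvm) {}
        static void rcu_unlock(void) {}
      
        static void zap_then_flush(struct kvm *kvm)
        {
                bool flush = zap_range(kvm);    /* flushes are batched */
      
                if (flush)
                        flush_remote_tlbs(kvm); /* flush first ... */
                rcu_unlock();                   /* ... then drop RCU */
        }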
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched · bd296779
      Sean Christopherson committed
      When yielding in the TDP MMU iterator, service any pending TLB flush
      before dropping RCU protections in anticipation of using the caller's RCU
      "lock" as a proxy for vCPUs in the guest.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-19-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() · cf3e2642
      Sean Christopherson committed
      Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
      functions accordingly.  When removing mappings for functional correctness
      (except for the stupid VFIO GPU passthrough memslots bug), zapping the
      leaf SPTEs is sufficient as the paging structures themselves do not point
      at guest memory and do not directly impact the final translation (in the
      TDP MMU).
      
      Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
      the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
      kvm_unmap_gfn_range().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range · acbda82a
      Sean Christopherson committed
      Now that all callers of zap_gfn_range() hold mmu_lock for write, drop
      support for zapping with mmu_lock held for read.  That all callers hold
      mmu_lock for write isn't a random coincidence; now that the paths that
      need to zap _everything_ have their own path, the only callers left are
      those that need to zap for functional correctness.  And when zapping is
      required for functional correctness, mmu_lock must be held for write,
      otherwise the caller has no guarantees about the state of the TDP MMU
      page tables after it has run, e.g. the SPTE(s) it zapped can be
      immediately replaced by a vCPU faulting in a page.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-17-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page · e2b5b21d
      Sean Christopherson committed
      Add a dedicated helper for zapping a TDP MMU root, and use it in the three
      flows that do "zap_all" and intentionally do not do a TLB flush if SPTEs
      are zapped (zapping an entire root is safe if and only if it cannot be in
      use by any vCPU).  Because a TLB flush is never required, unconditionally
      pass "false" to tdp_mmu_iter_cond_resched() when potentially yielding.
      
      Opportunistically document why KVM must not yield when zapping roots that
      are being zapped by kvm_tdp_mmu_put_root(), i.e. roots whose refcount has
      reached zero, and further harden the flow to detect improper KVM behavior
      with respect to roots that are supposed to be unreachable.
      
      In addition to hardening zapping of roots, isolating zapping of roots
      will allow future simplification of zap_gfn_range() by having it zap only
      leaf SPTEs, and by removing its tricky "zap all" heuristic.  By having
      all paths that truly need to free _all_ SPs flow through the dedicated
      root zapper, the generic zapper can be freed of those concerns.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU · 77c8cd6b
      Sean Christopherson committed
      Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM
      uses the slow version of "zap everything" is when the VM is being
      destroyed or the owning mm has exited.  In either case, KVM_RUN is
      unreachable for the VM, i.e. the guest TLB entries cannot be consumed.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-15-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery · c10743a1
      Sean Christopherson committed
      When recovering a potential hugepage that was shattered for the iTLB
      multihit workaround, precisely zap only the target page instead of
      iterating over the TDP MMU to find the SP that was passed in.  This will
      allow future simplification of zap_gfn_range() by having it zap only
      leaf SPTEs.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values · 626808d1
      Sean Christopherson committed
      Refactor __tdp_mmu_set_spte() to work with raw values instead of a
      tdp_iter object so that a future patch can modify SPTEs without doing a
      walk, and without having to synthesize a tdp_iter.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-13-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path · 966da62a
      Sean Christopherson committed
      WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE,
      which is called out by the comment as being disallowed but not actually
      checked.  Keep the WARN on the old_spte as well, because overwriting a
      REMOVED_SPTE in the non-atomic path is also disallowed (as evidenced by
      the lack of splats with the existing WARN).
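      
      A self-contained sketch of the tightened check; the REMOVED_SPTE value
      and WARN macro are stand-ins for the kernel's:
      
        #include <stdint.h>
        #include <stdio.h>
      
        #define REMOVED_SPTE 0x5a0ULL   /* stand-in value */
        #define WARN_ON_ONCE(c) \
                do { if (c) fprintf(stderr, "WARN\n"); } while (0)
      
        static int is_removed_spte(uint64_t spte)
        {
                return spte == REMOVED_SPTE;
        }
      
        /* The patch adds the new_spte check; the old_spte one existed. */
        static void set_spte_nonatomic(uint64_t *sptep, uint64_t old_spte,
                                       uint64_t new_spte)
        {
                WARN_ON_ONCE(is_removed_spte(old_spte));
                WARN_ON_ONCE(is_removed_spte(new_spte));
                *sptep = new_spte;
        }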
      
      Fixes: 08f07c80 ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-12-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU · 0e587aa7
      Sean Christopherson committed
      Add helpers to read and write TDP MMU SPTEs instead of open coding
      rcu_dereference() all over the place, and to provide a convenient
      location to document why KVM doesn't exempt holding mmu_lock for write
      from having to hold RCU (and any future changes to the rules).
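      
      The helpers, paraphrased from the patch (kernel context, not a
      standalone program); the one-line bodies are the point, a single
      documented home for the rcu_dereference():
      
        static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
        {
                return READ_ONCE(*rcu_dereference(sptep));
        }
      
        static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
        {
                WRITE_ONCE(*rcu_dereference(sptep), new_spte);
        }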
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-11-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks · a151acec
      Sean Christopherson committed
      Drop RCU protection after processing each root when handling MMU notifier
      hooks that aren't the "unmap" path, i.e. aren't zapping.  Temporarily
      drop RCU to let RCU do its thing between roots, and to make it clear that
      there's no special behavior that relies on holding RCU across all roots.
      
      Currently, the RCU protection is completely superficial; it's necessary
      only to make rcu_dereference() of SPTE pointers happy.  A future patch
      will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
      ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
      rather, acknowledge the need for a flush), but in that case RCU needs to
      be held until the flush is complete if and only if the flush is needed
      because a shadow page may have been removed.  And except for the "unmap"
      path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit,
      and can't change the PFN for a SP).
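      
      The shape of the change, abbreviated (kernel context; the iterator and
      handler details are elided):
      
        for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
                rcu_read_lock();
      
                /* process this root; no SP can be removed from here */
                ret |= handler(kvm, root, range);
      
                rcu_read_unlock();      /* let RCU breathe between roots */
        }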
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-10-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte · 93fa50f6
      Sean Christopherson committed
      Batch TLB flushes (with other MMUs) when handling ->change_spte()
      notifications in the TDP MMU.  The MMU notifier path in question doesn't
      allow yielding and correctly flushes before dropping mmu_lock.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-9-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal · c8e5a0d0
      Sean Christopherson committed
      Look for a !leaf=>leaf conversion instead of a PFN change when checking
      if a SPTE change removed a TDP MMU shadow page.  Convert the PFN check
      into a WARN, as KVM should never change the PFN of a shadow page (except
      when it's being zapped or replaced).
      
      From a purely theoretical perspective, it's not illegal to replace a SP
      with a hugepage pointing at the same PFN.  In practice, it's impossible
      as that would require mapping guest memory overtop a kernel-allocated SP.
      Either way, the check is odd.
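      
      The new predicate, as a self-contained sketch (in the kernel it lives
      inline in the SPTE change handler):
      
        #include <stdbool.h>
      
        /*
         * A child page table is removed iff a shadow-present, non-leaf SPTE
         * becomes a leaf SPTE or is zapped outright.  A PFN change without
         * such a transition is now a WARN instead of a trigger.
         */
        static bool child_pt_removed(bool was_present, bool was_leaf,
                                     bool is_present, bool is_leaf)
        {
                return was_present && !was_leaf && (is_leaf || !is_present);
        }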
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-8-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: do not allow readers to acquire references to invalid roots · 614f6970
      Paolo Bonzini committed
      Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus ensuring
      that readers do not ever acquire a reference to an invalid root.  After this
      patch, all readers except kvm_tdp_mmu_zap_invalidated_roots() treat
      refcount=0/valid, refcount=0/invalid and refcount=1/invalid in exactly the
      same way.  kvm_tdp_mmu_zap_invalidated_roots() is different but it also
      does not acquire a reference to the invalid root, and it cannot see
      refcount=0/invalid because it is guaranteed to run after
      kvm_tdp_mmu_invalidate_all_roots().
      
      Opportunistically add a lockdep assertion to the yield-safe iterator.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: only perform eager page splitting on valid roots · 7c554d8e
      Paolo Bonzini committed
      Eager page splitting is an optimization; it does not have to be performed on
      invalid roots.  It is also the only case in which a reader might acquire
      a reference to an invalid root, so after this change we know that readers
      will skip both dying and invalid roots.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter · 226b8c8f
      Sean Christopherson committed
      Assert that mmu_lock is held for write by users of the yield-unfriendly
      TDP iterator.  The nature of a shared walk means that the caller needs to
      play nice with other tasks modifying the page tables, which is more or
      less the same thing as playing nice with yielding.  Theoretically, KVM
      could gain a flow where it could legitimately take mmu_lock for read in
      a non-preemptible context, but that's highly unlikely and any such case
      should be viewed with a fair amount of scrutiny.
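      
      The enforcement, sketched in kernel context (lockdep_assert_held_write()
      is the real lockdep helper; the iteration body is elided):
      
        static void walk_roots_unyielding(struct kvm *kvm)
        {
                /* Document and enforce the locking requirement up front. */
                lockdep_assert_held_write(&kvm->mmu_lock);
      
                /* ... iterate roots without taking per-root references ... */
        }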
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush · 7ae5840e
      Sean Christopherson committed
      Remove the misleading flush "handling" when zapping invalidated TDP MMU
      roots, and document that flushing is unnecessary for all flavors of MMUs
      when zapping invalid/obsolete roots/pages.  The "handling" in the TDP MMU
      is dead code, as zap_gfn_range() is called with shared=true, in which
      case it will never return true due to the flushing being handled by
      tdp_mmu_zap_spte_atomic().
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic · db01416b
      Sean Christopherson committed
      Explicitly ignore the result of zap_gfn_range() when putting the last
      reference to a TDP MMU root, and add a pile of comments to formalize the
      TDP MMU's behavior of deferring TLB flushes to alloc/reuse.  Note, this
      only affects the !shared case, as zap_gfn_range() subtly never returns
      true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().
      
      Putting the root without a flush is ok because even if there are stale
      references to the root in the TLB, they are unreachable because KVM will
      not run the guest with the same ASID without first flushing (where ASID
      in this context refers to both SVM's explicit ASID and Intel's implicit
      ASID that is constructed from VPID+PCID+EPT4A+etc...).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-5-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap · f28e9c7f
      Sean Christopherson committed
      Fix misleading and arguably wrong comments in the TDP MMU's fast zap
      flow.  The comments, and the fact that actually zapping invalid roots was
      added separately, strongly suggests that zapping invalid roots is an
      optimization and not required for correctness.  That is a lie.
      
      KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(),
      because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(),
      KVM is relying on it to fully remove all references to the memslot.  Once
      the memslot is gone, KVM's mmu_notifier hooks will be unable to find the
      stale references as the hva=>gfn translation is done via the memslots.
      If KVM doesn't immediately zap SPTEs and userspace unmaps a range after
      deleting a memslot, KVM will fail to zap in response to the mmu_notifier
      due to not finding a memslot corresponding to the notifier's range, which
      leads to a variation of use-after-free.
      
      The other misleading comment (and code) explicitly states that roots
      without a reference should be skipped.  While that's technically true,
      it's also extremely misleading as it should be impossible for KVM to
      encounter a defunct root on the list while holding mmu_lock for write.
      Opportunistically add a WARN to enforce that invariant.
      
      Fixes: b7cccd39 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
      Fixes: 4c6654bd ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU · 3354ef5a
      Sean Christopherson committed
      Explicitly check for present SPTEs when clearing dirty bits in the TDP
      MMU.  This isn't strictly required for correctness, as setting the dirty
      bit in a defunct SPTE will not change the SPTE from !PRESENT to PRESENT.
      However, the guarded MMU_WARN_ON() in spte_ad_need_write_protect() would
      complain if anyone actually turned on KVM's MMU debugging.
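      
      The added guard, abbreviated (kernel context; is_shadow_present_pte()
      is the real helper):
      
        tdp_root_for_each_pte(iter, root, start, end) {
                if (!is_shadow_present_pte(iter.old_spte))
                        continue;       /* nothing to clear, and this keeps
                                         * spte_ad_need_write_protect()'s
                                         * MMU_WARN_ON() quiet */
      
                /* ... clear the dirty (or writable) bit ... */
        }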
      
      Fixes: a6a0b05d ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-3-seanjc@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. February 25, 2022 (1 commit)
  3. February 11, 2022 (15 commits)
    • KVM: x86/mmu: Add tracepoint for splitting huge pages · e0b728b1
      David Matlack committed
      Add a tracepoint that records whenever KVM eagerly splits a huge page
      and the error status of the split to indicate if it succeeded or failed
      and why.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-18-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG · cb00a70b
      David Matlack committed
      When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
      write-protected when dirty logging is enabled on the memslot. Instead
      they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
      the first time and only for the specific sub-region being cleared.
      
      Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
      write-protecting to avoid causing write-protection faults on vCPU
      threads. This also allows userspace to smear the cost of huge page
      splitting across multiple ioctls, rather than splitting the entire
      memslot as is the case when initially-all-set is not used.
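      
      For reference, the userspace side, as a minimal sketch (assumes the
      KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 capability is enabled; error
      handling omitted):
      
        #include <linux/kvm.h>
        #include <sys/ioctl.h>
      
        static void clear_dirty_window(int vm_fd, void *dirty_bitmap)
        {
                struct kvm_clear_dirty_log clr = {
                        .slot = 0,              /* memslot id */
                        .first_page = 0,        /* must be a multiple of 64 */
                        .num_pages = 64,
                        .dirty_bitmap = dirty_bitmap,   /* 1 bit per page */
                };
      
                /* With this patch, KVM also eagerly splits any huge pages
                 * in the window before write-protecting it. */
                ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clr);
        }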
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled · a3fe5dbd
      David Matlack committed
      When dirty logging is enabled without initially-all-set, try to split
      all huge pages in the memslot down to 4KB pages so that vCPUs do not
      have to take expensive write-protection faults to split huge pages.
      
      Eager page splitting is best-effort only. This commit only adds the
      support for the TDP MMU, and even there splitting may fail due to
      out-of-memory conditions. Failure to split a huge page is fine from a
      correctness standpoint because KVM will always follow up splitting by
      write-protecting any remaining huge pages.
      
      Eager page splitting moves the cost of splitting huge pages off of the
      vCPU threads and onto the thread enabling dirty logging on the memslot.
      This is useful because:
      
       1. Splitting on the vCPU thread interrupts vCPU execution and is
          disruptive to customers whereas splitting on VM ioctl threads can
          run in parallel with vCPU execution.
      
       2. Splitting all huge pages at once is more efficient because it does
          not require performing VM-exit handling or walking the page table for
          every 4KiB page in the memslot, and greatly reduces the amount of
          contention on the mmu_lock.
      
      For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
      per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
      all of their memory after dirty logging is enabled decreased by 95% from
      2.94s to 0.14s.
      
      Eager Page Splitting is over 100x more efficient than the current
      implementation of splitting on fault under the read lock. For example,
      taking the same workload as above, Eager Page Splitting reduced the CPU
      required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
      * 96 vCPU threads) to only 1.55 CPU-seconds.
      
      Eager page splitting does increase the amount of time it takes to enable
      dirty logging since it has to split all huge pages. For example, the time
      it took to enable dirty logging in the 96GiB region of the
      aforementioned test increased from 0.001s to 1.55s.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Separate TDP MMU shadow page allocation and initialization · a82070b6
      David Matlack committed
      Separate the allocation of shadow pages from their initialization.  This
      is in preparation for splitting huge pages outside of the vCPU fault
      context, which requires a different allocation mechanism.
      
      No functional change intended.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-15-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Derive page role for TDP MMU shadow pages from parent · a3aca4de
      David Matlack committed
      Derive the page role from the parent shadow page, since the only thing
      that changes is the level. This is in preparation for splitting huge
      pages during VM-ioctls which do not have access to the vCPU MMU context.
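      
      The derivation is essentially one line, sketched here (kernel context;
      the helper name is illustrative, the role type is the kernel's):
      
        static union kvm_mmu_page_role child_role(struct kvm_mmu_page *parent)
        {
                union kvm_mmu_page_role role = parent->role;
      
                role.level--;   /* a child table sits one level down */
                return role;
        }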
      
      No functional change intended.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-14-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Remove redundant role overrides for TDP MMU shadow pages · a81399a5
      David Matlack committed
      The vCPU's mmu_role already has the correct values for direct,
      has_4_byte_gpte, access, and ad_disabled. Remove the code that was
      redundantly overwriting these fields with the same values.
      
      No functional change intended.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-13-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Refactor TDP MMU iterators to take kvm_mmu_page root · 77aa6075
      David Matlack committed
      Instead of passing a pointer to the root page table and the root level
      separately, pass in a pointer to the root kvm_mmu_page struct.  This
      reduces the number of arguments by 1, cutting down on line lengths.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-12-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Consolidate logic to atomically install a new TDP MMU page table · 7b7e1ab6
      David Matlack committed
      Consolidate the logic to atomically replace an SPTE with an SPTE that
      points to a new page table into a single helper function. This will be
      used in a follow-up commit to split huge pages, which involves replacing
      each huge page SPTE with an SPTE that points to a page table.
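      
      A sketch of the consolidated helper (kernel context; names approximate
      the patch, and the accounting call is illustrative):
      
        static int tdp_mmu_install_sp_atomic(struct kvm *kvm,
                                             struct tdp_iter *iter,
                                             struct kvm_mmu_page *sp)
        {
                u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
                int ret;
      
                /* A single cmpxchg swaps the old SPTE for one pointing at
                 * the new table; -EBUSY means another task won the race. */
                ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
                if (ret)
                        return ret;
      
                tdp_mmu_account_sp(kvm, sp);    /* illustrative */
                return 0;
        }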
      
      Opportunistically drop the call to trace_kvm_mmu_get_page() in
      kvm_tdp_mmu_map() since it is redundant with the identical tracepoint in
      tdp_mmu_alloc_sp().
      
      No functional change intended.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-8-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename handle_removed_tdp_mmu_page() to handle_removed_pt() · 0f53dfa3
      David Matlack committed
      First remove tdp_mmu_ from the name since it is redundant given that it
      is a static function in tdp_mmu.c. There is a pattern of using tdp_mmu_
      as a prefix in the names of static TDP MMU functions, but all of the
      other handle_*() variants do not include such a prefix. So drop it
      entirely.
      
      Then change "page" to "pt" to convey that this is operating on a page
      table rather than an struct page. Purposely use "pt" instead of "sp"
      since this function takes the raw RCU-protected page table pointer as an
      argument rather than  a pointer to the struct kvm_mmu_page.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-7-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages · c298a30c
      David Matlack committed
      Rename 3 functions in tdp_mmu.c that handle shadow pages:
      
        alloc_tdp_mmu_page()  -> tdp_mmu_alloc_sp()
        tdp_mmu_link_page()   -> tdp_mmu_link_sp()
        tdp_mmu_unlink_page() -> tdp_mmu_unlink_sp()
      
      These changes make tdp_mmu a consistent prefix before the verb in the
      function name, and make it clearer that these functions deal with
      kvm_mmu_page structs rather than struct pages.
      
      One could argue that "shadow page" is the wrong term for a page table in
      the TDP MMU since it never actually shadows a guest page table.
      However, "shadow page" (or "sp" for short) has evolved to become the
      standard term in KVM when referring to a kvm_mmu_page struct, and its
      associated page table and other metadata, regardless of whether the page
      table shadows a guest page table. So this commit just makes the TDP MMU
      more consistent with the rest of KVM.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-6-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return 0/-EBUSY · 3e72c791
      David Matlack committed
      tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic() return a bool
      with true indicating the SPTE modification was successful and false
      indicating failure. Change these functions to return an int instead
      since that is the common practice.
      
      Opportunistically fix up the kernel-doc style for the Return section
      above tdp_mmu_set_spte_atomic().
      
      No functional change intended.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-5-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails · 3255530a
      David Matlack committed
      Consolidate a bunch of code that was manually re-reading the spte if the
      cmpxchg failed. There is no extra cost of doing this because we already
      have the spte value as a result of the cmpxchg (and in fact this
      eliminates re-reading the spte), and none of the call sites depend on
      iter->old_spte retaining the stale spte value.
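      
      The resulting pattern, abbreviated (kernel context): cmpxchg64()
      returns the value it found, which on failure is exactly the refreshed
      old_spte:
      
        static int tdp_mmu_set_spte_atomic(struct kvm *kvm,
                                           struct tdp_iter *iter,
                                           u64 new_spte)
        {
                u64 *sptep = rcu_dereference(iter->sptep);
                u64 old_spte;
      
                old_spte = cmpxchg64(sptep, iter->old_spte, new_spte);
                if (old_spte != iter->old_spte) {
                        /* Another task changed the SPTE; hand the caller
                         * the fresh value at no extra cost. */
                        iter->old_spte = old_spte;
                        return -EBUSY;
                }
      
                return 0;
        }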
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-4-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Check SPTE writable invariants when setting leaf SPTEs · 115111ef
      David Matlack committed
      Check SPTE writable invariants when setting SPTEs rather than in
      spte_can_locklessly_be_made_writable(). By the time KVM checks
      spte_can_locklessly_be_made_writable(), the SPTE has long since been
      corrupted.
      
      Note that these invariants only apply to shadow-present leaf SPTEs (i.e.
      not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the
      restriction and only instrument the code paths that set shadow-present
      leaf SPTEs.
      
      To account for access tracking, also check the SPTE writable invariants
      when marking an SPTE as an access track SPTE. This also lets us remove
      a redundant WARN from mark_spte_for_access_track().
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220125230518.1697048-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/tdp_mmu: Remove unused "kvm" of kvm_tdp_mmu_get_root() · ad6d6b94
      Jinrong Liang committed
      The "struct kvm *kvm" parameter of kvm_tdp_mmu_get_root() is not used,
      so remove it. No functional change intended.
      Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
      Message-Id: <20220125095909.38122-5-cloudliang@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap _all_ roots when unmapping gfn range in TDP MMU · d62007ed
      Sean Christopherson committed
      Zap both valid and invalid roots when zapping/unmapping a gfn range, as
      KVM must ensure it holds no references to the freed page after returning
      from the unmap operation.  Most notably, the TDP MMU doesn't zap invalid
      roots in mmu_notifier callbacks.  This leads to use-after-free and other
      issues if the mmu_notifier runs to completion while an invalid root
      zapper yields as KVM fails to honor the requirement that there must be
      _no_ references to the page after the mmu_notifier returns.
      
      The bug is most easily reproduced by hacking KVM to cause a collision
      between set_nx_huge_pages() and kvm_mmu_notifier_release(), but the bug
      exists between kvm_mmu_notifier_invalidate_range_start() and memslot
      updates as well.  Invalidating a root ensures pages aren't accessible by
      the guest, and KVM won't read or write page data itself, but KVM will
      trigger e.g. kvm_set_pfn_dirty() when zapping SPTEs, and thus completing
      a zap of an invalid root _after_ the mmu_notifier returns is fatal.
      
        WARNING: CPU: 24 PID: 1496 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:173 [kvm]
        RIP: 0010:kvm_is_zone_device_pfn+0x96/0xa0 [kvm]
        Call Trace:
         <TASK>
         kvm_set_pfn_dirty+0xa8/0xe0 [kvm]
         __handle_changed_spte+0x2ab/0x5e0 [kvm]
         __handle_changed_spte+0x2ab/0x5e0 [kvm]
         __handle_changed_spte+0x2ab/0x5e0 [kvm]
         zap_gfn_range+0x1f3/0x310 [kvm]
         kvm_tdp_mmu_zap_invalidated_roots+0x50/0x90 [kvm]
         kvm_mmu_zap_all_fast+0x177/0x1a0 [kvm]
         set_nx_huge_pages+0xb4/0x190 [kvm]
         param_attr_store+0x70/0x100
         module_attr_store+0x19/0x30
         kernfs_fop_write_iter+0x119/0x1b0
         new_sync_write+0x11c/0x1b0
         vfs_write+0x1cc/0x270
         ksys_write+0x5f/0xe0
         do_syscall_64+0x38/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Fixes: b7cccd39 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211215011557.399940-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>