1. 24 June 2022, 2 commits
    • KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU · 47855da0
      Committed by David Matlack
      Currently make_huge_page_split_spte() assumes execute permissions can be
      granted to any 4K SPTE when splitting huge pages. This is true for the
      TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
      shadowing a non-executable huge page.
      
      To fix this, pass in the role of the child shadow page where the huge
      page will be split and derive the execution permission from that.  This
      is correct because huge pages are always split with direct shadow pages,
      and thus the shadow page role contains the correct access permissions.
      
      No functional change intended.
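      As a rough, hedged sketch of the idea (the role structure, bit names, and
      masks below are invented for illustration and are not KVM's definitions):
      
          #include <stdbool.h>
          #include <stdint.h>
          
          /* Hypothetical, simplified stand-in for the child shadow page's role. */
          struct child_role_sketch {
              unsigned int level;
              unsigned int access;            /* permission bits from the walk */
          };
          
          #define ACC_EXEC_SKETCH 0x4             /* illustrative "exec allowed" bit */
          #define SPTE_NX_SKETCH  (1ULL << 63)    /* illustrative no-execute bit */
          
          /* Split one 4K child SPTE out of a huge SPTE, granting execute permission
           * only if the child shadow page's role allows it (always true for the TDP
           * MMU's direct pages, but not necessarily for the shadow MMU). */
          static uint64_t make_split_spte_sketch(uint64_t huge_spte,
                                                 struct child_role_sketch role)
          {
              uint64_t child_spte = huge_spte;
          
              if (!(role.access & ACC_EXEC_SKETCH))
                  child_spte |= SPTE_NX_SKETCH;
          
              return child_spte;
          }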
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-19-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      47855da0
    • KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis · 084cc29f
      Committed by Ben Gardon
      In some cases, the NX hugepage mitigation for iTLB multihit is not
      needed for all guests on a host. Allow disabling the mitigation on a
      per-VM basis to avoid the performance hit of NX hugepages on trusted
      workloads.
      
      In order to disable NX hugepages on a VM, ensure that the userspace
      actor has permission to reboot the system. Since disabling NX hugepages
      would allow a guest to crash the system, it is similar to reboot
      permissions.
      
      Ideally, KVM would require userspace to prove it has access to KVM's
      nx_huge_pages module param, e.g. so that userspace can opt out without
      needing full reboot permissions.  But getting access to the module param
      file info is difficult because it is buried in layers of sysfs and module
      glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases.
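      A minimal sketch of the permission gate, assuming it sits in a VM-scoped
      enable-cap style handler; capable() and CAP_SYS_BOOT are real kernel APIs,
      everything else here is illustrative rather than KVM's actual plumbing:
      
          #include <linux/capability.h>
          #include <linux/errno.h>
          #include <linux/types.h>
          
          static int disable_nx_huge_pages_sketch(bool *vm_disable_nx_huge_pages)
          {
              /* Disabling the iTLB multihit mitigation lets a guest DoS the host,
               * so demand the same privilege as rebooting the machine. */
              if (!capable(CAP_SYS_BOOT))
                  return -EPERM;
          
              *vm_disable_nx_huge_pages = true;
              return 0;
          }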
      Suggested-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220613212523.3436117-9-bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      084cc29f
  2. 20 June 2022, 3 commits
    • KVM: x86/mmu: Shove refcounted page dependency into host_pfn_mapping_level() · 5d49f08c
      Committed by Sean Christopherson
      Move the check that restricts mapping huge pages into the guest to pfns
      that are backed by refcounted 'struct page' memory into the helper that
      actually "requires" a 'struct page', host_pfn_mapping_level().  In
      addition to deduplicating code, moving the check to the helper eliminates
      the subtle requirement that the caller check that the incoming pfn is
      backed by a refcounted struct page, and as an added bonus avoids an extra
      pfn_to_page() lookup.
      
      Note, the is_error_noslot_pfn() check in kvm_mmu_hugepage_adjust() needs
      to stay where it is, as it guards against dereferencing a NULL memslot in
      the kvm_slot_dirty_track_enabled() that follows.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5d49f08c
    • KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() · b14b2690
      Committed by Sean Christopherson
      Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()
      to better reflect what KVM is actually checking, and to eliminate extra
      pfn_to_page() lookups.  The kvm_release_pfn_*() and kvm_try_get_pfn()
      helpers in particular benefit from "refcounted" nomenclature, as it's not
      all that obvious why KVM needs to get/put refcounts for some PG_reserved
      pages (ZERO_PAGE and ZONE_DEVICE).
      
      Add a comment to call out that the list of exceptions to PG_reserved is
      all but guaranteed to be incomplete.  The list has mostly been compiled
      by people throwing noodles at KVM and finding out they stick a little too
      well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages
      didn't get freed.
      
      No functional change intended.
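      A hedged sketch of what the new name conveys: map a pfn to a refcounted
      'struct page' if one exists, rather than answering the vaguer "is this pfn
      reserved?".  The exception shown (the zero page) is illustrative; as the
      commit notes, the real list of exceptions in KVM is incomplete:
      
          #include <linux/mm.h>
          #include <linux/pgtable.h>
          
          static struct page *pfn_to_refcounted_page_sketch(unsigned long pfn)
          {
              struct page *page;
          
              if (!pfn_valid(pfn))
                  return NULL;
          
              page = pfn_to_page(pfn);
              if (!PageReserved(page))
                  return page;
          
              /* Some PG_reserved pages, e.g. the zero page, are refcounted anyway. */
              if (is_zero_pfn(pfn))
                  return page;
          
              return NULL;
          }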
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b14b2690
    • KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs · 2ca3129e
      Committed by Sean Christopherson
      Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs
      (PT64).  SPTE and PT64 are _mostly_ the same, but the few differences are
      quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and
      guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is
      very much specific to SPTEs.
      
      Opportunistically (and temporarily) move most guest macros into paging.h
      to clearly associate them with shadow paging, and to ensure that they're
      not used as of this commit.  A future patch will eliminate them entirely.
      
      Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's
      needed for the quadrant calculation in kvm_mmu_get_page().  The quadrant
      calculation is hot enough (when using shadow paging with 32-bit guests)
      that adding a per-context helper is undesirable, and burying the
      computation in paging_tmpl.h with a forward declaration isn't exactly an
      improvement.
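      Illustrative only (placeholder widths, not the kernel's actual masks), the
      shape of the split is: guest PT64_* macros describe guest-physical
      addresses, while SPTE_* macros describe host-physical addresses, and the
      two need not agree:
      
          /* Guest 64-bit PTE: bits 51:12 hold the guest-physical page frame. */
          #define PT64_BASE_ADDR_MASK_SKETCH   (((1ULL << 52) - 1) & ~0xfffULL)
          
          /* Shadow PTE: the usable host-physical width can differ, e.g. a host
           * with 46 physical address bits (placeholder value). */
          #define SPTE_BASE_ADDR_MASK_SKETCH   (((1ULL << 46) - 1) & ~0xfffULL)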
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2ca3129e
  3. 15 June 2022, 1 commit
  4. 07 June 2022, 1 commit
    • KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging · 5ba7c4c6
      Committed by Ben Gardon
      Currently disabling dirty logging with the TDP MMU is extremely slow.
      On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds
      to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with
      the shadow MMU.
      
      When disabling dirty logging, zap non-leaf parent entries to allow
      replacement with huge pages instead of recursing and zapping all of the
      child, leaf entries. This reduces the number of TLB flushes required
      and reduces the time to disable dirty logging with the TDP MMU to ~3 seconds.
      
      Opportunistically add a WARN() to catch GFNs that are mapped at a
      higher level than their max level.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220525230904.1584480-1-bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5ba7c4c6
  5. 12 May 2022, 2 commits
    • KVM: x86/mmu: Expand and clean up page fault stats · 1075d41e
      Committed by Sean Christopherson
      Expand and clean up the page fault stats.  The current stats are at best
      incomplete, and at worst misleading.  Differentiate between faults that
      are actually fixed vs those that result in an MMIO SPTE being created,
      track faults that are spurious, faults that trigger emulation, faults
      that are fixed in the fast path, and last but not least, track the
      number of faults that are taken.
      
      Note, the number of faults that require emulation for write-protected
      shadow pages can roughly be calculated by subtracting the number of MMIO
      SPTEs created from the overall number of faults that trigger emulation.
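      A hedged sketch of the expanded bookkeeping; the counter names mirror the
      description above but are not copied from KVM's stats structures:
      
          #include <stdint.h>
          
          struct pf_stats_sketch {
              uint64_t pf_taken;              /* every fault that reaches the handler */
              uint64_t pf_fixed;              /* resolved by installing/updating a SPTE */
              uint64_t pf_spurious;           /* already fixed, e.g. by another vCPU */
              uint64_t pf_fast;               /* fixed in the lockless fast path */
              uint64_t pf_mmio_spte_created;  /* fault resulted in an MMIO SPTE */
              uint64_t pf_emulate;            /* fault triggered instruction emulation */
          };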
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1075d41e
    • KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use · 54275f74
      Committed by Sean Christopherson
      Check for A/D bits being disabled instead of the access tracking mask
      being non-zero when deciding whether or not to attempt to fix a page
      fault via the fast path.  Originally, the access tracking mask was
      non-zero if and only if A/D bits were disabled by _KVM_ (including not
      being supported by hardware), but that hasn't been true since nVMX was
      fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
      KVM to not use A/D bits while running L2 despite KVM using them while
      running L1.
      
      In other words, don't attempt the fast path just because EPT is enabled.
      
      Note, attempting the fast path for all !PRESENT faults can "fix" a very,
      _VERY_ tiny percentage of faults out of mmu_lock by detecting that the
      fault is spurious, i.e. has been fixed by a different vCPU, but again the
      odds of that happening are vanishingly small.  E.g. booting an 8-vCPU VM
      gets less than 10 successes out of 30k+ faults, and that's likely one of
      the more favorable scenarios.  Disabling dirty logging can likely lead to
      a rash of collisions between vCPUs for some workloads that operate on a
      common set of pages, but penalizing _all_ !PRESENT faults for that one
      case is unlikely to be a net positive, not to mention that that problem
      is best solved by not zapping in the first place.
      
      The number of spurious faults does scale with the number of vCPUs, e.g. a
      255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
      path (again out of 30k), but that's all of 0.2% of faults.  Using legacy
      shadow paging does get more spurious faults, and a few more detected out
      of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
      faults that are reflected into the guest), i.e. the extra detections are
      purely due to the sheer number of faults observed.
      
      On the other hand, getting a "negative" in the fast path takes in the
      neighborhood of 150-250 cycles.  So while it is tempting to keep/extend
      the current behavior, such a change needs to come with hard numbers
      showing that it's actually a win in the grand scheme, or any scheme for
      that matter.
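      A hedged sketch of the resulting gating logic; the parameters stand in for
      KVM's fault state and are named for illustration only:
      
          #include <stdbool.h>
          
          /* For !PRESENT faults, attempt the lockless fast path only when A/D bits
           * are disabled (i.e. SPTEs may be access-tracked), not merely because
           * EPT is in use.  Present write faults may only need W/D bits restored. */
          static bool fault_can_be_fast_sketch(bool ad_bits_enabled, bool spte_present,
                                               bool write_fault)
          {
              if (!spte_present)
                  return !ad_bits_enabled;
          
              return write_fault;
          }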
      
      Fixes: 995f00a6 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54275f74
  6. 03 May 2022, 1 commit
    • KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits · ba3a6120
      Committed by Sean Christopherson
      Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
      if mmu_lock is held for write, as volatile SPTEs can be written by other
      tasks/vCPUs outside of mmu_lock.  If a vCPU uses the to-be-modified SPTE
      to write a page, the CPU can cache the translation as WRITABLE in the TLB
      despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
      Accessed/Dirty bits and not properly tag the backing page.
      
      Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
      non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
      KVM doesn't consume the Accessed bit of non-leaf SPTEs.
      
      Dropping the Dirty and/or Writable bits is most problematic for dirty
      logging, as doing so can result in a missed TLB flush and eventually a
      missed dirty page.  In the unlikely event that the only dirty page(s) is
      a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
      (based on the Dirty or Writable bit depending on the method) and so not
      update the SPTE and ultimately not flush.  If the SPTE is cached in the
      TLB as writable before it is clobbered, the guest can continue writing
      the associated page without ever taking a write-protect fault.
      
      For most (all?) file-backed memory, dropping the Dirty bit is a non-issue.
      The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
      bit is effectively ignored because the primary MMU will mark that page
      dirty when the write-protection is lifted, e.g. when KVM faults the page
      back in for write.
      
      The Accessed bit is a complete non-issue.  Aside from being unused for
      non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
      Accessed bit may be dropped anyways.
      
      Lastly, the Writable bit is also problematic as an extension of the Dirty
      bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
      !DIRTY && WRITABLE.  If KVM fixes an MMU-writable, but !WRITABLE, SPTE
      out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
      the SPTE being !WRITABLE when it is checked by KVM.  But that all depends
      on the Dirty bit being problematic in the first place.
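      A hedged, standalone sketch of the rule: if hardware (or another task) can
      set bits in the SPTE behind KVM's back, publish the new value with an
      atomic exchange so those concurrent updates show up in the returned old
      value; otherwise a plain store suffices:
      
          #include <stdatomic.h>
          #include <stdbool.h>
          #include <stdint.h>
          
          static uint64_t write_spte_sketch(_Atomic uint64_t *sptep, uint64_t old_spte,
                                            uint64_t new_spte, bool volatile_bits)
          {
              if (!volatile_bits) {
                  /* Nothing else writes this SPTE, e.g. a non-leaf entry. */
                  atomic_store_explicit(sptep, new_spte, memory_order_relaxed);
                  return old_spte;
              }
          
              /* XCHG returns what was actually in memory, so a Dirty/Accessed bit
               * set between the read of old_spte and this write is not lost. */
              return atomic_exchange_explicit(sptep, new_spte, memory_order_relaxed);
          }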
      
      Fixes: 2f2fad08 ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Venkatesh Srinivas <venkateshs@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ba3a6120
  7. 30 April 2022, 3 commits
    • KVM: x86/mmu: replace shadow_root_level with root_role.level · a972e29c
      Committed by Paolo Bonzini
      root_role.level is always the same value as shadow_root_level:
      
      - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
      
      - it's the level argument when going through kvm_init_shadow_ept_mmu
      
      - it's assigned directly from new_role.base.level when going
        through shadow_mmu_init_context
      
      Remove the duplication and get the level directly from the role.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a972e29c
    • KVM: x86/mmu: remove extended bits from mmu_role, rename field · 7a458f0e
      Committed by Paolo Bonzini
      mmu_role represents the role of the root of the page tables.
      It does not need any extended bits, as those govern only KVM's
      page table walking; the is_* functions used for page table
      walking always use the CPU role.
      
      ext.valid is not present anymore in the MMU role, but an
      all-zero MMU role is impossible because the level field is
      never zero in the MMU role.  So just zap the whole mmu_role
      in order to force invalidation after CPUID is updated.
      
      While making this change, which requires touching almost every
      occurrence of "mmu_role", rename it to "root_role".
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7a458f0e
    • KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR · 86931ff7
      Committed by Sean Christopherson
      Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
      MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
      The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
      guest, possibly with help from userspace, manages to coerce KVM into
      creating a SPTE for an "impossible" gfn, KVM will leak the associated
      shadow pages (page tables):
      
        WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                      kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2d0 [kvm]
         kvm_vm_release+0x1d/0x30 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5b/0x90
         exit_to_user_mode_prepare+0xd2/0xe0
         syscall_exit_to_user_mode+0x1d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      On bare metal, encountering an impossible gpa in the page fault path is
      well and truly impossible, barring CPU bugs, as the CPU will signal #PF
      during the gva=>gpa translation (or a similar failure when stuffing a
      physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
      itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
      of the underlying hardware, in which case the hardware will not fault on
      the illegal-from-KVM's-perspective gpa.
      
      Alternatively, KVM could continue allowing the dodgy behavior and simply
      zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
      a (minor) waste of cycles, and more importantly, KVM can't reasonably
      support impossible memslots when running on bare metal (or with an
      accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
      KVM is running as a guest is not a safe option as the host isn't required
      to announce itself to the guest in any way, e.g. doesn't need to set the
      HYPERVISOR CPUID bit.
      
      A second alternative to disallowing the memslot behavior would be to
      disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
      restriction is undesirable as there are legitimate use cases for doing
      so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
      systems so that VMs can be migrated between hosts with different
      MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
      
      Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
      even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
      any reserved physical address bits).
      
      The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
      The memslot and TDP MMU code want an exclusive value, but the name
      implies the returned value is inclusive, and the MMIO path needs an
      inclusive check.
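      A hedged sketch of an inclusive "max gfn" derived from the host's physical
      address width; 'host_maxphyaddr' stands in for however KVM caches
      MAXPHYADDR, and 4KiB pages are assumed:
      
          #include <stdint.h>
          
          #define PAGE_SHIFT_SKETCH 12
          
          static uint64_t mmu_max_gfn_sketch(unsigned int host_maxphyaddr)
          {
              /* Inclusive: the largest gfn KVM will ever create a SPTE for. */
              return (UINT64_C(1) << (host_maxphyaddr - PAGE_SHIFT_SKETCH)) - 1;
          }
          
          /* Callers that want an exclusive bound (memslots, TDP MMU zapping) add 1;
           * the MMIO path compares against the inclusive value directly. */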
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220428233416.2446833-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      86931ff7
  8. 05 April 2022, 1 commit
  9. 02 April 2022, 2 commits
    • KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap · f47e5bbb
      Committed by Sean Christopherson
      Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
      kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
      flush when processing multiple roots (including nested TDP shadow roots).
      Dropping the TLB flush resulted in random crashes when running Hyper-V
      Server 2019 in a guest with KSM enabled in the host (or any source of
      mmu_notifier invalidations, KSM is just the easiest to force).
      
      This effectively reverts commits 873dd122
      and fcb93eb6, and thus restores commit
      cf3e2642, plus this delta on top:
      
      bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
              struct kvm_mmu_page *root;
      
              for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
      -               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
      +               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
      
              return flush;
       }
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220325230348.2587437-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f47e5bbb
    • KVM: MMU: propagate alloc_workqueue failure · a1a39128
      Committed by Paolo Bonzini
      If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
      to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
      kvm_arch_init_vm also has to undo all the initialization, so
      group all the MMU initialization code at the beginning and
      handle cleaning up of kvm_page_track_init.
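      A hedged, kernel-style sketch of the propagation pattern (illustrative
      struct and function names, not KVM's): check the allocation and hand the
      failure back to the caller, which then unwinds earlier initialization:
      
          #include <linux/workqueue.h>
          #include <linux/errno.h>
          
          struct mmu_sketch {
              struct workqueue_struct *zap_wq;
          };
          
          static int mmu_init_vm_sketch(struct mmu_sketch *mmu)
          {
              mmu->zap_wq = alloc_workqueue("kvm-sketch", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
              if (!mmu->zap_wq)
                  return -ENOMEM;
              return 0;
          }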
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a1a39128
  10. 21 March 2022, 2 commits
  11. 08 March 2022, 22 commits
    • KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE · 396fd74d
      Committed by Sean Christopherson
      Disallow calling tdp_mmu_set_spte_atomic() with a REMOVED "old" SPTE.
      This solves a conundrum introduced by commit 3255530a ("KVM: x86/mmu:
      Automatically update iter->old_spte if cmpxchg fails"); if the helper
      doesn't update old_spte in the REMOVED case, then theoretically the
      caller could get stuck in an infinite loop as it will fail indefinitely
      on the REMOVED SPTE.  E.g. until recently, clear_dirty_gfn_range() didn't
      check for a present SPTE and would have spun until getting rescheduled.
      
      In practice, only the page fault path should "create" a new SPTE, all
      other paths should only operate on existing, a.k.a. shadow present,
      SPTEs.  Now that the page fault path pre-checks for a REMOVED SPTE in all
      cases, require all other paths to indirectly pre-check by verifying the
      target SPTE is a shadow-present SPTE.
      
      Note, this does not guarantee the actual SPTE isn't REMOVED, nor is that
      scenario disallowed.  The invariant is only that the caller mustn't
      invoke tdp_mmu_set_spte_atomic() if the SPTE was REMOVED when last
      observed by the caller.
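      A hedged, standalone sketch of the invariant (the REMOVED marker value and
      error-code choice below are placeholders, not KVM's definitions): refuse a
      REMOVED "old" SPTE up front, because a compare-and-exchange against it can
      never succeed and the caller could spin forever:
      
          #include <errno.h>
          #include <stdatomic.h>
          #include <stdint.h>
          
          #define REMOVED_SPTE_SKETCH 0xdeadULL   /* placeholder marker value */
          
          static int set_spte_atomic_sketch(_Atomic uint64_t *sptep, uint64_t *old_spte,
                                            uint64_t new_spte)
          {
              if (*old_spte == REMOVED_SPTE_SKETCH)
                  return -EBUSY;          /* and WARN in the real code */
          
              /* On failure, *old_spte is updated to the value actually observed. */
              if (!atomic_compare_exchange_strong(sptep, old_spte, new_spte))
                  return -EBUSY;
          
              return 0;
          }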
      
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-25-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      396fd74d
    • KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE · 58298b06
      Committed by Sean Christopherson
      Explicitly check for a REMOVED leaf SPTE prior to attempting to map
      the final SPTE when handling a TDP MMU fault.  Functionally, this is a
      nop as tdp_mmu_set_spte_atomic() will eventually detect the frozen SPTE.
      Pre-checking for a REMOVED SPTE is a minor optimization, but the real goal
      is to allow tdp_mmu_set_spte_atomic() to have an invariant that the "old"
      SPTE is never a REMOVED SPTE.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-24-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      58298b06
    • KVM: x86/mmu: Zap defunct roots via asynchronous worker · efd995da
      Committed by Paolo Bonzini
      Zap defunct roots, a.k.a. roots that have been invalidated after their
      last reference was initially dropped, asynchronously via the existing work
      queue instead of forcing the work upon the unfortunate task that happened
      to drop the last reference.
      
      If a vCPU task drops the last reference, the vCPU is effectively blocked
      by the host for the entire duration of the zap.  If the root being zapped
      happens to be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
      being active, the zap can take several hundred seconds.  Unsurprisingly,
      most guests are unhappy if a vCPU disappears for hundreds of seconds.
      
      E.g. running a synthetic selftest that triggers a vCPU root zap with
      ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
      Offloading the zap to a worker drops the block time to <100ms.
      
      There is an important nuance to this change.  If the same work item
      was queued twice before the work function has run, it would only
      execute once and one reference would be leaked.  Therefore, now that
      queueing and flushing items is no longer protected by kvm->slots_lock,
      kvm_tdp_mmu_invalidate_all_roots() has to check root->role.invalid and
      skip already invalid roots.  On the other hand, kvm_mmu_zap_all_fast()
      must return only after those skipped roots have been zapped as well.
      These two requirements can be satisfied only if _all_ places that
      change invalid to true now schedule the worker before releasing the
      mmu_lock.  There are just two, kvm_tdp_mmu_put_root() and
      kvm_tdp_mmu_invalidate_all_roots().
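      A hedged, kernel-style sketch of the offload (illustrative types and names,
      not KVM's): the task that drops the last reference only marks the root
      invalid and queues the zap; a workqueue worker does the long teardown:
      
          #include <linux/workqueue.h>
          #include <linux/types.h>
          
          struct root_sketch {
              struct work_struct zap_work;
              bool invalid;
          };
          
          static void zap_root_work_sketch(struct work_struct *work)
          {
              struct root_sketch *root =
                  container_of(work, struct root_sketch, zap_work);
          
              /* ... zap every SPTE under 'root', yielding as needed ... */
              (void)root;
          }
          
          static void put_root_sketch(struct workqueue_struct *zap_wq,
                                      struct root_sketch *root)
          {
              /* Flip 'invalid' and queue before releasing mmu_lock, and only from
               * the places that do the flip, so the same item is never queued twice. */
              root->invalid = true;
              INIT_WORK(&root->zap_work, zap_root_work_sketch);
              queue_work(zap_wq, &root->zap_work);
          }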
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-23-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      efd995da
    • KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls · 1b6043e8
      Committed by Sean Christopherson
      When zapping a TDP MMU root, perform the zap in two passes to avoid
      zapping an entire top-level SPTE while holding RCU, which can induce RCU
      stalls.  In the first pass, zap SPTEs at PG_LEVEL_1G, and then
      zap top-level entries in the second pass.
      
      With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
      SPTEs takes up to ~7 or so seconds (time varies based on kernel config,
      number of (v)CPUs, etc...).  With 5-level paging, that time can balloon
      well into hundreds of seconds.
      
      Before remote TLB flushes were omitted, the problem was even worse as
      waiting for all active vCPUs to respond to the IPI introduced significant
      overhead for VMs with large numbers of vCPUs.
      
      By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
      the amount of work that is done without dropping RCU protection is
      strictly bounded, with the worst case latency for a single operation
      being less than 100ms.
      
      Zapping at 1gb in the first pass is not arbitrary.  First and foremost,
      KVM relies on being able to zap 1gb shadow pages in a single shot when
      replacing a shadow page with a hugepage.  Zapping a 1gb shadow page
      that is fully populated with 4kb dirty SPTEs also triggers the worst case
      latency due to writing back the struct page accessed/dirty bits for each 4kb
      page, i.e. the two-pass approach is guaranteed to work so long as KVM can
      cleanly zap a 1gb shadow page.
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu:     52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
                                                softirq=15759/15759 fqs=5058
         (t=21016 jiffies g=66453 q=238577)
        NMI backtrace for cpu 52
        Call Trace:
         ...
         mark_page_accessed+0x266/0x2f0
         kvm_set_pfn_accessed+0x31/0x40
         handle_removed_tdp_mmu_page+0x259/0x2e0
         __handle_changed_spte+0x223/0x2c0
         handle_removed_tdp_mmu_page+0x1c1/0x2e0
         __handle_changed_spte+0x223/0x2c0
         handle_removed_tdp_mmu_page+0x1c1/0x2e0
         __handle_changed_spte+0x223/0x2c0
         zap_gfn_range+0x141/0x3b0
         kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
         kvm_mmu_zap_all_fast+0x121/0x190
         kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
         kvm_page_track_flush_slot+0x5c/0x80
         kvm_arch_flush_shadow_memslot+0xe/0x10
         kvm_set_memslot+0x172/0x4e0
         __kvm_set_memory_region+0x337/0x590
         kvm_vm_ioctl+0x49c/0xf80
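      A rough sketch of the two-pass structure described above; zap_level() here
      is a hypothetical callback, not a KVM function, that zaps and frees
      everything at or below the given level under one root:
      
          enum { LEVEL_4K_SKETCH = 1, LEVEL_2M_SKETCH = 2, LEVEL_1G_SKETCH = 3 };
          
          static void zap_root_two_pass_sketch(void *root, int root_level,
                                               void (*zap_level)(void *root, int level))
          {
              /* Pass 1: bound the work done under any one RCU read-side section
               * to roughly a 1GiB region's worth of SPTEs. */
              zap_level(root, LEVEL_1G_SKETCH);
          
              /* Pass 2: each top-level entry now has little left below it, so
               * zapping the root itself is cheap. */
              zap_level(root, root_level);
          }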
      Reported-by: David Matlack <dmatlack@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-22-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1b6043e8
    • KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root · 8351779c
      Committed by Paolo Bonzini
      Allow yielding when zapping SPTEs after the last reference to a valid
      root is put.  Because KVM must drop all SPTEs in response to relevant
      mmu_notifier events, mark defunct roots invalid and reset their refcount
      prior to zapping the root.  Keeping the refcount elevated while the zap
      is in-progress ensures the root is reachable via mmu_notifier until the
      zap completes and the last reference to the invalid, defunct root is put.
      
      Allowing kvm_tdp_mmu_put_root() to yield fixes soft lockup issues if the
      root being put has a massive paging structure, e.g. zapping a root
      that is backed entirely by 4kb pages for a guest with 32tb of memory can
      take hundreds of seconds to complete.
      
        watchdog: BUG: soft lockup - CPU#49 stuck for 485s! [max_guest_memor:52368]
        RIP: 0010:kvm_set_pfn_dirty+0x30/0x50 [kvm]
         __handle_changed_spte+0x1b2/0x2f0 [kvm]
         handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
         __handle_changed_spte+0x1f4/0x2f0 [kvm]
         handle_removed_tdp_mmu_page+0x1a7/0x2b8 [kvm]
         __handle_changed_spte+0x1f4/0x2f0 [kvm]
         tdp_mmu_zap_root+0x307/0x4d0 [kvm]
         kvm_tdp_mmu_put_root+0x7c/0xc0 [kvm]
         kvm_mmu_free_roots+0x22d/0x350 [kvm]
         kvm_mmu_reset_context+0x20/0x60 [kvm]
         kvm_arch_vcpu_ioctl_set_sregs+0x5a/0xc0 [kvm]
         kvm_vcpu_ioctl+0x5bd/0x710 [kvm]
         __se_sys_ioctl+0x77/0xc0
         __x64_sys_ioctl+0x1d/0x20
         do_syscall_64+0x44/0xa0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      KVM currently doesn't put a root from a non-preemptible context, so other
      than the mmu_notifier wrinkle, yielding when putting a root is safe.
      
      Yield-unfriendly iteration uses for_each_tdp_mmu_root(), which doesn't
      take a reference to each root (it requires mmu_lock be held for the
      entire duration of the walk).
      
      tdp_mmu_next_root() is used only by the yield-friendly iterator.
      
      tdp_mmu_zap_root_work() is explicitly yield friendly.
      
      kvm_mmu_free_roots() => mmu_free_root_page() is a much bigger fan-out,
      but is still yield-friendly in all call sites, as all callers can be
      traced back to some combination of vcpu_run(), kvm_destroy_vm(), and/or
      kvm_create_vm().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-21-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8351779c
    • KVM: x86/mmu: Zap invalidated roots via asynchronous worker · 22b94c4b
      Committed by Paolo Bonzini
      Use the system worker threads to zap the roots invalidated
      by the TDP MMU's "fast zap" mechanism, implemented by
      kvm_tdp_mmu_invalidate_all_roots().
      
      At this point, apart from allowing some parallelism in the zapping of
      roots, the workqueue is a glorified linked list: work items are added and
      flushed entirely within a single kvm->slots_lock critical section.  However,
      the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
      assumes that it owns a reference to all invalid roots; therefore, no
      one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
      invalidated roots on a linked list... erm, on a workqueue ensures that
      tdp_mmu_zap_root_work() only puts back those extra references that
      kvm_mmu_zap_all_invalidated_roots() had gifted to it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      22b94c4b
    • KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages · bb95dfb9
      Committed by Sean Christopherson
      Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
      of immediately flushing.  Because the shadow pages are freed in an RCU
      callback, so long as at least one CPU holds RCU, all CPUs are protected.
      For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
      needs to ensure the caller services the pending TLB flush before dropping
      its RCU protections.  I.e. use the caller's RCU as a proxy for all vCPUs
      running in the guest.
      
      Deferring the flushes allows batching flushes, e.g. when installing a
      1gb hugepage and zapping a pile of SPs.  And when zapping an entire root,
      deferring flushes allows skipping the flush entirely (because flushes are
      not needed in that case).
      
      Avoiding flushes when zapping an entire root is especially important as
      synchronizing with other CPUs via IPI after zapping every shadow page can
      cause significant performance issues for large VMs.  The issue is
      exacerbated by KVM zapping entire top-level entries without dropping
      RCU protection, which can lead to RCU stalls even when zapping roots
      backing relatively "small" amounts of guest memory, e.g. 2tb.  Removing
      the IPI bottleneck largely mitigates the RCU issues, though it's likely
      still a problem for 5-level paging.  A future patch will further address
      the problem by zapping roots in multiple passes to avoid holding RCU for
      an extended duration.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bb95dfb9
    • KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched · bd296779
      Committed by Sean Christopherson
      When yielding in the TDP MMU iterator, service any pending TLB flush
      before dropping RCU protections in anticipation of using the caller's RCU
      "lock" as a proxy for vCPUs in the guest.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-19-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bd296779
    • KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() · cf3e2642
      Committed by Sean Christopherson
      Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
      functions accordingly.  When removing mappings for functional correctness
      (except for the stupid VFIO GPU passthrough memslots bug), zapping the
      leaf SPTEs is sufficient as the paging structures themselves do not point
      at guest memory and do not directly impact the final translation (in the
      TDP MMU).
      
      Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
      the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
      kvm_unmap_gfn_range().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cf3e2642
    • KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range · acbda82a
      Committed by Sean Christopherson
      Now that all callers of zap_gfn_range() hold mmu_lock for write, drop
      support for zapping with mmu_lock held for read.  That all callers hold
      mmu_lock for write isn't a random coincidence; now that the paths that
      need to zap _everything_ have their own path, the only callers left are
      those that need to zap for functional correctness.  And when zapping is
      required for functional correctness, mmu_lock must be held for write,
      otherwise the caller has no guarantees about the state of the TDP MMU
      page tables after it has run, e.g. the SPTE(s) it zapped can be
      immediately replaced by a vCPU faulting in a page.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-17-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      acbda82a
    • KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page · e2b5b21d
      Committed by Sean Christopherson
      Add a dedicated helper for zapping a TDP MMU root, and use it in the three
      flows that do "zap_all" and intentionally do not do a TLB flush if SPTEs
      are zapped (zapping an entire root is safe if and only if it cannot be in
      use by any vCPU).  Because a TLB flush is never required, unconditionally
      pass "false" to tdp_mmu_iter_cond_resched() when potentially yielding.
      
      Opportunistically document why KVM must not yield when zapping roots that
      are being zapped by kvm_tdp_mmu_put_root(), i.e. roots whose refcount has
      reached zero, and further harden the flow to detect improper KVM behavior
      with respect to roots that are supposed to be unreachable.
      
      In addition to hardening zapping of roots, isolating zapping of roots
      will allow future simplification of zap_gfn_range() by having it zap only
      leaf SPTEs, and by removing its tricky "zap all" heuristic.  By having
      all paths that truly need to free _all_ SPs flow through the dedicated
      root zapper, the generic zapper can be freed of those concerns.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e2b5b21d
    • KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU · 77c8cd6b
      Committed by Sean Christopherson
      Don't flush the TLBs when zapping all TDP MMU pages, as the only time KVM
      uses the slow version of "zap everything" is when the VM is being
      destroyed or the owning mm has exited.  In either case, KVM_RUN is
      unreachable for the VM, i.e. the guest TLB entries cannot be consumed.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-15-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      77c8cd6b
    • KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery · c10743a1
      Committed by Sean Christopherson
      When recovering a potential hugepage that was shattered for the iTLB
      multihit workaround, precisely zap only the target page instead of
      iterating over the TDP MMU to find the SP that was passed in.  This will
      allow future simplification of zap_gfn_range() by having it zap only
      leaf SPTEs.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c10743a1
    • KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw values · 626808d1
      Committed by Sean Christopherson
      Refactor __tdp_mmu_set_spte() to work with raw values instead of a
      tdp_iter object so that a future patch can modify SPTEs without doing a
      walk, and without having to synthesize a tdp_iter.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-13-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      626808d1
    • KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path · 966da62a
      Committed by Sean Christopherson
      WARN if the new_spte being set by __tdp_mmu_set_spte() is a REMOVED_SPTE,
      which is called out by the comment as being disallowed but not actually
      checked.  Keep the WARN on the old_spte as well, because overwriting a
      REMOVED_SPTE in the non-atomic path is also disallowed (as evidenced by the
      lack of splats with the existing WARN).
      
      Fixes: 08f07c80 ("KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler")
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-12-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      966da62a
    • KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU · 0e587aa7
      Committed by Sean Christopherson
      Add helpers to read and write TDP MMU SPTEs instead of open coding
      rcu_dereference() all over the place, and to provide a convenient
      location to document why KVM doesn't exempt holding mmu_lock for write
      from having to hold RCU (and any future changes to the rules).
      
      No functional change intended.
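      A hedged, kernel-style sketch of the helpers' shape: one place that does
      the rcu_dereference(), and one place to document why even writers holding
      mmu_lock for write must be in an RCU read-side section.  The names and
      typedef are illustrative; the exact kernel signatures may differ:
      
          #include <linux/rcupdate.h>
          #include <linux/types.h>
          
          typedef u64 __rcu *tdp_ptep_sketch_t;
          
          static u64 tdp_mmu_read_spte_sketch(tdp_ptep_sketch_t sptep)
          {
              return READ_ONCE(*rcu_dereference(sptep));
          }
          
          static void tdp_mmu_write_spte_sketch(tdp_ptep_sketch_t sptep, u64 new_spte)
          {
              WRITE_ONCE(*rcu_dereference(sptep), new_spte);
          }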
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-11-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0e587aa7
    • KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks · a151acec
      Committed by Sean Christopherson
      Drop RCU protection after processing each root when handling MMU notifier
      hooks that aren't the "unmap" path, i.e. aren't zapping.  Temporarily
      drop RCU to let RCU do its thing between roots, and to make it clear that
      there's no special behavior that relies on holding RCU across all roots.
      
      Currently, the RCU protection is completely superficial, it's necessary
      only to make rcu_dereference() of SPTE pointers happy.  A future patch
      will rely on holding RCU as a proxy for vCPUs in the guest, e.g. to
      ensure shadow pages aren't freed before all vCPUs do a TLB flush (or
      rather, acknowledge the need for a flush), but in that case RCU needs to
      be held until the flush is complete if and only if the flush is needed
      because a shadow page may have been removed.  And except for the "unmap"
      path, MMU notifier events cannot remove SPs (don't toggle PRESENT bit,
      and can't change the PFN for a SP).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-10-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a151acec
    • KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte · 93fa50f6
      Committed by Sean Christopherson
      Batch TLB flushes (with other MMUs) when handling ->change_spte()
      notifications in the TDP MMU.  The MMU notifier path in question doesn't
      allow yielding and correctly flushes before dropping mmu_lock.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-9-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      93fa50f6
    • KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal · c8e5a0d0
      Committed by Sean Christopherson
      Look for a !leaf=>leaf conversion instead of a PFN change when checking
      if a SPTE change removed a TDP MMU shadow page.  Convert the PFN check
      into a WARN, as KVM should never change the PFN of a shadow page (except
      when it's being zapped or replaced).
      
      From a purely theoretical perspective, it's not illegal to replace a SP
      with a hugepage pointing at the same PFN.  In practice, it's impossible
      as that would require mapping guest memory overtop a kernel-allocated SP.
      Either way, the check is odd.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-8-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c8e5a0d0
    • KVM: x86/mmu: do not allow readers to acquire references to invalid roots · 614f6970
      Committed by Paolo Bonzini
      Remove the "shared" argument of for_each_tdp_mmu_root_yield_safe, thus ensuring
      that readers do not ever acquire a reference to an invalid root.  After this
      patch, all readers except kvm_tdp_mmu_zap_invalidated_roots() treat
      refcount=0/valid, refcount=0/invalid and refcount=1/invalid in exactly the
      same way.  kvm_tdp_mmu_zap_invalidated_roots() is different but it also
      does not acquire a reference to the invalid root, and it cannot see
      refcount=0/invalid because it is guaranteed to run after
      kvm_tdp_mmu_invalidate_all_roots().
      
      Opportunistically add a lockdep assertion to the yield-safe iterator.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      614f6970
    • KVM: x86/mmu: only perform eager page splitting on valid roots · 7c554d8e
      Committed by Paolo Bonzini
      Eager page splitting is an optimization; it does not have to be performed on
      invalid roots.  It is also the only case in which a reader might acquire
      a reference to an invalid root, so after this change we know that readers
      will skip both dying and invalid roots.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c554d8e
    • KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter · 226b8c8f
      Committed by Sean Christopherson
      Assert that mmu_lock is held for write by users of the yield-unfriendly
      TDP iterator.  The nature of a shared walk means that the caller needs to
      play nice with other tasks modifying the page tables, which is more or
      less the same thing as playing nice with yielding.  Theoretically, KVM
      could gain a flow where it could legitimately take mmu_lock for read in
      a non-preemptible context, but that's highly unlikely and any such case
      should be viewed with a fair amount of scrutiny.
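      A hedged, kernel-style sketch of the assertion (the struct is illustrative;
      lockdep_assert_held_write() is a real kernel API, and x86 KVM's mmu_lock is
      an rwlock):
      
          #include <linux/lockdep.h>
          #include <linux/spinlock.h>
          #include <linux/types.h>
          
          struct mmu_lock_sketch {
              rwlock_t mmu_lock;
          };
          
          /* Called from the yield-unfriendly iterator so lockdep complains if a
           * walk is attempted without mmu_lock held for write. */
          static bool assert_mmu_lock_held_for_write_sketch(struct mmu_lock_sketch *kvm)
          {
              lockdep_assert_held_write(&kvm->mmu_lock);
              return true;
          }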
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      226b8c8f