1. 30 April 2022 (1 commit)
    • KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR · 86931ff7
      Sean Christopherson authored
      Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
      MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
      The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
      guest, possibly with help from userspace, manages to coerce KVM into
      creating a SPTE for an "impossible" gfn, KVM will leak the associated
      shadow pages (page tables):
      
        WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                      kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2d0 [kvm]
         kvm_vm_release+0x1d/0x30 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5b/0x90
         exit_to_user_mode_prepare+0xd2/0xe0
         syscall_exit_to_user_mode+0x1d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      On bare metal, encountering an impossible gpa in the page fault path is
      well and truly impossible, barring CPU bugs, as the CPU will signal #PF
      during the gva=>gpa translation (or a similar failure when stuffing a
      physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
      itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
      of the underlying hardware, in which case the hardware will not fault on
      the illegal-from-KVM's-perspective gpa.
      
      Alternatively, KVM could continue allowing the dodgy behavior and simply
      zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
      a (minor) waste of cycles, and more importantly, KVM can't reasonably
      support impossible memslots when running on bare metal (or with an
      accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
      KVM is running as a guest is not a safe option as the host isn't required
      to announce itself to the guest in any way, e.g. doesn't need to set the
      HYPERVISOR CPUID bit.
      
      A second alternative to disallowing the memslot behavior would be to
      disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
      restriction is undesirable as there are legitimate use cases for doing
      so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
      systems so that VMs can be migrated between hosts with different
      MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
      
      Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
      even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
      any reserved physical address bits).
      
      The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
      The memslot and TDP MMU code want an exclusive value, but the name
      implies the returned value is inclusive, and the MMIO path needs an
      inclusive check.
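      
      As a rough illustration of the two checks this implies (a minimal sketch,
      not the kernel's exact code; the host_maxphyaddr parameter stands in for
      KVM's shadow_phys_bits plumbing):
      
        /* Highest addressable gfn derived from host.MAXPHYADDR; inclusive. */
        static inline u64 max_host_gfn(unsigned int host_maxphyaddr)
        {
                return (1ULL << (host_maxphyaddr - 12)) - 1;  /* 12 == PAGE_SHIFT */
        }
      
        /* Memslot creation: reject slots that reach past the last possible gfn. */
        static inline bool memslot_gfns_are_possible(u64 base_gfn, u64 npages,
                                                     unsigned int host_maxphyaddr)
        {
                return base_gfn + npages - 1 <= max_host_gfn(host_maxphyaddr);
        }
      
      The MMIO path performs the equivalent inclusive comparison (gfn <= max gfn)
      before creating an MMIO SPTE.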
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220428233416.2446833-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 05 April 2022 (1 commit)
    • KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 1d0e8480
      Sean Christopherson authored
      Resolve nx_huge_pages to true/false when kvm.ko is loaded; leaving it as
      -1 is technically undefined behavior when its value is read out by
      param_get_bool(), as boolean values are supposed to be '0' or '1'.
      
      Alternatively, KVM could define a custom getter for the param, but the
      auto value doesn't depend on the vendor module in any way, and printing
      "auto" would be unnecessarily unfriendly to the user.
      
      In addition to fixing the undefined behavior, resolving the auto value
      also fixes the scenario where the auto value resolves to N and no vendor
      module is loaded.  Previously, -1 would result in Y being printed even
      though KVM would ultimately disable the mitigation.
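      
      A minimal sketch of the approach (the detection helper below is
      hypothetical and stands in for KVM's ITLB multihit check):
      
        static int nx_huge_pages = -1;  /* -1 == "auto" until kvm.ko loads */
      
        static bool cpu_mitigation_needed(void)  /* hypothetical helper */
        {
                return true;
        }
      
        /* Runs at kvm.ko load time, before any sysfs read of the param. */
        static void resolve_nx_huge_pages_auto(void)
        {
                if (nx_huge_pages == -1)
                        nx_huge_pages = cpu_mitigation_needed() ? 1 : 0;
        }
      
      With the value resolved up front, param_get_bool() only ever observes 0
      or 1, and reading the parameter reports the real decision.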
      
      Rename the existing MMU module init/exit helpers to clarify that they're
      invoked with respect to the vendor module, and add comments to document
      why KVM has two separate "module init" flows.
      
        =========================================================================
        UBSAN: invalid-load in kernel/params.c:320:33
        load of value 255 is not a valid value for type '_Bool'
        CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         ubsan_epilogue+0x5/0x40
         __ubsan_handle_load_invalid_value.cold+0x43/0x48
         param_get_bool.cold+0xf/0x14
         param_attr_show+0x55/0x80
         module_attr_show+0x1c/0x30
         sysfs_kf_seq_show+0x93/0xc0
         seq_read_iter+0x11c/0x450
         new_sync_read+0x11b/0x1a0
         vfs_read+0xf0/0x190
         ksys_read+0x5f/0xe0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        =========================================================================
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Cc: stable@vger.kernel.org
      Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220331221359.3912754-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 02 April 2022 (6 commits)
    • KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set · 5959ff4a
      Maxim Levitsky authored
      It makes more sense to print the new SPTE value than the old value.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Handle implicit supervisor access with SMAP · 4f4aa80e
      Lai Jiangshan authored
      There are two kinds of implicit supervisor access:
      	implicit supervisor access when CPL = 3
      	implicit supervisor access when CPL < 3
      
      Current permission_fault() handles only the first kind for SMAP.
      
      But if the access is implicit and SMAP is on, data may not be read from
      nor written to any user-mode address, regardless of the current CPL.
      
      So the second kind should also be supported.
      
      The first kind can be detected via the CPL and the access mode: if it is
      a supervisor access and CPL = 3, it must be an implicit supervisor access.
      
      But it is not possible to detect the second kind without extra
      information, so this patch adds an artificial PFERR_IMPLICIT_ACCESS bit
      to @access.  This extra information also works for the first kind, so
      the logic is changed to use it for both cases.
      
      The value of PFERR_IMPLICIT_ACCESS is deliberately chosen to be bit 48,
      which is in the most significant 16 bits of the u64 and is less likely to
      be forced to change by future hardware use of the error code.
      
      This patch also removes the call to ->get_cpl(), since the access mode is
      now determined by @access.  Not only does this save a function call, it
      also removes confusion when permissions are checked for nested TDP: nested
      TDP shouldn't be subject to SMAP checks, and L2's CPL has no bearing on it
      either.  The original code works only because the walk is always a user
      walk for NPT, and the SMAP fault bit is never set for EPT in
      update_permission_bitmask().
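      
      A simplified, kernel-style sketch of the resulting SMAP decision (not the
      exact permission_fault() code; the bit value mirrors the artificial flag
      described above):
      
        #define PFERR_USER_MASK        (1ULL << 2)
        #define PFERR_IMPLICIT_ACCESS  (1ULL << 48)  /* artificial, never set by hardware */
      
        /* True if SMAP forbids this supervisor access to a user-mode address. */
        static bool smap_blocks_access(u64 access, bool cr4_smap, bool eflags_ac)
        {
                if (!cr4_smap || (access & PFERR_USER_MASK))
                        return false;  /* SMAP only restricts supervisor accesses */
      
                /* Implicit supervisor accesses ignore EFLAGS.AC entirely. */
                if (access & PFERR_IMPLICIT_ACCESS)
                        return true;
      
                /* Explicit supervisor accesses are blocked only when AC = 0. */
                return !eflags_ac;
        }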
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Fix comments in update_permission_bitmask · 94b4a2f1
      Lai Jiangshan authored
      Commit 09f037aa ("KVM: MMU: speedup update_permission_bitmask")
      refactored the code of update_permission_bitmask() and changed the
      comments.  It added a condition to a list to match the new code, so the
      numbering/order of the conditions in the comments should be updated too.
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-3-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Change the type of access u32 to u64 · 5b22bbe7
      Lai Jiangshan authored
      Change the type of @access from u32 to u64 for FNAME(walk_addr) and
      ->gva_to_gpa().
      
      The kinds of accesses are usually combinations of UWX, and VMX/SVM's
      nested paging adds a new factor: is the access for a guest page table or
      for the final guest physical address?
      
      SMAP adds yet another factor for supervisor accesses: explicit or implicit.
      
      So it is better for @access in FNAME(walk_addr) and ->gva_to_gpa() to
      carry all of this information for the walk.
      
      Although a u32 @access has enough bits to encode all the kinds, this
      patch extends it to u64 (see the sketch after this list):
      	o Extra bits will be in the higher 32 bits, so that we can
      	  easily obtain the traditional access mode (UWX) by converting
      	  it to u32.
      	o Reuse the values for the access kinds defined by SVM's nested
      	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
      	  @error_code in kvm_handle_page_fault().
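      
      A sketch of the resulting layout; the PFERR_* values shown follow the x86
      page-fault error-code bits KVM already defines, and the helper is purely
      illustrative:
      
        #define PFERR_WRITE_MASK        (1ULL << 1)   /* W */
        #define PFERR_USER_MASK         (1ULL << 2)   /* U */
        #define PFERR_FETCH_MASK        (1ULL << 4)   /* X */
        /* SVM nested-paging bits, reused verbatim in the upper 32 bits. */
        #define PFERR_GUEST_FINAL_MASK  (1ULL << 32)
        #define PFERR_GUEST_PAGE_MASK   (1ULL << 33)
      
        static u32 traditional_uwx(u64 access)
        {
                return (u32)access;   /* the extra factors live above bit 31 */
        }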
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap · f47e5bbb
      Sean Christopherson authored
      Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
      kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
      flush when processing multiple roots (including nested TDP shadow roots).
      Dropping the TLB flush resulted in random crashes when running Hyper-V
      Server 2019 in a guest with KSM enabled in the host (or any source of
      mmu_notifier invalidations, KSM is just the easiest to force).
      
      This effectively reverts commits 873dd122
      and fcb93eb6, and thus restores commit
      cf3e2642, plus this delta on top:
      
      bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
              struct kvm_mmu_page *root;
      
              for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
      -               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
      +               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
      
              return flush;
       }
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220325230348.2587437-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: propagate alloc_workqueue failure · a1a39128
      Paolo Bonzini authored
      If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
      to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
      kvm_arch_init_vm also has to undo all the initialization, so
      group all the MMU initialization code at the beginning and
      handle cleaning up of kvm_page_track_init.
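      
      The error-propagation pattern boils down to the sketch below (the
      workqueue flags are illustrative):
      
        int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
        {
                struct workqueue_struct *wq;
      
                wq = alloc_workqueue("kvm", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
                if (!wq)
                        return -ENOMEM;
      
                kvm->arch.tdp_mmu_zap_wq = wq;
                return 0;  /* kvm_mmu_init_vm and kvm_arch_init_vm pass this up */
        }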
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 21 March 2022 (1 commit)
  5. 08 March 2022 (6 commits)
    • KVM: x86/mmu: Zap invalidated roots via asynchronous worker · 22b94c4b
      Paolo Bonzini authored
      Use the system worker threads to zap the roots invalidated
      by the TDP MMU's "fast zap" mechanism, implemented by
      kvm_tdp_mmu_invalidate_all_roots().
      
      At this point, apart from allowing some parallelism in the zapping of
      roots, the workqueue is a glorified linked list: work items are added and
      flushed entirely within a single kvm->slots_lock critical section.  However,
      the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
      assumes that it owns a reference to all invalid roots; therefore, no
      one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
      invalidated roots on a linked list... erm, on a workqueue ensures that
      tdp_mmu_zap_root_work() only puts back those extra references that
      kvm_mmu_zap_all_invalidated_roots() had gifted to it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages · bb95dfb9
      Sean Christopherson authored
      Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
      of immediately flushing.  Because the shadow pages are freed in an RCU
      callback, so long as at least one CPU holds RCU, all CPUs are protected.
      For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
      needs to ensure the caller services the pending TLB flush before dropping
      its RCU protections.  I.e. use the caller's RCU as a proxy for all vCPUs
      running in the guest.
      
      Deferring the flushes allows batching flushes, e.g. when installing a
      1gb hugepage and zapping a pile of SPs.  And when zapping an entire root,
      deferring flushes allows skipping the flush entirely (because flushes are
      not needed in that case).
      
      Avoiding flushes when zapping an entire root is especially important as
      synchronizing with other CPUs via IPI after zapping every shadow page can
      cause significant performance issues for large VMs.  The issue is
      exacerbated by KVM zapping entire top-level entries without dropping
      RCU protection, which can lead to RCU stalls even when zapping roots
      backing relatively "small" amounts of guest memory, e.g. 2tb.  Removing
      the IPI bottleneck largely mitigates the RCU issues, though it's likely
      still a problem for 5-level paging.  A future patch will further address
      the problem by zapping roots in multiple passes to avoid holding RCU for
      an extended duration.
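      
      The contract, sketched with a hypothetical zap helper: the walk records
      that a flush is pending, and the caller issues a single batched flush
      before dropping its RCU read-side protection:
      
        static void zap_range_batched(struct kvm *kvm, gfn_t start, gfn_t end)
        {
                bool flush = false;
                gfn_t gfn;
      
                rcu_read_lock();
                for (gfn = start; gfn < end; gfn++)
                        flush |= zap_one_leaf_spte(kvm, gfn);  /* hypothetical */
      
                if (flush)
                        kvm_flush_remote_tlbs(kvm);  /* one flush for the whole batch */
                rcu_read_unlock();
        }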
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() · cf3e2642
      Sean Christopherson authored
      Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
      functions accordingly.  When removing mappings for functional correctness
      (except for the stupid VFIO GPU passthrough memslots bug), zapping the
      leaf SPTEs is sufficient as the paging structures themselves do not point
      at guest memory and do not directly impact the final translation (in the
      TDP MMU).
      
      Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
      the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
      kvm_unmap_gfn_range().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush · 7ae5840e
      Sean Christopherson authored
      Remove the misleading flush "handling" when zapping invalidated TDP MMU
      roots, and document that flushing is unnecessary for all flavors of MMUs
      when zapping invalid/obsolete roots/pages.  The "handling" in the TDP MMU
      is dead code, as zap_gfn_range() is called with shared=true, in which
      case it will never return true due to the flushing being handled by
      tdp_mmu_zap_spte_atomic().
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic · db01416b
      Sean Christopherson authored
      Explicitly ignore the result of zap_gfn_range() when putting the last
      reference to a TDP MMU root, and add a pile of comments to formalize the
      TDP MMU's behavior of deferring TLB flushes to alloc/reuse.  Note, this
      only affects the !shared case, as zap_gfn_range() subtly never returns
      true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().
      
      Putting the root without a flush is ok because even if there are stale
      references to the root in the TLB, they are unreachable because KVM will
      not run the guest with the same ASID without first flushing (where ASID
      in this context refers to both SVM's explicit ASID and Intel's implicit
      ASID that is constructed from VPID+PCID+EPT4A+etc...).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-5-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap · f28e9c7f
      Sean Christopherson authored
      Fix misleading and arguably wrong comments in the TDP MMU's fast zap
      flow.  The comments, and the fact that actually zapping invalid roots was
      added separately, strongly suggests that zapping invalid roots is an
      optimization and not required for correctness.  That is a lie.
      
      KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(),
      because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(),
      KVM is relying on it to fully remove all references to the memslot.  Once
      the memslot is gone, KVM's mmu_notifier hooks will be unable to find the
      stale references as the hva=>gfn translation is done via the memslots.
      If KVM doesn't immediately zap SPTEs and userspace unmaps a range after
      deleting a memslot, KVM will fail to zap in response to the mmu_notifier
      due to not finding a memslot corresponding to the notifier's range, which
      leads to a variation of use-after-free.
      
      The other misleading comment (and code) explicitly states that roots
      without a reference should be skipped.  While that's technically true,
      it's also extremely misleading as it should be impossible for KVM to
      encounter a defunct root on the list while holding mmu_lock for write.
      Opportunistically add a WARN to enforce that invariant.
      
      Fixes: b7cccd39 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
      Fixes: 4c6654bd ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 02 March 2022 (1 commit)
  7. 01 March 2022 (3 commits)
    • KVM: WARN if is_unsync_root() is called on a root without a shadow page · 5d6a3221
      Sean Christopherson authored
      WARN and bail if is_unsync_root() is passed a root for which there is no
      shadow page, i.e. is passed the physical address of one of the special
      roots, which do not have an associated shadow page.  The current usage
      squeaks by without bug reports because neither kvm_mmu_sync_roots() nor
      kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root,
      and 5-level AMD CPUs are not generally available, i.e. no one can coerce
      KVM into calling is_unsync_root() on pml5_root.
      
      Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully)
      prevents KVM from crashing.
      
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220225182248.3812651-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped · 527d5cd7
      Sean Christopherson authored
      Zap only obsolete roots when responding to zapping a single root shadow
      page.  Because KVM keeps root_count elevated when stuffing a previous
      root into its PGD cache, shadowing a 64-bit guest means that zapping any
      root causes all vCPUs to reload all roots, even if their current root is
      not affected by the zap.
      
      For many kernels, zapping a single root is a frequent operation, e.g. in
      Linux it happens whenever an mm is dropped, e.g. process exits, etc...
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220225182248.3812651-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop kvm_reload_remote_mmus(), open code request in x86 users · 2f6f66cc
      Sean Christopherson authored
      Remove the generic kvm_reload_remote_mmus() and open code its
      functionality into the two x86 callers.  x86 is (obviously) the only
      architecture that uses the hook, and is also the only architecture that
      uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name.  That
      will change in a future patch: when zapping a single shadow page, x86
      doesn't actually _need_ to reload all vCPUs' MMUs, only the MMUs whose
      root is being zapped need to be reloaded.
      
      s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose.
      
      Drop the generic code in anticipation of implementing s390 and x86 arch
      specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely.
      
      Opportunistically reword the x86 TDP MMU comment to avoid making
      references to functions (and requests!) when possible, and to remove the
      rather ambiguous "this".
      
      No functional change intended.
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220225182248.3812651-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 25 February 2022 (9 commits)
    • KVM: x86/mmu: clear MMIO cache when unloading the MMU · 6d58f275
      Paolo Bonzini authored
      For cleanliness, do not leave a stale GVA in the cache after all the roots are
      cleared.  In practice, kvm_mmu_load will go through kvm_mmu_sync_roots if
      paging is on, and will not use vcpu_match_mmio_gva at all if paging is off.
      However, leaving data in the cache might cause bugs in the future.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Always use current mmu's role when loading new PGD · d2e5f333
      Paolo Bonzini authored
      Since the guest PGD is now loaded after the MMU has been set up
      completely, the desired role for a cache hit is simply the current
      mmu_role.  There is no need to compute it again, so __kvm_mmu_new_pgd
      can be folded in kvm_mmu_new_pgd.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: load new PGD after the shadow MMU is initialized · 3cffc89d
      Paolo Bonzini authored
      Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
      shadow_root_level anymore, pull the PGD load after the initialization of
      the shadow MMUs.
      
      Besides being more intuitive, this enables future simplifications
      and optimizations because it's not necessary anymore to compute the
      role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
      attempting to use a cached PGD to avoid having to figure out the new role.
      With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing,
      and avoid unloading all the cached roots.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: look for a cached PGD when going from 32-bit to 64-bit · 5499ea73
      Paolo Bonzini authored
      Right now, PGD caching avoids placing a PAE root in the cache by using the
      old value of mmu->root_level and mmu->shadow_root_level; it does not look
      for a cached PGD if the old root is a PAE one, and then frees it using
      kvm_mmu_free_roots.
      
      Change the logic instead to free the uncacheable root early.
      This way, __kvm_new_mmu_pgd is able to look up the cache when going from
      32-bit to 64-bit (if there is a hit, the invalid root becomes the least
      recently used).  An example of this is nested virtualization with shadow
      paging, when a 64-bit L1 runs a 32-bit L2.
      
      As a side effect (which is actually the reason why this patch was
      written), PGD caching does not use the old value of mmu->root_level
      and mmu->shadow_root_level anymore.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: do not pass vcpu to root freeing functions · 0c1c92f1
      Paolo Bonzini authored
      These functions only operate on a given MMU, of which there is more
      than one in a vCPU (we care about two, because the third does not have
      any roots and is only used to walk guest page tables).  They do need a
      struct kvm in order to lock the mmu_lock, but they do not need anything
      else in the struct kvm_vcpu.  So, pass the vcpu->kvm directly to them.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: do not consult levels when freeing roots · 594bef79
      Paolo Bonzini authored
      Right now, PGD caching requires a complicated dance of first computing
      the MMU role and passing it to __kvm_mmu_new_pgd(), and then separately calling
      kvm_init_mmu().
      
      Part of this is due to kvm_mmu_free_roots using mmu->root_level and
      mmu->shadow_root_level to distinguish whether the page table uses a single
      root or 4 PAE roots.  Because kvm_init_mmu() can overwrite mmu->root_level,
      kvm_mmu_free_roots() must be called before kvm_init_mmu().
      
      However, even after kvm_init_mmu() there is a way to detect whether the
      page table may hold PAE roots, as root.hpa isn't backed by a shadow when
      it points at PAE roots.  Using this method results in simpler code, and
      is one less obstacle in moving all calls to __kvm_mmu_new_pgd() after the
      MMU has been initialized.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: use struct kvm_mmu_root_info for mmu->root · b9e5603c
      Paolo Bonzini authored
      The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
      Use the struct to have more consistency between mmu->root and
      mmu->prev_roots.
      
      The patch is entirely search and replace except for cached_root_available,
      which does not need a temporary struct kvm_mmu_root_info anymore.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: avoid NULL-pointer dereference on page freeing bugs · 9191b8f0
      Paolo Bonzini authored
      WARN and bail if KVM attempts to free a root that isn't backed by a shadow
      page.  KVM allocates a bare page for "special" roots, e.g. when using PAE
      paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa
      will be valid but won't be backed by a shadow page.  It's all too easy to
      blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of
      crashing KVM and possibly the kernel.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: make apf token non-zero to fix bug · 6f3c1fc5
      Liang Zhang authored
      In the current async page fault logic, when a page is ready, KVM relies on
      kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
      a READY event to the guest.  This function tests the token value of struct
      kvm_vcpu_pv_apf_data, which must be reset to zero by the guest kernel when
      a READY event has been handled.  A value of zero means the previous READY
      event is done, so KVM can deliver another one.
      But kvm_arch_setup_async_pf() may produce a valid token whose value is
      zero, which is then indistinguishable from "no event pending" and may lead
      to the loss of this READY event.
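      
      The token is built from an incrementing per-vCPU id and the vcpu_id; a
      sketch of the idea behind the fix (simplified, not the exact code in
      kvm_arch_setup_async_pf()) is to never hand out a zero token:
      
        /* Token 0 is reserved: the guest uses it to mean "no READY pending". */
        static u32 make_apf_token(u32 *apf_id, u32 vcpu_id)
        {
                u32 token = ((*apf_id)++ << 12) | vcpu_id;
      
                if (!token)  /* id wrapped such that the whole token became 0 */
                        token = ((*apf_id)++ << 12) | vcpu_id;
                return token;
        }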
      
      This bug may cause a task to block forever in the guest:
       INFO: task stress:7532 blocked for more than 1254 seconds.
             Not tainted 5.10.0 #16
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:stress          state:D stack:    0 pid: 7532 ppid:  1409
       flags:0x00000080
       Call Trace:
        __schedule+0x1e7/0x650
        schedule+0x46/0xb0
        kvm_async_pf_task_wait_schedule+0xad/0xe0
        ? exit_to_user_mode_prepare+0x60/0x70
        __kvm_handle_async_pf+0x4f/0xb0
        ? asm_exc_page_fault+0x8/0x30
        exc_page_fault+0x6f/0x110
        ? asm_exc_page_fault+0x8/0x30
        asm_exc_page_fault+0x1e/0x30
       RIP: 0033:0x402d00
       RSP: 002b:00007ffd31912500 EFLAGS: 00010206
       RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
       RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
       RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
       R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
       R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000
      Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
      Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 19 February 2022 (1 commit)
  10. 11 February 2022 (11 commits)
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG · cb00a70b
      David Matlack authored
      When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
      write-protected when dirty logging is enabled on the memslot. Instead
      they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
      the first time and only for the specific sub-region being cleared.
      
      Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
      write-protecting to avoid causing write-protection faults on vCPU
      threads. This also allows userspace to smear the cost of huge page
      splitting across multiple ioctls, rather than splitting the entire
      memslot as is the case when initially-all-set is not used.
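      
      From userspace, the ioctl that now triggers splitting is used roughly as
      follows (sketch using the definitions from <linux/kvm.h>; the slot id,
      range and bitmap are up to the VMM):
      
        struct kvm_clear_dirty_log clear = {
                .slot         = slot_id,           /* memslot to operate on        */
                .first_page   = 0,                 /* offset within the slot       */
                .num_pages    = 64,                /* sub-region being cleared     */
                .dirty_bitmap = bitmap,            /* bits to clear and re-protect */
        };
      
        if (ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear) < 0)
                perror("KVM_CLEAR_DIRTY_LOG");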
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled · a3fe5dbd
      David Matlack authored
      When dirty logging is enabled without initially-all-set, try to split
      all huge pages in the memslot down to 4KB pages so that vCPUs do not
      have to take expensive write-protection faults to split huge pages.
      
      Eager page splitting is best-effort only. This commit only adds the
      support for the TDP MMU, and even there splitting may fail due to out
      of memory conditions. Failure to split a huge page is fine from a
      correctness standpoint because KVM will always follow up splitting by
      write-protecting any remaining huge pages.
      
      Eager page splitting moves the cost of splitting huge pages off of the
      vCPU threads and onto the thread enabling dirty logging on the memslot.
      This is useful because:
      
       1. Splitting on the vCPU thread interrupts vCPUs execution and is
          disruptive to customers whereas splitting on VM ioctl threads can
          run in parallel with vCPU execution.
      
       2. Splitting all huge pages at once is more efficient because it does
          not require performing VM-exit handling or walking the page table for
          every 4KiB page in the memslot, and greatly reduces the amount of
          contention on the mmu_lock.
      
      For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
      per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
      all of their memory after dirty logging is enabled decreased by 95% from
      2.94s to 0.14s.
      
      Eager Page Splitting is over 100x more efficient than the current
      implementation of splitting on fault under the read lock. For example,
      taking the same workload as above, Eager Page Splitting reduced the CPU
      required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
      * 96 vCPU threads) to only 1.55 CPU-seconds.
      
      Eager page splitting does increase the amount of time it takes to enable
      dirty logging since it has to split all huge pages. For example, the time
      it took to enable dirty logging in the 96GiB region of the
      aforementioned test increased from 0.001s to 1.55s.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Move restore_acc_track_spte() to spte.h · 315d86da
      David Matlack authored
      restore_acc_track_spte() is pure SPTE bit manipulation, making it a good
      fit for spte.h. And now that the WARN_ON_ONCE() calls have been removed,
      there isn't any good reason to not inline it.
      
      This move also prepares for a follow-up commit that will need to call
      restore_acc_track_spte() from spte.c.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-11-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Drop new_spte local variable from restore_acc_track_spte() · 77c23c77
      David Matlack authored
      The new_spte local variable is unnecessary. Deleting it can save a line
      of code and simplify the remaining lines a bit.
      
      No functional change intended.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-10-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Remove unnecessary warnings from restore_acc_track_spte() · 59940e76
      David Matlack authored
      The warnings in restore_acc_track_spte() can be removed because the only
      caller checks is_access_track_spte(), and is_access_track_spte() checks
      !spte_ad_enabled(). In other words, the warning can never be triggered.
      
      No functional change intended.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-9-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename __rmap_write_protect() to rmap_write_protect() · 1346bbb6
      David Matlack authored
      The function formerly known as rmap_write_protect() has been renamed to
      kvm_vcpu_write_protect_gfn(), so we can get rid of the double
      underscores in front of __rmap_write_protect().
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() · cf48f9e2
      David Matlack authored
      rmap_write_protect() is a poor name because it also write-protects SPTEs
      in the TDP MMU, not just SPTEs in the rmap. It is also confusing that
      rmap_write_protect() is not a simple wrapper around
      __rmap_write_protect(), since that is the common pattern for functions
      with double-underscore names.
      
      Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() to convey
      that KVM is write-protecting a specific gfn in the context of a vCPU.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-2-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Consolidate comments about {Host,MMU}-writable · 02844ac1
      David Matlack authored
      Consolidate the large comment above DEFAULT_SPTE_HOST_WRITABLE with the
      large comment above is_writable_pte() into one comment. This comment
      explains the different reasons why an SPTE may be non-writable and KVM
      keeps track of that with the {Host,MMU}-writable bits.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220125230723.1701061-1-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Rename DEFAULT_SPTE_MMU_WRITEABLE to DEFAULT_SPTE_MMU_WRITABLE · 1ca87e01
      David Matlack authored
      Both "writeable" and "writable" are valid, but we should be consistent
      about which we use. DEFAULT_SPTE_MMU_WRITEABLE was the odd one out in
      the SPTE code, so rename it to DEFAULT_SPTE_MMU_WRITABLE.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220125230713.1700406-1-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Check SPTE writable invariants when setting leaf SPTEs · 115111ef
      David Matlack authored
      Check SPTE writable invariants when setting SPTEs rather than in
      spte_can_locklessly_be_made_writable(). By the time KVM checks
      spte_can_locklessly_be_made_writable(), the SPTE has long been since
      corrupted.
      
      Note that these invariants only apply to shadow-present leaf SPTEs (i.e.
      not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the
      restriction and only instrument the code paths that set shadow-present
      leaf SPTEs.
      
      To account for access tracking, also check the SPTE writable invariants
      when marking an SPTE as an access track SPTE. This also lets us remove
      a redundant WARN from mark_spte_for_access_track().
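      
      The invariants themselves reduce to the sketch below (the mask arguments
      are simplified stand-ins for KVM's shadow_host_writable_mask,
      shadow_mmu_writable_mask and the hardware writable bit):
      
        /* Only meaningful for shadow-present leaf SPTEs. */
        static void check_writable_invariants(u64 spte, u64 host_writable,
                                              u64 mmu_writable, u64 hw_writable)
        {
                /* MMU-writable requires Host-writable ... */
                WARN_ON_ONCE((spte & mmu_writable) && !(spte & host_writable));
                /* ... and a hardware-writable SPTE must be MMU-writable. */
                WARN_ON_ONCE((spte & hw_writable) && !(spte & mmu_writable));
        }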
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220125230518.1697048-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Rename kvm_x86_ops pointers to align w/ preferred vendor names · e27bc044
      Sean Christopherson authored
      Rename a variety of kvm_x86_op function pointers so that preferred name
      for vendor implementations follows the pattern <vendor>_<function>, e.g.
      rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run().  This will
      allow vendor implementations to be wired up via the KVM_X86_OP macro.
      
      In many cases, VMX and SVM "disagree" on the preferred name, though in
      reality it's VMX and x86 that disagree as SVM blindly prepended _svm to
      the kvm_x86_ops name.  Justification for using the VMX nomenclature:
      
        - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
          event that has already been "set" in e.g. the vIRR.  SVM's relevant
          VMCB field is even named event_inj, and KVM's stat is irq_injections.
      
        - prepare_guest_switch => prepare_switch_to_guest because the former is
          ambiguous, e.g. it could mean switching between multiple guests,
          switching from the guest to host, etc...
      
        - update_pi_irte => pi_update_irte to match the rest
          of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>().
      
        - start_assignment => pi_start_assignment to again follow VMX's posted
          interrupt naming scheme, and to provide context for what bit of code
          might care about an otherwise undescribed "assignment".
      
      The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
      Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
      wrong.  x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
      variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
      appropriate name for the Hyper-V hooks would be flush_remote_tlbs.  Leave
      that change for another time as the Hyper-V hooks always start as NULL,
      i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
      names requires an astounding amount of churn.
      
      VMX and SVM function names are intentionally left as is to minimize the
      diff.  Both VMX and SVM will need to rename even more functions in order
      to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
      inevitable.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>