1. 25 Jun 2022: 15 commits
  2. 24 Jun 2022: 25 commits
    • P
      KVM: nVMX: clean up posted interrupt descriptor try_cmpxchg · 4de5c54f
      Committed by Paolo Bonzini
      Rely on try_cmpxchg64 for re-reading the PID on failure, using READ_ONCE
      only right before the first iteration.
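
      For illustration, a minimal sketch of the resulting pattern (pi_desc,
      old/new and dest are taken from the surrounding KVM code; this is a
      simplified approximation, not the actual patch):

          /*
           * try_cmpxchg64() refreshes "old" with the current descriptor
           * value when it fails, so READ_ONCE() is only needed once,
           * before the first iteration.
           */
          old.control = READ_ONCE(pi_desc->control);
          do {
                  new.control = old.control;
                  new.ndst = dest;        /* retarget the notification vCPU */
                  new.sn = 0;             /* clear suppress-notification    */
          } while (!try_cmpxchg64(&pi_desc->control, &old.control, new.control));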
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4de5c54f
    • Z
      KVM: selftests: Enhance handling WRMSR ICR register in x2APIC mode · 4b88b1a5
      Committed by Zeng Guang
      In some circumstances, e.g. when Intel IPI virtualization is enabled,
      hardware writes the x2APIC ICR register directly instead of going
      through software emulation. Hardware still performs the normal
      reserved-bits checking and expects those bits to be written as zero,
      otherwise it raises a #GP. So the test needs to mask the reserved bits
      out of the data written to the vICR register.
      
      Remove the Delivery Status bit emulation from the test case, as this
      flag is invalid and not needed in x2APIC mode. KVM may skip clearing
      it during interrupt dispatch, which would lead to a spurious test
      failure.
      
      Opportunistically correct the vector number used by the test that
      sends IPIs to non-existent vCPUs.
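
      As a rough sketch of the idea (the helper and mask names below are
      illustrative rather than the selftest's exact code; the layout follows
      the SDM's x2APIC ICR format):

          /* Fields architecturally defined for the x2APIC ICR; everything
           * else is reserved and must be written as zero. */
          #define X2APIC_ICR_VALID_MASK  (GENMASK_ULL(63, 32) | /* destination  */ \
                                          GENMASK_ULL(19, 18) | /* shorthand    */ \
                                          BIT_ULL(15)         | /* trigger mode */ \
                                          BIT_ULL(14)         | /* level        */ \
                                          BIT_ULL(11)         | /* dest. mode   */ \
                                          GENMASK_ULL(10, 0))   /* delivery mode + vector */

          static void x2apic_write_icr(uint64_t val)
          {
                  /* Reserved bits set to one would trigger a #GP. */
                  wrmsr(APIC_BASE_MSR + (APIC_ICR >> 4), val & X2APIC_ICR_VALID_MASK);
          }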
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Message-Id: <20220623094511.26066-1-guang.zeng@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4b88b1a5
    • J
      KVM: selftests: Add a self test for CMCI and UCNA emulations. · eede2065
      Committed by Jue Wang
      This patch adds a self test that verifies user space can inject
      UnCorrectable No Action required (UCNA) memory errors into the guest.
      It also verifies that incorrectly configured MSRs for Corrected
      Machine Check Interrupt (CMCI) emulation result in a #GP.
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-9-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eede2065
    • J
      KVM: x86: Enable CMCI capability by default and handle injected UCNA errors · aebc3ca1
      Committed by Jue Wang
      This patch enables MCG_CMCI_P by default in kvm_mce_cap_supported. It
      reuses the KVM_X86_SET_MCE ioctl to implement injection of
      UnCorrectable No Action required (UCNA) errors, which are signaled via
      Corrected Machine Check Interrupt (CMCI).
      
      Neither the CMCI nor the UCNA emulation depends on hardware.
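
      For a rough idea of how user space can use this (a sketch with an
      illustrative bank number and status bits, not the selftest's exact
      values):

          #include <linux/kvm.h>
          #include <sys/ioctl.h>

          #define MCI_STATUS_VAL    (1ULL << 63)
          #define MCI_STATUS_UC     (1ULL << 61)
          #define MCI_STATUS_EN     (1ULL << 60)
          #define MCI_STATUS_MISCV  (1ULL << 59)
          #define MCI_STATUS_ADDRV  (1ULL << 58)

          /* Inject a UCNA error into MC bank 4; if the guest has enabled
           * CMCI on that bank, KVM signals it via CMCI. */
          static int inject_ucna(int vcpu_fd, __u64 addr)
          {
                  struct kvm_x86_mce mce = {
                          .status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN |
                                    MCI_STATUS_MISCV | MCI_STATUS_ADDRV,
                          .addr   = addr,
                          .bank   = 4,
                  };

                  return ioctl(vcpu_fd, KVM_X86_SET_MCE, &mce);
          }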
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-8-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      aebc3ca1
    • J
      KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs. · 281b5278
      Committed by Jue Wang
      This patch adds the emulation of IA32_MCi_CTL2 registers to KVM. A
      separate mci_ctl2_banks array is used to keep the existing mce_banks
      register layout intact.
      
      In the Machine Check Architecture, in addition to MCG_CMCI_P, bit 30 of
      the per-bank register IA32_MCi_CTL2 controls whether Corrected Machine
      Check error reporting is enabled.
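
      A minimal sketch of the shape of this emulation (constant and field
      names follow the kernel's MCE definitions, but the snippet is an
      approximation of the patch, not the literal diff):

          /* Bit 30 of IA32_MCi_CTL2 enables CMCI reporting for that bank. */
          #define MCI_CTL2_CMCI_EN  BIT_ULL(30)

          /* kvm_vcpu_arch grows a per-bank array next to mce_banks: */
          u64 *mci_ctl2_banks;    /* one IA32_MCi_CTL2 value per MC bank */

          /* MSR write handling, roughly (inside the existing switch): */
          case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
                  vcpu->arch.mci_ctl2_banks[msr - MSR_IA32_MC0_CTL2] = data;
                  break;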
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-7-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      281b5278
    • J
      KVM: x86: Use kcalloc to allocate the mce_banks array. · 087acc4e
      Committed by Jue Wang
      This patch updates the allocation of mce_banks to use the array
      allocation API (kcalloc), as a precedent for the mci_ctl2_banks array
      added later to implement per-bank control of Corrected Machine Check
      Interrupt (CMCI).
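
      Roughly (a sketch of the allocation change, not the verbatim diff;
      each bank occupies four u64 registers, hence the factor of 4):

          /* Before: a flat kzalloc() sized by hand. */
          vcpu->arch.mce_banks = kzalloc(KVM_MAX_MCE_BANKS * sizeof(u64) * 4,
                                         GFP_KERNEL_ACCOUNT);

          /* After: kcalloc() makes the element count and size explicit. */
          vcpu->arch.mce_banks = kcalloc(KVM_MAX_MCE_BANKS * 4, sizeof(u64),
                                         GFP_KERNEL_ACCOUNT);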
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-6-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      087acc4e
    • J
      KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic. · 4b903561
      Committed by Jue Wang
      This patch calculates the number of LVT entries as part of
      KVM_X86_MCE_SETUP, conditioned on the presence of the MCG_CMCI_P bit
      in MCG_CAP, and stores the result in kvm_lapic. It translates from the
      APIC_LVTx register to an index in the lapic_lvt_entry enum. It extends
      the APIC_LVTx macro, as well as other lapic write/reset handling, to
      support the Corrected Machine Check Interrupt.
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-5-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4b903561
    • J
      KVM: x86: Add APIC_LVTx() macro. · 987f625e
      Committed by Jue Wang
      An APIC_LVTx macro is introduced to calculate the APIC_LVTx register
      offset based on the index in the lapic_lvt_entry enum. Later patches
      will extend the APIC_LVTx macro to support the APIC_LVTCMCI register
      in order to implement Corrected Machine Check Interrupt signaling.
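
      Conceptually, the macro is just the usual 0x10 stride of the LVT
      registers (a sketch; the CMCI special case comes in a later patch):

          /* LVT registers are laid out 0x10 apart starting at APIC_LVTT. */
          #define APIC_LVTx(x)  (APIC_LVTT + 0x10 * (x))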
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-4-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      987f625e
    • J
      KVM: x86: Fill apic_lvt_mask with enums / explicit entries. · 1d8c681f
      Committed by Jue Wang
      This patch defines a lapic_lvt_entry enum used as explicit indices
      into the apic_lvt_mask array. In later patches an LVT_CMCI entry will
      be added to implement Corrected Machine Check Interrupt signaling.
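
      For reference, a sketch of what such an enum looks like (entry names
      are approximate):

          enum lapic_lvt_entry {
                  LVT_TIMER,
                  LVT_THERMAL_MONITOR,
                  LVT_PERFORMANCE_COUNTER,
                  LVT_LINT0,
                  LVT_LINT1,
                  LVT_ERROR,
                  /* LVT_CMCI is appended by a later patch in the series. */

                  KVM_APIC_MAX_NR_LVT_ENTRIES,
          };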
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-3-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1d8c681f
    • J
      KVM: x86: Make APIC_VERSION capture only the magic 0x14UL. · 951ceb94
      Committed by Jue Wang
      Refactor APIC_VERSION so that the maximum number of LVT entries is
      inserted at runtime rather than compile time. This will be used in a
      subsequent commit to expose the LVT CMCI Register to VMs that support
      Corrected Machine Check error counting/signaling
      (IA32_MCG_CAP.MCG_CMCI_P=1).
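
      In rough terms the before/after shape is (a sketch, not the literal
      diff):

          /* Before: the LVT count is baked into the constant. */
          #define APIC_VERSION  (0x14UL | ((KVM_APIC_LVT_NUM - 1) << 16))

          /* After: only the magic version number remains a constant ... */
          #define APIC_VERSION  0x14UL

          /* ... and the per-vCPU LVT count is folded in at runtime. */
          kvm_lapic_set_reg(apic, APIC_LVR,
                            APIC_VERSION | ((nr_lvt_entries - 1) << 16));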
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-2-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      951ceb94
    • P
      KVM: x86/mmu: Avoid unnecessary flush on eager page split · 03787394
      Committed by Paolo Bonzini
      The TLB flush before installing the newly-populated lower level
      page table is unnecessary if the lower level page table maps
      the huge page identically.  KVM knows this is the case if it did not
      reuse an existing shadow page table; tell drop_large_spte() to skip
      the flush in that case.
      
      Extracted from a patch by David Matlack.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      03787394
    • D
      KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs · ada51a9d
      Committed by David Matlack
      Add support for Eager Page Splitting pages that are mapped by nested
      MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
      pages, and then splitting all 2MiB pages to 4KiB pages.
      
      Note, Eager Page Splitting is limited to nested MMUs as a policy rather
      than due to any technical reason (the sp->role.guest_mode check could
      just be deleted and Eager Page Splitting would work correctly for all
      shadow MMU pages). There is really no reason to support Eager Page
      Splitting for tdp_mmu=N, since such support will eventually be phased
      out, and there is no current use case supporting Eager Page Splitting on
      hosts where TDP is either disabled or unavailable in hardware.
      Furthermore, future improvements to nested MMU scalability may diverge
      the code from the legacy shadow paging implementation. These
      improvements will be simpler to make if Eager Page Splitting does not
      have to worry about legacy shadow paging.
      
      Splitting huge pages mapped by nested MMUs requires dealing with some
      extra complexity beyond that of the TDP MMU:
      
      (1) The shadow MMU has a limit on the number of shadow pages that are
          allowed to be allocated. So, as a policy, Eager Page Splitting
          refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
          pages available.
      
      (2) Splitting a huge page may end up re-using an existing lower level
          shadow page table. This is unlike the TDP MMU which always allocates
          new shadow page tables when splitting.
      
      (3) When installing the lower level SPTEs, they must be added to the
          rmap which may require allocating additional pte_list_desc structs.
      
      Case (2) is especially interesting since it may require a TLB flush,
      unlike the TDP MMU which can fully split huge pages without any TLB
      flushes. Specifically, an existing lower level page table may point to
      even lower level page tables that are not fully populated, effectively
      unmapping a portion of the huge page, which requires a flush.  As of
      this commit, a flush is always done after dropping the huge page
      and before installing the lower level page table.
      
      This TLB flush could instead be delayed until the MMU lock is about to be
      dropped, which would batch flushes for multiple splits.  However these
      flushes should be rare in practice (a huge page must be aliased in
      multiple SPTEs and have been split for NX Huge Pages in only some of
      them). Flushing immediately is simpler to plumb and also reduces the
      chances of tripping over a CPU bug (e.g. see iTLB multihit).
      
      [ This commit is based off of the original implementation of Eager Page
        Splitting from Peter in Google's kernel from 2016. ]
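
      At a high level, the split pass walks the rmaps one level at a time,
      roughly like this (a simplified sketch; walk_rmaps_at_level is a
      hypothetical stand-in for the slot rmap iteration helpers):

          /*
           * Split from the largest level down: first every 1GiB mapping to
           * 2MiB, then every 2MiB mapping to 4KiB, so pages end up fully
           * split even when an intermediate level already existed.
           */
          for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--)
                  walk_rmaps_at_level(kvm, slot, level,
                                      shadow_mmu_try_split_huge_pages);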
      Suggested-by: Peter Feiner <pfeiner@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ada51a9d
    • D
      KVM: Allow for different capacities in kvm_mmu_memory_cache structs · 837f66c7
      Committed by David Matlack
      Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
      declaration time rather than being fixed for all declarations. This will
      be used in a follow-up commit to declare a cache in x86 with a capacity
      of 512+ objects without having to increase the capacity of all caches in
      KVM.
      
      This change requires that each cache now specify its capacity at runtime,
      since the cache struct itself no longer has a fixed capacity known at
      compile time. To protect against someone accidentally defining a
      kvm_mmu_memory_cache struct directly (without the extra storage), this
      commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
      
      In order to support different capacities, this commit changes the
      objects pointer array to be dynamically allocated the first time the
      cache is topped-up.
      
      While here, opportunistically clean up the stack-allocated
      kvm_mmu_memory_cache structs in riscv and arm64 to use designated
      initializers.
      
      No functional change intended.
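
      A sketch of the resulting shape (field order and names are
      approximate):

          struct kvm_mmu_memory_cache {
                  gfp_t gfp_zero;
                  struct kmem_cache *kmem_cache;
                  int capacity;     /* chosen per declaration */
                  int nobjs;
                  void **objects;   /* allocated on first topup */
          };

          /* Stack-allocated caches now use designated initializers, e.g.: */
          struct kvm_mmu_memory_cache cache = {
                  .gfp_zero = __GFP_ZERO,
          };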
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-22-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      837f66c7
    • P
      KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page() · 0cd8dc73
      Committed by Paolo Bonzini
      Before allocating a child shadow page table, all callers check
      whether the parent already points to a huge page and, if so, they
      drop that SPTE.  This is done by drop_large_spte().
      
      However, dropping the large SPTE is really only necessary before the
      sp is installed.  While the sp is returned by kvm_mmu_get_child_sp(),
      installing it happens later in __link_shadow_page().  Move the call
      there instead of having it in each and every caller.
      
      To ensure that the shadow page is not linked twice if it was present,
      do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent:
      instead, return an error value if the shadow page already existed.
      This is a bit more verbose, but clearer than NULL.
      
      Finally, now that the drop_large_spte() name is not taken anymore,
      remove the two underscores in front of __drop_large_spte().
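
      Sketched out, the new contract looks roughly like this (simplified
      from the patch):

          static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
                                                           u64 *sptep, gfn_t gfn,
                                                           bool direct, unsigned int access)
          {
                  union kvm_mmu_page_role role;

                  /* A present, non-huge SPTE already points at a child. */
                  if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
                          return ERR_PTR(-EEXIST);

                  role = kvm_mmu_child_role(sptep, direct, access);
                  return kvm_mmu_get_shadow_page(vcpu, gfn, role);
          }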
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0cd8dc73
    • D
      KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels · 20d49186
      Committed by David Matlack
      Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This
      is fine for now since KVM never creates intermediate huge pages during
      dirty logging. In other words, KVM always replaces 1GiB pages directly
      with 4KiB pages, so there is no reason to look for collapsible 2MiB
      pages.
      
      However, this will stop being true once the shadow MMU participates in
      eager page splitting. During eager page splitting, each 1GiB page is first
      split into 2MiB pages and then those are split into 4KiB pages. The
      intermediate 2MiB pages may be left behind if an error condition causes
      eager page splitting to bail early.
      
      No functional change intended.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-20-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      20d49186
    • D
      KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU · 47855da0
      Committed by David Matlack
      Currently make_huge_page_split_spte() assumes execute permissions can be
      granted to any 4K SPTE when splitting huge pages. This is true for the
      TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
      shadowing a non-executable huge page.
      
      To fix this, pass in the role of the child shadow page where the huge
      page will be split and derive the execution permission from that.  This
      is correct because huge pages are always split with a direct shadow page
      and thus the shadow page role contains the correct access permissions.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-19-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      47855da0
    • D
      KVM: x86/mmu: Cache the access bits of shadowed translations · 6a97575d
      Committed by David Matlack
      Splitting huge pages requires allocating/finding shadow pages to replace
      the huge page. Shadow pages are keyed, in part, off the guest access
      permissions they are shadowing. For fully direct MMUs, there is no
      shadowing so the access bits in the shadow page role are always ACC_ALL.
      But during shadow paging, the guest can enforce whatever access
      permissions it wants.
      
      In particular, eager page splitting needs to know the permissions to use
      for the subpages, but KVM cannot retrieve them from the guest page
      tables because eager page splitting does not have a vCPU.  Fortunately,
      the guest access permissions are easy to cache whenever page faults or
      FNAME(sync_page) update the shadow page tables; this is an extension of
      the existing cache of the shadowed GFNs in the gfns array of the shadow
      page.  The access bits take up only 3 bits, which leaves 61 bits for
      the gfn, more than enough.
      
      Now that the gfns array caches more information than just GFNs, rename
      it to shadowed_translation.
      
      While here, preemptively fix up the WARN_ON() that detects gfn
      mismatches in direct SPs. The WARN_ON() was paired with a
      pr_err_ratelimited(), which means that users could sometimes see the
      WARN without the accompanying error message. Fix this by outputting the
      error message as part of the WARN splat, and opportunistically make
      them WARN_ONCE() because if these ever fire, they are all but guaranteed
      to fire a lot and will bring down the kernel.
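
      Conceptually, each entry packs the shadowed gfn and the 3 access bits
      into a single 64-bit value, along these lines (helper names here are
      hypothetical; the patch uses its own accessors):

          /* Low bits hold the guest access permissions, the rest the gfn. */
          static void sp_set_translation(struct kvm_mmu_page *sp, int index,
                                         gfn_t gfn, unsigned int access)
          {
                  sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
          }

          static gfn_t sp_get_gfn(struct kvm_mmu_page *sp, int index)
          {
                  return sp->shadowed_translation[index] >> PAGE_SHIFT;
          }

          static unsigned int sp_get_access(struct kvm_mmu_page *sp, int index)
          {
                  return sp->shadowed_translation[index] & ACC_ALL;
          }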
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6a97575d
    • D
      KVM: x86/mmu: Update page stats in __rmap_add() · 81cb4657
      Committed by David Matlack
      Update the page stats in __rmap_add() rather than at the call site. This
      will avoid having to manually update page stats when splitting huge
      pages in a subsequent commit.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-17-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      81cb4657
    • D
      KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu · 2ff9039a
      Committed by David Matlack
      Allow adding new entries to the rmap and linking shadow pages without a
      struct kvm_vcpu pointer by moving the implementation of rmap_add() and
      link_shadow_page() into inner helper functions.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-16-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2ff9039a
    • D
      KVM: x86/mmu: Pass const memslot to rmap_add() · 6ec6509e
      Committed by David Matlack
      Constify rmap_add()'s @slot parameter; it is simply passed on to
      gfn_to_rmap(), which takes a const memslot.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-15-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ec6509e
    • D
      KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page() · cbd858b1
      Committed by David Matlack
      Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
      caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
      indirect shadow pages, so it's safe to pass in NULL when looking up
      direct shadow pages.
      
      This will be used for doing eager page splitting, which allocates direct
      shadow pages from the context of a VM ioctl without access to a vCPU
      pointer.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-14-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cbd858b1
    • D
      KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page() · 3cc736b3
      Committed by David Matlack
      Get the kvm pointer from the caller, rather than deriving it from
      vcpu->kvm, and plumb the kvm pointer all the way from
      kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer
      is only needed to sync indirect shadow pages. In other words,
      __kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages
      without a vcpu pointer. This enables eager page splitting, which needs
      to allocate direct shadow pages during VM ioctls.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-13-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3cc736b3
    • D
      KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page() · 336081fb
      Committed by David Matlack
      The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the
      kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-12-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      336081fb
    • D
      KVM: x86/mmu: Pass memory caches to allocate SPs separately · 2f8b1b53
      Committed by David Matlack
      Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
      will allocate the various pieces of memory for shadow pages as a
      parameter, rather than deriving them from the vcpu pointer. This will be
      useful in a future commit where shadow pages are allocated during VM
      ioctls for eager page splitting, and thus will use a different set of
      caches.
      
      Preemptively pull the caches out all the way to
      kvm_mmu_get_shadow_page() since eager page splitting will not be calling
      kvm_mmu_alloc_shadow_page() directly.
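
      The caches end up bundled in a small struct along these lines (a
      sketch; member names are approximate):

          struct shadow_page_caches {
                  struct kvm_mmu_memory_cache *page_header_cache;
                  struct kvm_mmu_memory_cache *shadow_page_cache;
                  struct kvm_mmu_memory_cache *gfn_array_cache;
          };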
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-11-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2f8b1b53
    • D
      KVM: x86/mmu: Move guest PT write-protection to account_shadowed() · be911771
      Committed by David Matlack
      Move the code that write-protects newly-shadowed guest page tables into
      account_shadowed(). This avoids an extra gfn-to-memslot lookup and is a
      more logical place for this code to live. But most importantly, this
      reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
      kvm_vcpu pointer, which will be necessary when creating new shadow pages
      during VM ioctls for eager page splitting.
      
      Note, it is safe to drop the role.level == PG_LEVEL_4K check since
      account_shadowed() returns early if role.level > PG_LEVEL_4K.
      
      No functional change intended.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-10-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      be911771