1. 12 5月, 2022 4 次提交
    • S
      KVM: x86/mmu: Make all page fault handlers internal to the MMU · 8a009d5b
      Sean Christopherson 提交于
      Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
      of the page fault handling collateral that was in mmu.h purely for the
      async #PF handler into mmu_internal.h, where it belongs.  This will allow
      kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
      expose those enums outside of the MMU.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-8-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8a009d5b
    • S
      KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" · 5276c616
      Sean Christopherson 提交于
      Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
      kvm_faultin_pfn() to signal that the page fault handler should continue
      doing its thing.  Aside from being gross and inefficient, using a boolean
      return to signal continue vs. stop makes it extremely difficult to add
      more helpers and/or move existing code to a helper.
      
      E.g. hypothetically, if nested MMUs were to gain a separate page fault
      handler in the future, everything up to the "is self-modifying PTE" check
      can be shared by all shadow MMUs, but communicating up the stack whether
      to continue on or stop becomes a nightmare.
      
      More concretely, proposed support for private guest memory ran into a
      similar issue, where it'll be forced to forego a helper in order to yield
      sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.
      
      No functional change intended.
      
      Cc: David Matlack <dmatlack@google.com>
      Cc: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-7-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5276c616
    • S
      KVM: x86/mmu: Drop exec/NX check from "page fault can be fast" · 5c64aba5
      Sean Christopherson 提交于
      Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
      faults in the access tracking case, and drop the exec/NX check that
      becomes redundant as a result.  No sane hardware will generate an access
      that is both an instruct fetch and a write, i.e. it's a waste of cycles.
      If hardware goes off the rails, or KVM runs under a misguided hypervisor,
      spuriously running throught fast path is benign (KVM has been uknowingly
      being doing exactly that for years).
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-6-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5c64aba5
    • S
      KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use · 54275f74
      Sean Christopherson 提交于
      Check for A/D bits being disabled instead of the access tracking mask
      being non-zero when deciding whether or not to attempt to fix a page
      fault vian the fast path.  Originally, the access tracking mask was
      non-zero if and only if A/D bits were disabled by _KVM_ (including not
      being supported by hardware), but that hasn't been true since nVMX was
      fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
      KVM to not use A/D bits while running L2 despite KVM using them while
      running L1.
      
      In other words, don't attempt the fast path just because EPT is enabled.
      
      Note, attempting the fast path for all !PRESENT faults can "fix" a very,
      _VERY_ tiny percentage of faults out of mmu_lock by detecting that the
      fault is spurious, i.e. has been fixed by a different vCPU, but again the
      odds of that happening are vanishingly small.  E.g. booting an 8-vCPU VM
      gets less than 10 successes out of 30k+ faults, and that's likely one of
      the more favorable scenarios.  Disabling dirty logging can likely lead to
      a rash of collisions between vCPUs for some workloads that operate on a
      common set of pages, but penalizing _all_ !PRESENT faults for that one
      case is unlikely to be a net positive, not to mention that that problem
      is best solved by not zapping in the first place.
      
      The number of spurious faults does scale with the number of vCPUs, e.g. a
      255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
      path (again out of 30k), but that's all of 0.2% of faults.  Using legacy
      shadow paging does get more spurious faults, and a few more detected out
      of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
      faults that are reflected into the guest), i.e. the extra detections are
      purely due to the sheer number of faults observed.
      
      On the other hand, getting a "negative" in the fast path takes in the
      neighborhood of 150-250 cycles.  So while it is tempting to keep/extend
      the current behavior, such a change needs to come with hard numbers
      showing that it's actually a win in the grand scheme, or any scheme for
      that matter.
      
      Fixes: 995f00a6 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-5-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      54275f74
  2. 03 5月, 2022 2 次提交
    • S
      KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits() · 54eb3ef5
      Sean Christopherson 提交于
      Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
      and into its callers.  Well, caller, since only one of its two callers
      doesn't already do the shadow-present check.
      
      Opportunistically move the helper to spte.c/h so that it can be used by
      the TDP MMU, which is also the primary motivation for the shadow-present
      change.  Unlike the legacy MMU, the TDP MMU uses a single path for clear
      leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
      MMU will need to check is_last_spte() prior to calling
      spte_has_volatile_bits(), and calling is_last_spte() without first
      calling is_shadow_present_spte() is at best odd, and at worst a violation
      of KVM's loosely defines SPTE rules.
      
      Note, mmu_spte_clear_track_bits() could likely skip the write entirely
      for SPTEs that are not shadow-present.  Leave that cleanup for a future
      patch to avoid introducing a functional change, and because the
      shadow-present check can likely be moved further up the stack, e.g.
      drop_large_spte() appears to be the only path that doesn't already
      explicitly check for a shadow-present SPTE.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-3-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      54eb3ef5
    • S
      KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D) · 706c9c55
      Sean Christopherson 提交于
      Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
      being volatile (unless they're volatile for other reasons, e.g. A/D bits).
      KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
      out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
      cleared just because the SPTE is MMU-writable.
      
      Rename the wrapper of MMU-writable to be more literal, the previous name
      of spte_can_locklessly_be_made_writable() is wrong and misleading.
      
      Fixes: c7ba5b48 ("KVM: MMU: fast path of handling guest page fault")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-2-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      706c9c55
  3. 30 4月, 2022 24 次提交
    • L
      KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guest · 84e5ffd0
      Lai Jiangshan 提交于
      When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is
      allocated with role.level = 5 and the guest pagetable's root gfn.
      
      And root_sp->spt[0] is also allocated with the same gfn and the same
      role except role.level = 4.  Luckily that they are different shadow
      pages, but only root_sp->spt[0] is the real translation of the guest
      pagetable.
      
      Here comes a problem:
      
      If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse)
      and uses the same gfn as the root page for nested NPT before and after
      switching gCR4_LA57.  The host (hCR4_LA57=1) might use the same root_sp
      for the guest even the guest switches gCR4_LA57.  The guest will see
      unexpected page mapped and L2 may exploit the bug and hurt L1.  It is
      lucky that the problem can't hurt L0.
      
      And three special cases need to be handled:
      
      The root_sp should be like role.direct=1 sometimes: its contents are
      not backed by gptes, root_sp->gfns is meaningless.  (For a normal high
      level sp in shadow paging, sp->gfns is often unused and kept zero, but
      it could be relevant and meaningful if sp->gfns is used because they
      are backed by concrete gptes.)
      
      For such root_sp in the case, root_sp is just a portal to contribute
      root_sp->spt[0], and root_sp->gfns should not be used and
      root_sp->spt[0] should not be dropped if gpte[0] of the guest root
      pagetable is changed.
      
      Such root_sp should not be accounted too.
      
      So add role.passthrough to distinguish the shadow pages in the hash
      when gCR4_LA57 is toggled and fix above special cases by using it in
      kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().
      Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      84e5ffd0
    • L
      KVM: X86/MMU: Add sp_has_gptes() · 767d8d8d
      Lai Jiangshan 提交于
      Add sp_has_gptes() which equals to !sp->role.direct currently.
      
      Shadow page having gptes needs to be write-protected, accounted and
      responded to kvm_mmu_pte_write().
      
      Use it in these places to replace !sp->role.direct and rename
      for_each_gfn_indirect_valid_sp.
      Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220420131204.2850-2-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      767d8d8d
    • P
      KVM: x86/mmu: replace direct_map with root_role.direct · 347a0d0d
      Paolo Bonzini 提交于
      direct_map is always equal to the direct field of the root page's role:
      
      - for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is
      copied from cpu_role.base.direct
      
      - for TDP, it is always true and root_role.direct is also always true
      
      - for shadow TDP, it is always false and root_role.direct is also always
      false
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      347a0d0d
    • P
      KVM: x86/mmu: replace root_level with cpu_role.base.level · 4d25502a
      Paolo Bonzini 提交于
      Remove another duplicate field of struct kvm_mmu.  This time it's
      the root level for page table walking; the separate field is
      always initialized as cpu_role.base.level, so its users can look
      up the CPU mode directly instead.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4d25502a
    • P
      KVM: x86/mmu: replace shadow_root_level with root_role.level · a972e29c
      Paolo Bonzini 提交于
      root_role.level is always the same value as shadow_level:
      
      - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
      
      - it's the level argument when going through kvm_init_shadow_ept_mmu
      
      - it's assigned directly from new_role.base.level when going
        through shadow_mmu_init_context
      
      Remove the duplication and get the level directly from the role.
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a972e29c
    • P
      KVM: x86/mmu: pull CPU mode computation to kvm_init_mmu · a7f1de9b
      Paolo Bonzini 提交于
      Do not lead init_kvm_*mmu into the temptation of poking
      into struct kvm_mmu_role_regs, by passing to it directly
      the CPU mode.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a7f1de9b
    • P
      KVM: x86/mmu: simplify and/or inline computation of shadow MMU roles · 56b321f9
      Paolo Bonzini 提交于
      Shadow MMUs compute their role from cpu_role.base, simply by adjusting
      the root level.  It's one line of code, so do not place it in a separate
      function.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      56b321f9
    • P
      KVM: x86/mmu: remove redundant bits from extended role · faf72962
      Paolo Bonzini 提交于
      Before the separation of the CPU and the MMU role, CR0.PG was not
      available in the base MMU role, because two-dimensional paging always
      used direct=1 in the MMU role.  However, now that the raw role is
      snapshotted in mmu->cpu_role, the value of CR0.PG always matches both
      !cpu_role.base.direct and cpu_role.base.level > 0.  There is no need to
      store it again in union kvm_mmu_extended_role; instead, write an is_cr0_pg
      accessor by hand that takes care of the conversion.  Use cpu_role.base.level
      since the future of the direct field is unclear.
      
      Likewise, CR4.PAE is now always present in the CPU role as
      !cpu_role.base.has_4_byte_gpte.  The inversion makes certain tests on
      the MMU role easier, and is easily hidden by the is_cr4_pae accessor
      when operating on the CPU role.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      faf72962
    • P
      KVM: x86/mmu: rename kvm_mmu_role union · 7a7ae829
      Paolo Bonzini 提交于
      It is quite confusing that the "full" union is called kvm_mmu_role
      but is used for the "cpu_role" field of struct kvm_mmu.  Rename it
      to kvm_cpu_role.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7a7ae829
    • P
      KVM: x86/mmu: remove extended bits from mmu_role, rename field · 7a458f0e
      Paolo Bonzini 提交于
      mmu_role represents the role of the root of the page tables.
      It does not need any extended bits, as those govern only KVM's
      page table walking; the is_* functions used for page table
      walking always use the CPU role.
      
      ext.valid is not present anymore in the MMU role, but an
      all-zero MMU role is impossible because the level field is
      never zero in the MMU role.  So just zap the whole mmu_role
      in order to force invalidation after CPUID is updated.
      
      While making this change, which requires touching almost every
      occurrence of "mmu_role", rename it to "root_role".
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7a458f0e
    • P
      KVM: x86/mmu: store shadow EFER.NX in the MMU role · 362505de
      Paolo Bonzini 提交于
      Now that the MMU role is separate from the CPU role, it can be a
      truthful description of the format of the shadow pages.  This includes
      whether the shadow pages use the NX bit; so force the efer_nx field
      of the MMU role when TDP is disabled, and remove the hardcoding it in
      the callers of reset_shadow_zero_bits_mask.
      
      In fact, the initialization of reserved SPTE bits can now be made common
      to shadow paging and shadow NPT; move it to shadow_mmu_init_context.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      362505de
    • P
      KVM: x86/mmu: cleanup computation of MMU roles for shadow paging · f417e145
      Paolo Bonzini 提交于
      Pass the already-computed CPU role, instead of redoing it.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f417e145
    • P
      KVM: x86/mmu: cleanup computation of MMU roles for two-dimensional paging · 2ba67677
      Paolo Bonzini 提交于
      Inline kvm_calc_mmu_role_common into its sole caller, and simplify it
      by removing the computation of unnecessary bits.
      
      Extended bits are unnecessary because page walking uses the CPU role,
      and EFER.NX/CR0.WP can be set to one unconditionally---matching the
      format of shadow pages rather than the format of guest pages.
      
      The MMU role for two dimensional paging does still depend on the CPU role,
      even if only barely so, due to SMM and guest mode; for consistency,
      pass it down to kvm_calc_tdp_mmu_root_page_role instead of querying
      the vcpu with is_smm or is_guest_mode.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2ba67677
    • P
      KVM: x86/mmu: remove kvm_calc_shadow_root_page_role_common · 19b5dcc3
      Paolo Bonzini 提交于
      kvm_calc_shadow_root_page_role_common is the same as
      kvm_calc_cpu_role except for the level, which is overwritten
      afterwards in kvm_calc_shadow_mmu_root_page_role
      and kvm_calc_shadow_npt_root_page_role.
      
      role.base.direct is already set correctly for the CPU role,
      and CR0.PG=1 is required for VMRUN so it will also be
      correct for nested NPT.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      19b5dcc3
    • P
      KVM: x86/mmu: remove ept_ad field · ec283cb1
      Paolo Bonzini 提交于
      The ept_ad field is used during page walk to determine if the guest PTEs
      have accessed and dirty bits.  In the MMU role, the ad_disabled
      bit represents whether the *shadow* PTEs have the bits, so it
      would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just
      !mmu->mmu_role.base.ad_disabled.
      
      However, the similar field in the CPU mode, ad_disabled, is initialized
      correctly: to the opposite value of ept_ad for shadow EPT, and zero
      for non-EPT guest paging modes (which always have A/D bits).  It is
      therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode,
      like other page-format fields; it just has to be inverted to account
      for the different polarity.
      
      In fact, now that the CPU mode is distinct from the MMU roles, it would
      even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and
      use !mmu->cpu_role.base.ad_disabled instead.  I am not doing this because
      the macro has a small effect in terms of dead code elimination:
      
         text	   data	    bss	    dec	    hex
       103544	  16665	    112	 120321	  1d601    # as of this patch
       103746	  16665	    112	 120523	  1d6cb    # without PT_HAVE_ACCESSED_DIRTY
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ec283cb1
    • P
      KVM: x86/mmu: do not recompute root level from kvm_mmu_role_regs · 60f3cb60
      Paolo Bonzini 提交于
      The root_level can be found in the cpu_role (in fact the field
      is superfluous and could be removed, but one thing at a time).
      Since there is only one usage left of role_regs_to_root_level,
      inline it into kvm_calc_cpu_role.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      60f3cb60
    • P
      KVM: x86/mmu: split cpu_role from mmu_role · e5ed0fb0
      Paolo Bonzini 提交于
      Snapshot the state of the processor registers that govern page walk into
      a new field of struct kvm_mmu.  This is a more natural representation
      than having it *mostly* in mmu_role but not exclusively; the delta
      right now is represented in other fields, such as root_level.
      
      The nested MMU now has only the CPU role; and in fact the new function
      kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role,
      except that it has role.base.direct equal to !CR0.PG.  For a walk-only
      MMU, "direct" has no meaning, but we set it to !CR0.PG so that
      role.ext.cr0_pg can go away in a future patch.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e5ed0fb0
    • P
      KVM: x86/mmu: remove "bool base_only" arguments · b8980508
      Paolo Bonzini 提交于
      The argument is always false now that kvm_mmu_calc_root_page_role has
      been removed.
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b8980508
    • P
      KVM: x86/mmu: pull computation of kvm_mmu_role_regs to kvm_init_mmu · 39e7e2bf
      Paolo Bonzini 提交于
      The init_kvm_*mmu functions, with the exception of shadow NPT,
      do not need to know the full values of CR0/CR4/EFER; they only
      need to know the bits that make up the "role".  This cleanup
      however will take quite a few incremental steps.  As a start,
      pull the common computation of the struct kvm_mmu_role_regs
      into their caller: all of them extract the struct from the vcpu
      as the very first step.
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      39e7e2bf
    • P
      KVM: x86/mmu: constify uses of struct kvm_mmu_role_regs · 82ffa13f
      Paolo Bonzini 提交于
      struct kvm_mmu_role_regs is computed just once and then accessed.  Use
      const to make this clearer, even though the const fields of struct
      kvm_mmu_role_regs already prevent (or make it harder...) to modify
      the contents of the struct.
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      82ffa13f
    • P
      KVM: x86/mmu: nested EPT cannot be used in SMM · daed87b8
      Paolo Bonzini 提交于
      The role.base.smm flag is always zero when setting up shadow EPT,
      do not bother copying it over from vcpu->arch.root_mmu.
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      daed87b8
    • S
      KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled · 8b9e74bf
      Sean Christopherson 提交于
      Clear enable_mmio_caching if hardware can't support MMIO caching and use
      the dedicated flag to detect if MMIO caching is enabled instead of
      assuming shadow_mmio_value==0 means MMIO caching is disabled.  TDX will
      use a zero value even when caching is enabled, and is_mmio_spte() isn't
      so hot that it needs to avoid an extra memory access, i.e. there's no
      reason to be super clever.  And the clever approach may not even be more
      performant, e.g. gcc-11 lands the extra check on a non-zero value inline,
      but puts the enable_mmio_caching out-of-line, i.e. avoids the few extra
      uops for non-MMIO SPTEs.
      
      Cc: Isaku Yamahata <isaku.yamahata@intel.com>
      Cc: Kai Huang <kai.huang@intel.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220420002747.3287931-1-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8b9e74bf
    • M
      KVM: x86/mmu: fix potential races when walking host page table · 44187235
      Mingwei Zhang 提交于
      KVM uses lookup_address_in_mm() to detect the hugepage size that the host
      uses to map a pfn.  The function suffers from several issues:
      
       - no usage of READ_ONCE(*). This allows multiple dereference of the same
         page table entry. The TOCTOU problem because of that may cause KVM to
         incorrectly treat a newly generated leaf entry as a nonleaf one, and
         dereference the content by using its pfn value.
      
       - the information returned does not match what KVM needs; for non-present
         entries it returns the level at which the walk was terminated, as long
         as the entry is not 'none'.  KVM needs level information of only 'present'
         entries, otherwise it may regard a non-present PXE entry as a present
         large page mapping.
      
       - the function is not safe for mappings that can be torn down, because it
         does not disable IRQs and because it returns a PTE pointer which is never
         safe to dereference after the function returns.
      
      So implement the logic for walking host page tables directly in KVM, and
      stop using lookup_address_in_mm().
      
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NMingwei Zhang <mizhang@google.com>
      Message-Id: <20220429031757.2042406-1-mizhang@google.com>
      [Inline in host_pfn_mapping_level, ensure no semantic change for its
       callers. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      44187235
    • S
      KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR · 86931ff7
      Sean Christopherson 提交于
      Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
      MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
      The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
      guest, possibly with help from userspace, manages to coerce KVM into
      creating a SPTE for an "impossible" gfn, KVM will leak the associated
      shadow pages (page tables):
      
        WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                      kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2d0 [kvm]
         kvm_vm_release+0x1d/0x30 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5b/0x90
         exit_to_user_mode_prepare+0xd2/0xe0
         syscall_exit_to_user_mode+0x1d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      On bare metal, encountering an impossible gpa in the page fault path is
      well and truly impossible, barring CPU bugs, as the CPU will signal #PF
      during the gva=>gpa translation (or a similar failure when stuffing a
      physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
      itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
      of the underlying hardware, in which case the hardware will not fault on
      the illegal-from-KVM's-perspective gpa.
      
      Alternatively, KVM could continue allowing the dodgy behavior and simply
      zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
      a (minor) waste of cycles, and more importantly, KVM can't reasonably
      support impossible memslots when running on bare metal (or with an
      accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
      KVM is running as a guest is not a safe option as the host isn't required
      to announce itself to the guest in any way, e.g. doesn't need to set the
      HYPERVISOR CPUID bit.
      
      A second alternative to disallowing the memslot behavior would be to
      disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
      restriction is undesirable as there are legitimate use cases for doing
      so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
      systems so that VMs can be migrated between hosts with different
      MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
      
      Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
      even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
      any reserved physical address bits).
      
      The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
      The memslot and TDP MMU code want an exclusive value, but the name
      implies the returned value is inclusive, and the MMIO path needs an
      inclusive check.
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220428233416.2446833-1-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      86931ff7
  4. 05 4月, 2022 1 次提交
    • S
      KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 1d0e8480
      Sean Christopherson 提交于
      Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as
      -1 is technically undefined behavior when its value is read out by
      param_get_bool(), as boolean values are supposed to be '0' or '1'.
      
      Alternatively, KVM could define a custom getter for the param, but the
      auto value doesn't depend on the vendor module in any way, and printing
      "auto" would be unnecessarily unfriendly to the user.
      
      In addition to fixing the undefined behavior, resolving the auto value
      also fixes the scenario where the auto value resolves to N and no vendor
      module is loaded.  Previously, -1 would result in Y being printed even
      though KVM would ultimately disable the mitigation.
      
      Rename the existing MMU module init/exit helpers to clarify that they're
      invoked with respect to the vendor module, and add comments to document
      why KVM has two separate "module init" flows.
      
        =========================================================================
        UBSAN: invalid-load in kernel/params.c:320:33
        load of value 255 is not a valid value for type '_Bool'
        CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         ubsan_epilogue+0x5/0x40
         __ubsan_handle_load_invalid_value.cold+0x43/0x48
         param_get_bool.cold+0xf/0x14
         param_attr_show+0x55/0x80
         module_attr_show+0x1c/0x30
         sysfs_kf_seq_show+0x93/0xc0
         seq_read_iter+0x11c/0x450
         new_sync_read+0x11b/0x1a0
         vfs_read+0xf0/0x190
         ksys_read+0x5f/0xe0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        =========================================================================
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Cc: stable@vger.kernel.org
      Reported-by: NBruno Goncalves <bgoncalv@redhat.com>
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220331221359.3912754-1-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1d0e8480
  5. 02 4月, 2022 7 次提交
    • H
      KVM: x86/mmu: Don't rebuild page when the page is synced and no tlb flushing is required · 8d5678a7
      Hou Wenlong 提交于
      Before Commit c3e5e415 ("KVM: X86: Change kvm_sync_page()
      to return true when remote flush is needed"), the return value
      of kvm_sync_page() indicates whether the page is synced, and
      kvm_mmu_get_page() would rebuild page when the sync fails.
      But now, kvm_sync_page() returns false when the page is
      synced and no tlb flushing is required, which leads to
      rebuild page in kvm_mmu_get_page(). So return the return
      value of mmu->sync_page() directly and check it in
      kvm_mmu_get_page(). If the sync fails, the page will be
      zapped and the invalid_list is not empty, so set flush as
      true is accepted in mmu_sync_children().
      
      Cc: stable@vger.kernel.org
      Fixes: c3e5e415 ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed")
      Signed-off-by: NHou Wenlong <houwenlong.hwl@antgroup.com>
      Acked-by: NLai Jiangshan <jiangshanlai@gmail.com>
      Message-Id: <0dabeeb789f57b0d793f85d073893063e692032d.1647336064.git.houwenlong.hwl@antgroup.com>
      [mmu_sync_children should not flush if the page is zapped. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8d5678a7
    • M
      KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set · 5959ff4a
      Maxim Levitsky 提交于
      It makes more sense to print new SPTE value than the
      old value.
      Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5959ff4a
    • L
      KVM: X86: Handle implicit supervisor access with SMAP · 4f4aa80e
      Lai Jiangshan 提交于
      There are two kinds of implicit supervisor access
      	implicit supervisor access when CPL = 3
      	implicit supervisor access when CPL < 3
      
      Current permission_fault() handles only the first kind for SMAP.
      
      But if the access is implicit when SMAP is on, data may not be read
      nor write from any user-mode address regardless the current CPL.
      
      So the second kind should be also supported.
      
      The first kind can be detect via CPL and access mode: if it is
      supervisor access and CPL = 3, it must be implicit supervisor access.
      
      But it is not possible to detect the second kind without extra
      information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
      into @access. This extra information also works for the first kind, so
      the logic is changed to use this information for both cases.
      
      The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48
      which is in the most significant 16 bits of u64 and less likely to be
      forced to change due to future hardware uses it.
      
      This patch removes the call to ->get_cpl() for access mode is determined
      by @access.  Not only does it reduce a function call, but also remove
      confusions when the permission is checked for nested TDP.  The nested
      TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing
      on it.  The original code works just because it is always user walk for
      NPT and SMAP fault is not set for EPT in update_permission_bitmask.
      Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4f4aa80e
    • L
      KVM: X86: Fix comments in update_permission_bitmask · 94b4a2f1
      Lai Jiangshan 提交于
      The commit 09f037aa ("KVM: MMU: speedup update_permission_bitmask")
      refactored the code of update_permission_bitmask() and change the
      comments.  It added a condition into a list to match the new code,
      so the number/order for conditions in the comments should be updated
      too.
      Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-3-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      94b4a2f1
    • L
      KVM: X86: Change the type of access u32 to u64 · 5b22bbe7
      Lai Jiangshan 提交于
      Change the type of access u32 to u64 for FNAME(walk_addr) and
      ->gva_to_gpa().
      
      The kinds of accesses are usually combinations of UWX, and VMX/SVM's
      nested paging adds a new factor of access: is it an access for a guest
      page table or for a final guest physical address.
      
      And SMAP relies a factor for supervisor access: explicit or implicit.
      
      So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include
      all these information to do the walk.
      
      Although @access(u32) has enough bits to encode all the kinds, this
      patch extends it to u64:
      	o Extra bits will be in the higher 32 bits, so that we can
      	  easily obtain the traditional access mode (UWX) by converting
      	  it to u32.
      	o Reuse the value for the access kind defined by SVM's nested
      	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
      	  @error_code in kvm_handle_page_fault().
      Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5b22bbe7
    • S
      KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap · f47e5bbb
      Sean Christopherson 提交于
      Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
      kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
      flush when processing multiple roots (including nested TDP shadow roots).
      Dropping the TLB flush resulted in random crashes when running Hyper-V
      Server 2019 in a guest with KSM enabled in the host (or any source of
      mmu_notifier invalidations, KSM is just the easiest to force).
      
      This effectively revert commits 873dd122
      and fcb93eb6, and thus restores commit
      cf3e2642, plus this delta on top:
      
      bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
              struct kvm_mmu_page *root;
      
              for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
      -               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
      +               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
      
              return flush;
       }
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Tested-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220325230348.2587437-1-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f47e5bbb
    • P
      KVM: MMU: propagate alloc_workqueue failure · a1a39128
      Paolo Bonzini 提交于
      If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
      to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
      kvm_arch_init_vm also has to undo all the initialization, so
      group all the MMU initialization code at the beginning and
      handle cleaning up of kvm_page_track_init.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a1a39128
  6. 21 3月, 2022 1 次提交
  7. 08 3月, 2022 1 次提交
    • P
      KVM: x86/mmu: Zap invalidated roots via asynchronous worker · 22b94c4b
      Paolo Bonzini 提交于
      Use the system worker threads to zap the roots invalidated
      by the TDP MMU's "fast zap" mechanism, implemented by
      kvm_tdp_mmu_invalidate_all_roots().
      
      At this point, apart from allowing some parallelism in the zapping of
      roots, the workqueue is a glorified linked list: work items are added and
      flushed entirely within a single kvm->slots_lock critical section.  However,
      the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
      assumes that it owns a reference to all invalid roots; therefore, no
      one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
      invalidated roots on a linked list... erm, on a workqueue ensures that
      tdp_mmu_zap_root_work() only puts back those extra references that
      kvm_mmu_zap_all_invalidated_roots() had gifted to it.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      22b94c4b