1. 24 6月, 2022 4 次提交
  2. 20 6月, 2022 7 次提交
    • S
      KVM: x86/mmu: Shove refcounted page dependency into host_pfn_mapping_level() · 5d49f08c
      Sean Christopherson 提交于
      Move the check that restricts mapping huge pages into the guest to pfns
      that are backed by refcounted 'struct page' memory into the helper that
      actually "requires" a 'struct page', host_pfn_mapping_level().  In
      addition to deduplicating code, moving the check to the helper eliminates
      the subtle requirement that the caller check that the incoming pfn is
      backed by a refcounted struct page, and as an added bonus avoids an extra
      pfn_to_page() lookup.
      
      Note, the is_error_noslot_pfn() check in kvm_mmu_hugepage_adjust() needs
      to stay where it is, as it guards against dereferencing a NULL memslot in
      the kvm_slot_dirty_track_enabled() that follows.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-11-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5d49f08c
    • S
      KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() · b14b2690
      Sean Christopherson 提交于
      Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()
      to better reflect what KVM is actually checking, and to eliminate extra
      pfn_to_page() lookups.  The kvm_release_pfn_*() an kvm_try_get_pfn()
      helpers in particular benefit from "refouncted" nomenclature, as it's not
      all that obvious why KVM needs to get/put refcounts for some PG_reserved
      pages (ZERO_PAGE and ZONE_DEVICE).
      
      Add a comment to call out that the list of exceptions to PG_reserved is
      all but guaranteed to be incomplete.  The list has mostly been compiled
      by people throwing noodles at KVM and finding out they stick a little too
      well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages
      didn't get freed.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-10-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b14b2690
    • S
      KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page() · 284dc493
      Sean Christopherson 提交于
      Operate on a 'struct page' instead of a pfn when checking if a page is a
      ZONE_DEVICE page, and rename the helper accordingly.  Generally speaking,
      KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do
      anything special for ZONE_DEVICE memory.  Rather, KVM wants to treat
      ZONE_DEVICE memory like regular memory, and the need to identify
      ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In
      other words, KVM should only ever check for ZONE_DEVICE memory after KVM
      has already verified that there is a struct page associated with the pfn.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220429010416.2788472-9-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      284dc493
    • S
      KVM: x86/mmu: Use common logic for computing the 32/64-bit base PA mask · 70e41c31
      Sean Christopherson 提交于
      Use common logic for computing PT_BASE_ADDR_MASK for 32-bit, 64-bit, and
      EPT paging.  Both PAGE_MASK and the new-common logic are supsersets of
      what is actually needed for 32-bit paging.  PAGE_MASK sets bits 63:12 and
      the former GUEST_PT64_BASE_ADDR_MASK sets bits 51:12, so regardless of
      which value is used, the result will always be bits 31:12.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-9-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      70e41c31
    • S
      KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs · 2ca3129e
      Sean Christopherson 提交于
      Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs
      (PT64).  SPTE and PT64 are _mostly_ the same, but the few differences are
      quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and
      guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is
      very much specific to SPTEs.
      
      Opportunistically (and temporarily) move most guest macros into paging.h
      to clearly associate them with shadow paging, and to ensure that they're
      not used as of this commit.  A future patch will eliminate them entirely.
      
      Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's
      needed for the quadrant calculation in kvm_mmu_get_page().  The quadrant
      calculation is hot enough (when using shadow paging with 32-bit guests)
      that adding a per-context helper is undesirable, and burying the
      computation in paging_tmpl.h with a forward declaration isn't exactly an
      improvement.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-6-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2ca3129e
    • S
      KVM: x86/mmu: Dedup macros for computing various page table masks · 42c88ff8
      Sean Christopherson 提交于
      Provide common helper macros to generate various masks, shifts, etc...
      for 32-bit vs. 64-bit page tables.  Only the inputs differ, the actual
      calculations are identical.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-5-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      42c88ff8
    • S
      KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h · b3fcdb04
      Sean Christopherson 提交于
      Move a handful of one-off macros and helpers for 32-bit PSE paging into
      paging_tmpl.h and hide them behind "PTTYPE == 32".  Under no circumstance
      should anything but 32-bit shadow paging care about PSE paging.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220614233328.3896033-4-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b3fcdb04
  3. 15 6月, 2022 3 次提交
  4. 09 6月, 2022 1 次提交
    • Y
      KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs · d2263de1
      Yuan Yao 提交于
      Assign shadow_me_value, not shadow_me_mask, to PAE root entries,
      a.k.a. shadow PDPTRs, when host memory encryption is supported.  The
      "mask" is the set of all possible memory encryption bits, e.g. MKTME
      KeyIDs, whereas "value" holds the actual value that needs to be
      stuffed into host page tables.
      
      Using shadow_me_mask results in a failed VM-Entry due to setting
      reserved PA bits in the PDPTRs, and ultimately causes an OOPS due to
      physical addresses with non-zero MKTME bits sending to_shadow_page()
      into the weeds:
      
      set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
      BUG: unable to handle page fault for address: ffd43f00063049e8
      PGD 86dfd8067 P4D 0
      Oops: 0000 [#1] PREEMPT SMP
      RIP: 0010:mmu_free_root_page+0x3c/0x90 [kvm]
       kvm_mmu_free_roots+0xd1/0x200 [kvm]
       __kvm_mmu_unload+0x29/0x70 [kvm]
       kvm_mmu_unload+0x13/0x20 [kvm]
       kvm_arch_destroy_vm+0x8a/0x190 [kvm]
       kvm_put_kvm+0x197/0x2d0 [kvm]
       kvm_vm_release+0x21/0x30 [kvm]
       __fput+0x8e/0x260
       ____fput+0xe/0x10
       task_work_run+0x6f/0xb0
       do_exit+0x327/0xa90
       do_group_exit+0x35/0xa0
       get_signal+0x911/0x930
       arch_do_signal_or_restart+0x37/0x720
       exit_to_user_mode_prepare+0xb2/0x140
       syscall_exit_to_user_mode+0x16/0x30
       do_syscall_64+0x4e/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: e54f1ff2 ("KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask")
      Signed-off-by: NYuan Yao <yuan.yao@intel.com>
      Reviewed-by: NKai Huang <kai.huang@intel.com>
      Message-Id: <20220608012015.19566-1-yuan.yao@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d2263de1
  5. 07 6月, 2022 1 次提交
  6. 21 5月, 2022 1 次提交
    • P
      KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID · 9f46c187
      Paolo Bonzini 提交于
      With shadow paging enabled, the INVPCID instruction results in a call
      to kvm_mmu_invpcid_gva.  If INVPCID is executed with CR0.PG=0, the
      invlpg callback is not set and the result is a NULL pointer dereference.
      Fix it trivially by checking for mmu->invlpg before every call.
      
      There are other possibilities:
      
      - check for CR0.PG, because KVM (like all Intel processors after P5)
        flushes guest TLB on CR0.PG changes so that INVPCID/INVLPG are a
        nop with paging disabled
      
      - check for EFER.LMA, because KVM syncs and flushes when switching
        MMU contexts outside of 64-bit mode
      
      All of these are tricky, go for the simple solution.  This is CVE-2022-1789.
      Reported-by: NYongkang Jia <kangel@zju.edu.cn>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9f46c187
  7. 12 5月, 2022 9 次提交
    • S
      KVM: x86/mmu: Update number of zapped pages even if page list is stable · b28cb0cd
      Sean Christopherson 提交于
      When zapping obsolete pages, update the running count of zapped pages
      regardless of whether or not the list has become unstable due to zapping
      a shadow page with its own child shadow pages.  If the VM is backed by
      mostly 4kb pages, KVM can zap an absurd number of SPTEs without bumping
      the batch count and thus without yielding.  In the worst case scenario,
      this can cause a soft lokcup.
      
       watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [dirty_log_perf_:13020]
         RIP: 0010:workingset_activation+0x19/0x130
         mark_page_accessed+0x266/0x2e0
         kvm_set_pfn_accessed+0x31/0x40
         mmu_spte_clear_track_bits+0x136/0x1c0
         drop_spte+0x1a/0xc0
         mmu_page_zap_pte+0xef/0x120
         __kvm_mmu_prepare_zap_page+0x205/0x5e0
         kvm_mmu_zap_all_fast+0xd7/0x190
         kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
         kvm_page_track_flush_slot+0x5c/0x80
         kvm_arch_flush_shadow_memslot+0xe/0x10
         kvm_set_memslot+0x1a8/0x5d0
         __kvm_set_memory_region+0x337/0x590
         kvm_vm_ioctl+0xb08/0x1040
      
      Fixes: fbb158cb ("KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""")
      Reported-by: NDavid Matlack <dmatlack@google.com>
      Reviewed-by: NBen Gardon <bgardon@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220511145122.3133334-1-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b28cb0cd
    • V
      KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmaps · 6ba1e04f
      Vipin Sharma 提交于
      Avoid calling handlers on empty rmap entries and skip to the next non
      empty rmap entry.
      
      Empty rmap entries are noop in handlers.
      Signed-off-by: NVipin Sharma <vipinsh@google.com>
      Suggested-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220502220347.174664-1-vipinsh@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6ba1e04f
    • K
      KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask · e54f1ff2
      Kai Huang 提交于
      Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of
      high bits of physical address bits as 'KeyID' bits.  Intel Trust Domain
      Extentions (TDX) further steals part of MKTME KeyID bits as TDX private
      KeyID bits.  TDX private KeyID bits cannot be set in any mapping in the
      host kernel since they can only be accessed by software running inside a
      new CPU isolated mode.  And unlike to AMD's SME, host kernel doesn't set
      any legacy MKTME KeyID bits to any mapping either.  Therefore, it's not
      legitimate for KVM to set any KeyID bits in SPTE which maps guest
      memory.
      
      KVM maintains shadow_zero_check bits to represent which bits must be
      zero for SPTE which maps guest memory.  MKTME KeyID bits should be set
      to shadow_zero_check.  Currently, shadow_me_mask is used by AMD to set
      the sme_me_mask to SPTE, and shadow_me_shadow is excluded from
      shadow_zero_check.  So initializing shadow_me_mask to represent all
      MKTME keyID bits doesn't work for VMX (as oppositely, they must be set
      to shadow_zero_check).
      
      Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
      and repurpose shadow_me_mask as 'all possible memory encryption bits'.
      The new schematic of them will be:
      
       - shadow_me_value: the memory encryption bit(s) that will be set to the
         SPTE (the original shadow_me_mask).
       - shadow_me_mask: all possible memory encryption bits (which is a super
         set of shadow_me_value).
       - For now, shadow_me_value is supposed to be set by SVM and VMX
         respectively, and it is a constant during KVM's life time.  This
         perhaps doesn't fit MKTME but for now host kernel doesn't support it
         (and perhaps will never do).
       - Bits in shadow_me_mask are set to shadow_zero_check, except the bits
         in shadow_me_value.
      
      Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
      Replace shadow_me_mask with shadow_me_value in almost all code paths,
      except the one in PT64_PERM_MASK, which is used by need_remote_flush()
      to determine whether remote TLB flush is needed.  This should still use
      shadow_me_mask as any encryption bit change should need a TLB flush.
      And for AMD, move initializing shadow_me_value/shadow_me_mask from
      kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
      Signed-off-by: NKai Huang <kai.huang@intel.com>
      Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e54f1ff2
    • K
      KVM: x86/mmu: Rename reset_rsvds_bits_mask() · c919e881
      Kai Huang 提交于
      Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make
      it clearer that it resets the reserved bits check for guest's page table
      entries.
      Signed-off-by: NKai Huang <kai.huang@intel.com>
      Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c919e881
    • S
      KVM: x86/mmu: Expand and clean up page fault stats · 1075d41e
      Sean Christopherson 提交于
      Expand and clean up the page fault stats.  The current stats are at best
      incomplete, and at worst misleading.  Differentiate between faults that
      are actually fixed vs those that result in an MMIO SPTE being created,
      track faults that are spurious, faults that trigger emulation, faults
      that that are fixed in the fast path, and last but not least, track the
      number of faults that are taken.
      
      Note, the number of faults that require emulation for write-protected
      shadow pages can roughly be calculated by subtracting the number of MMIO
      SPTEs created from the overall number of faults that trigger emulation.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-10-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1075d41e
    • S
      KVM: x86/mmu: Make all page fault handlers internal to the MMU · 8a009d5b
      Sean Christopherson 提交于
      Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
      of the page fault handling collateral that was in mmu.h purely for the
      async #PF handler into mmu_internal.h, where it belongs.  This will allow
      kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
      expose those enums outside of the MMU.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-8-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8a009d5b
    • S
      KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" · 5276c616
      Sean Christopherson 提交于
      Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
      kvm_faultin_pfn() to signal that the page fault handler should continue
      doing its thing.  Aside from being gross and inefficient, using a boolean
      return to signal continue vs. stop makes it extremely difficult to add
      more helpers and/or move existing code to a helper.
      
      E.g. hypothetically, if nested MMUs were to gain a separate page fault
      handler in the future, everything up to the "is self-modifying PTE" check
      can be shared by all shadow MMUs, but communicating up the stack whether
      to continue on or stop becomes a nightmare.
      
      More concretely, proposed support for private guest memory ran into a
      similar issue, where it'll be forced to forego a helper in order to yield
      sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.
      
      No functional change intended.
      
      Cc: David Matlack <dmatlack@google.com>
      Cc: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-7-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5276c616
    • S
      KVM: x86/mmu: Drop exec/NX check from "page fault can be fast" · 5c64aba5
      Sean Christopherson 提交于
      Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
      faults in the access tracking case, and drop the exec/NX check that
      becomes redundant as a result.  No sane hardware will generate an access
      that is both an instruct fetch and a write, i.e. it's a waste of cycles.
      If hardware goes off the rails, or KVM runs under a misguided hypervisor,
      spuriously running throught fast path is benign (KVM has been uknowingly
      being doing exactly that for years).
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-6-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5c64aba5
    • S
      KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use · 54275f74
      Sean Christopherson 提交于
      Check for A/D bits being disabled instead of the access tracking mask
      being non-zero when deciding whether or not to attempt to fix a page
      fault vian the fast path.  Originally, the access tracking mask was
      non-zero if and only if A/D bits were disabled by _KVM_ (including not
      being supported by hardware), but that hasn't been true since nVMX was
      fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
      KVM to not use A/D bits while running L2 despite KVM using them while
      running L1.
      
      In other words, don't attempt the fast path just because EPT is enabled.
      
      Note, attempting the fast path for all !PRESENT faults can "fix" a very,
      _VERY_ tiny percentage of faults out of mmu_lock by detecting that the
      fault is spurious, i.e. has been fixed by a different vCPU, but again the
      odds of that happening are vanishingly small.  E.g. booting an 8-vCPU VM
      gets less than 10 successes out of 30k+ faults, and that's likely one of
      the more favorable scenarios.  Disabling dirty logging can likely lead to
      a rash of collisions between vCPUs for some workloads that operate on a
      common set of pages, but penalizing _all_ !PRESENT faults for that one
      case is unlikely to be a net positive, not to mention that that problem
      is best solved by not zapping in the first place.
      
      The number of spurious faults does scale with the number of vCPUs, e.g. a
      255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
      path (again out of 30k), but that's all of 0.2% of faults.  Using legacy
      shadow paging does get more spurious faults, and a few more detected out
      of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
      faults that are reflected into the guest), i.e. the extra detections are
      purely due to the sheer number of faults observed.
      
      On the other hand, getting a "negative" in the fast path takes in the
      neighborhood of 150-250 cycles.  So while it is tempting to keep/extend
      the current behavior, such a change needs to come with hard numbers
      showing that it's actually a win in the grand scheme, or any scheme for
      that matter.
      
      Fixes: 995f00a6 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-5-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      54275f74
  8. 03 5月, 2022 2 次提交
    • S
      KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits() · 54eb3ef5
      Sean Christopherson 提交于
      Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
      and into its callers.  Well, caller, since only one of its two callers
      doesn't already do the shadow-present check.
      
      Opportunistically move the helper to spte.c/h so that it can be used by
      the TDP MMU, which is also the primary motivation for the shadow-present
      change.  Unlike the legacy MMU, the TDP MMU uses a single path for clear
      leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
      MMU will need to check is_last_spte() prior to calling
      spte_has_volatile_bits(), and calling is_last_spte() without first
      calling is_shadow_present_spte() is at best odd, and at worst a violation
      of KVM's loosely defines SPTE rules.
      
      Note, mmu_spte_clear_track_bits() could likely skip the write entirely
      for SPTEs that are not shadow-present.  Leave that cleanup for a future
      patch to avoid introducing a functional change, and because the
      shadow-present check can likely be moved further up the stack, e.g.
      drop_large_spte() appears to be the only path that doesn't already
      explicitly check for a shadow-present SPTE.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-3-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      54eb3ef5
    • S
      KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D) · 706c9c55
      Sean Christopherson 提交于
      Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
      being volatile (unless they're volatile for other reasons, e.g. A/D bits).
      KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
      out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
      cleared just because the SPTE is MMU-writable.
      
      Rename the wrapper of MMU-writable to be more literal, the previous name
      of spte_can_locklessly_be_made_writable() is wrong and misleading.
      
      Fixes: c7ba5b48 ("KVM: MMU: fast path of handling guest page fault")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-2-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      706c9c55
  9. 30 4月, 2022 12 次提交
    • L
      KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guest · 84e5ffd0
      Lai Jiangshan 提交于
      When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is
      allocated with role.level = 5 and the guest pagetable's root gfn.
      
      And root_sp->spt[0] is also allocated with the same gfn and the same
      role except role.level = 4.  Luckily that they are different shadow
      pages, but only root_sp->spt[0] is the real translation of the guest
      pagetable.
      
      Here comes a problem:
      
      If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse)
      and uses the same gfn as the root page for nested NPT before and after
      switching gCR4_LA57.  The host (hCR4_LA57=1) might use the same root_sp
      for the guest even the guest switches gCR4_LA57.  The guest will see
      unexpected page mapped and L2 may exploit the bug and hurt L1.  It is
      lucky that the problem can't hurt L0.
      
      And three special cases need to be handled:
      
      The root_sp should be like role.direct=1 sometimes: its contents are
      not backed by gptes, root_sp->gfns is meaningless.  (For a normal high
      level sp in shadow paging, sp->gfns is often unused and kept zero, but
      it could be relevant and meaningful if sp->gfns is used because they
      are backed by concrete gptes.)
      
      For such root_sp in the case, root_sp is just a portal to contribute
      root_sp->spt[0], and root_sp->gfns should not be used and
      root_sp->spt[0] should not be dropped if gpte[0] of the guest root
      pagetable is changed.
      
      Such root_sp should not be accounted too.
      
      So add role.passthrough to distinguish the shadow pages in the hash
      when gCR4_LA57 is toggled and fix above special cases by using it in
      kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().
      Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      84e5ffd0
    • L
      KVM: X86/MMU: Add sp_has_gptes() · 767d8d8d
      Lai Jiangshan 提交于
      Add sp_has_gptes() which equals to !sp->role.direct currently.
      
      Shadow page having gptes needs to be write-protected, accounted and
      responded to kvm_mmu_pte_write().
      
      Use it in these places to replace !sp->role.direct and rename
      for_each_gfn_indirect_valid_sp.
      Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220420131204.2850-2-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      767d8d8d
    • P
      KVM: x86/mmu: replace direct_map with root_role.direct · 347a0d0d
      Paolo Bonzini 提交于
      direct_map is always equal to the direct field of the root page's role:
      
      - for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is
      copied from cpu_role.base.direct
      
      - for TDP, it is always true and root_role.direct is also always true
      
      - for shadow TDP, it is always false and root_role.direct is also always
      false
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      347a0d0d
    • P
      KVM: x86/mmu: replace root_level with cpu_role.base.level · 4d25502a
      Paolo Bonzini 提交于
      Remove another duplicate field of struct kvm_mmu.  This time it's
      the root level for page table walking; the separate field is
      always initialized as cpu_role.base.level, so its users can look
      up the CPU mode directly instead.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4d25502a
    • P
      KVM: x86/mmu: replace shadow_root_level with root_role.level · a972e29c
      Paolo Bonzini 提交于
      root_role.level is always the same value as shadow_level:
      
      - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
      
      - it's the level argument when going through kvm_init_shadow_ept_mmu
      
      - it's assigned directly from new_role.base.level when going
        through shadow_mmu_init_context
      
      Remove the duplication and get the level directly from the role.
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a972e29c
    • P
      KVM: x86/mmu: pull CPU mode computation to kvm_init_mmu · a7f1de9b
      Paolo Bonzini 提交于
      Do not lead init_kvm_*mmu into the temptation of poking
      into struct kvm_mmu_role_regs, by passing to it directly
      the CPU mode.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a7f1de9b
    • P
      KVM: x86/mmu: simplify and/or inline computation of shadow MMU roles · 56b321f9
      Paolo Bonzini 提交于
      Shadow MMUs compute their role from cpu_role.base, simply by adjusting
      the root level.  It's one line of code, so do not place it in a separate
      function.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      56b321f9
    • P
      KVM: x86/mmu: remove redundant bits from extended role · faf72962
      Paolo Bonzini 提交于
      Before the separation of the CPU and the MMU role, CR0.PG was not
      available in the base MMU role, because two-dimensional paging always
      used direct=1 in the MMU role.  However, now that the raw role is
      snapshotted in mmu->cpu_role, the value of CR0.PG always matches both
      !cpu_role.base.direct and cpu_role.base.level > 0.  There is no need to
      store it again in union kvm_mmu_extended_role; instead, write an is_cr0_pg
      accessor by hand that takes care of the conversion.  Use cpu_role.base.level
      since the future of the direct field is unclear.
      
      Likewise, CR4.PAE is now always present in the CPU role as
      !cpu_role.base.has_4_byte_gpte.  The inversion makes certain tests on
      the MMU role easier, and is easily hidden by the is_cr4_pae accessor
      when operating on the CPU role.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      faf72962
    • P
      KVM: x86/mmu: rename kvm_mmu_role union · 7a7ae829
      Paolo Bonzini 提交于
      It is quite confusing that the "full" union is called kvm_mmu_role
      but is used for the "cpu_role" field of struct kvm_mmu.  Rename it
      to kvm_cpu_role.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7a7ae829
    • P
      KVM: x86/mmu: remove extended bits from mmu_role, rename field · 7a458f0e
      Paolo Bonzini 提交于
      mmu_role represents the role of the root of the page tables.
      It does not need any extended bits, as those govern only KVM's
      page table walking; the is_* functions used for page table
      walking always use the CPU role.
      
      ext.valid is not present anymore in the MMU role, but an
      all-zero MMU role is impossible because the level field is
      never zero in the MMU role.  So just zap the whole mmu_role
      in order to force invalidation after CPUID is updated.
      
      While making this change, which requires touching almost every
      occurrence of "mmu_role", rename it to "root_role".
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7a458f0e
    • P
      KVM: x86/mmu: store shadow EFER.NX in the MMU role · 362505de
      Paolo Bonzini 提交于
      Now that the MMU role is separate from the CPU role, it can be a
      truthful description of the format of the shadow pages.  This includes
      whether the shadow pages use the NX bit; so force the efer_nx field
      of the MMU role when TDP is disabled, and remove the hardcoding it in
      the callers of reset_shadow_zero_bits_mask.
      
      In fact, the initialization of reserved SPTE bits can now be made common
      to shadow paging and shadow NPT; move it to shadow_mmu_init_context.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      362505de
    • P
      KVM: x86/mmu: cleanup computation of MMU roles for shadow paging · f417e145
      Paolo Bonzini 提交于
      Pass the already-computed CPU role, instead of redoing it.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f417e145