1. 25 May 2022, 3 commits
    • KVM: x86: avoid calling x86 emulator without a decoded instruction · fee060cd
      Committed by Sean Christopherson
      Whenever x86_decode_emulated_instruction() detects a breakpoint, it
      returns the value that kvm_vcpu_check_breakpoint() writes into its
      pass-by-reference second argument.  Unfortunately this is completely
      bogus because the expected outcome of x86_decode_emulated_instruction
      is an EMULATION_* value.
      
      Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to
      a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK
      and x86_emulate_instruction() is called without having decoded the
      instruction.  This causes various havoc from running with a stale
      emulation context.
      
      The fix is to move the call to kvm_vcpu_check_breakpoint() where it was
      before commit 4aa2691d ("KVM: x86: Factor out x86 instruction
      emulation with decoding") introduced x86_decode_emulated_instruction().
      The other caller of the function does not need breakpoint checks,
      because it is invoked as part of a vmexit and the processor has already
      checked those before executing the instruction that #GP'd.
      
      This fixes CVE-2022-1852.
      Reported-by: Qiuhao Li <qiuhao@sysec.org>
      Reported-by: Gaoning Pan <pgn@zju.edu.cn>
      Reported-by: Yongkang Jia <kangel@zju.edu.cn>
      Fixes: 4aa2691d ("KVM: x86: Factor out x86 instruction emulation with decoding")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311032801.3467418-2-seanjc@google.com>
      [Rewrote commit message according to Qiuhao's report, since a patch
       already existed to fix the bug. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fee060cd
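
      A minimal sketch of the reordered flow described in the commit above.  The
      boolean convention of kvm_vcpu_check_breakpoint() (true when it consumed the
      fault and stored the result in r) and the handle_decode_failure() placeholder
      are assumptions made for illustration; the real function continues inline.

          int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                                      int emulation_type, void *insn, int insn_len)
          {
                  int r;

                  if (!(emulation_type & EMULTYPE_SKIP)) {
                          /*
                           * Check for a code breakpoint *before* decoding, so a
                           * "*r = 0" (KVM_EXIT_DEBUG) result can no longer be
                           * mistaken for EMULATION_OK.
                           */
                          if (kvm_vcpu_check_breakpoint(vcpu, &r))
                                  return r;

                          r = x86_decode_emulated_instruction(vcpu, emulation_type,
                                                              insn, insn_len);
                          if (r != EMULATION_OK)
                                  return handle_decode_failure(vcpu, emulation_type);
                  }

                  /* ... emulate the now fully decoded instruction ... */
                  return 1;
          }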
    • KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak · d22d2474
      Committed by Ashish Kalra
      For some SEV ioctl interfaces, the length parameter that is passed may be
      less than or equal to SEV_FW_BLOB_MAX_SIZE, but larger than the data
      that the PSP firmware returns. In this case, kmalloc will allocate memory
      that is the size of the input rather than the size of the data.
      Since the PSP firmware doesn't fully overwrite the allocated buffer, these
      SEV ioctl interfaces may return uninitialized kernel slab memory.
      Reported-by: Andy Nguyen <theflow@google.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Suggested-by: Peter Gonda <pgonda@google.com>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Fixes: eaf78265 ("KVM: SVM: Move SEV code to separate file")
      Fixes: 2c07ded0 ("KVM: SVM: add support for SEV attestation command")
      Fixes: 4cfdd47d ("KVM: SVM: Add KVM_SEV SEND_START command")
      Fixes: d3d1af85 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
      Fixes: eba04b20 ("KVM: x86: Account a variety of miscellaneous allocations")
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Peter Gonda <pgonda@google.com>
      Message-Id: <20220516154310.3685678-1-Ashish.Kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d22d2474
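
      A sketch of the allocation pattern the fix moves to, assuming a caller that
      has already copied a userspace 'params' struct and bounds-checks its length;
      the surrounding ioctl plumbing is omitted.

          void *blob;

          if (params.len > SEV_FW_BLOB_MAX_SIZE)
                  return -EINVAL;

          /*
           * kzalloc() instead of kmalloc(): the PSP may return fewer bytes than
           * params.len, so any tail it does not overwrite must read back as
           * zeroes rather than stale slab contents when copied to userspace.
           */
          blob = kzalloc(params.len, GFP_KERNEL_ACCOUNT);
          if (!blob)
                  return -ENOMEM;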
    • KVM: LAPIC: Trace LAPIC timer expiration on every vmentry · e0ac5351
      Committed by Wanpeng Li
      In commit ec0671d5 ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire
      tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire
      was moved after guest_exit_irqoff() because invoking tracepoints within
      kvm_guest_enter/kvm_guest_exit caused a lockdep splat.
      
      These days this is not necessary, because commit 87fa7f3e ("x86/kvm:
      Move context tracking where it belongs", 2020-07-09) restricted
      the RCU extended quiescent state to be closer to vmentry/vmexit.
      Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate,
      because it will be reported even if vcpu_enter_guest causes multiple
      vmentries via the IPI/Timer fast paths, and it allows the removal of
      advance_expire_delta.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e0ac5351
  2. 12 May 2022, 13 commits
    • KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmaps · 6ba1e04f
      Committed by Vipin Sharma
      Avoid calling handlers on empty rmap entries and skip to the next
      non-empty rmap entry.
      
      Empty rmap entries are a no-op for the handlers.
      Signed-off-by: Vipin Sharma <vipinsh@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220502220347.174664-1-vipinsh@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ba1e04f
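
      A sketch of the iterator change, with field and helper names assumed from the
      slot_rmap_walk_* iterator; the walk now advances past rmap heads whose value
      is zero instead of handing them to the handler.

          static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
          {
                  while (++iterator->rmap <= iterator->end_rmap) {
                          iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level));

                          /* Stop only on a non-empty rmap head; empty ones are no-ops. */
                          if (iterator->rmap->val)
                                  return;
                  }

                  if (++iterator->level > iterator->end_level) {
                          iterator->rmap = NULL;
                          return;
                  }

                  rmap_walk_init_level(iterator, iterator->level);
          }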
    • KVM: VMX: Include MKTME KeyID bits in shadow_zero_check · 3c5c3245
      Committed by Kai Huang
      Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should
      never be set in SPTEs.  Set shadow_me_value to 0 and shadow_me_mask to
      all MKTME KeyID bits so that they are included in shadow_zero_check.
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3c5c3245
    • KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask · e54f1ff2
      Committed by Kai Huang
      Intel Multi-Key Total Memory Encryption (MKTME) repurposes a couple of
      high physical address bits as 'KeyID' bits.  Intel Trust Domain
      Extensions (TDX) further steals part of the MKTME KeyID bits as TDX
      private KeyID bits.  TDX private KeyID bits cannot be set in any mapping
      in the host kernel since they can only be accessed by software running
      inside a new CPU isolated mode.  And unlike AMD's SME, the host kernel
      doesn't set any legacy MKTME KeyID bits in any mapping either.
      Therefore, it's not legitimate for KVM to set any KeyID bits in SPTEs
      that map guest memory.
      
      KVM maintains shadow_zero_check bits to represent which bits must be
      zero for SPTEs that map guest memory.  MKTME KeyID bits should be set
      in shadow_zero_check.  Currently, shadow_me_mask is used by AMD to set
      the sme_me_mask in SPTEs, and shadow_me_mask is excluded from
      shadow_zero_check.  So initializing shadow_me_mask to represent all
      MKTME KeyID bits doesn't work for VMX (on the contrary, they must be set
      in shadow_zero_check).
      
      Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
      and repurpose shadow_me_mask as 'all possible memory encryption bits'.
      The new schematic of them will be:
      
       - shadow_me_value: the memory encryption bit(s) that will be set to the
         SPTE (the original shadow_me_mask).
       - shadow_me_mask: all possible memory encryption bits (which is a super
         set of shadow_me_value).
       - For now, shadow_me_value is supposed to be set by SVM and VMX
         respectively, and it is a constant during KVM's life time.  This
         perhaps doesn't fit MKTME but for now host kernel doesn't support it
         (and perhaps will never do).
       - Bits in shadow_me_mask are set to shadow_zero_check, except the bits
         in shadow_me_value.
      
      Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
      Replace shadow_me_mask with shadow_me_value in almost all code paths,
      except the one in PT64_PERM_MASK, which is used by need_remote_flush()
      to determine whether remote TLB flush is needed.  This should still use
      shadow_me_mask as any encryption bit change should need a TLB flush.
      And for AMD, move initializing shadow_me_value/shadow_me_mask from
      kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e54f1ff2
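
      A sketch of the new helper and its intended call sites; the SVM/VMX arguments
      shown here (sme_me_mask, mktme_keyid_mask) are illustrative assumptions based
      on the description above.

          u64 __read_mostly shadow_me_value;  /* encryption bit(s) KVM sets in SPTEs */
          u64 __read_mostly shadow_me_mask;   /* all possible encryption bits        */

          void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
          {
                  /* The value actually programmed must be a subset of the mask. */
                  if (WARN_ON(me_value & ~me_mask))
                          me_value = me_mask = 0;

                  shadow_me_value = me_value;
                  shadow_me_mask = me_mask;
          }

          /*
           * Intended usage (sketch):
           *   SVM: kvm_mmu_set_me_spte_mask(sme_me_mask, sme_me_mask);
           *   VMX: kvm_mmu_set_me_spte_mask(0, mktme_keyid_mask);
           * shadow_zero_check then treats (shadow_me_mask & ~shadow_me_value) as
           * bits that must never appear in an SPTE.
           */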
    • KVM: x86/mmu: Rename reset_rsvds_bits_mask() · c919e881
      Committed by Kai Huang
      Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make
      it clearer that it resets the reserved bits check for guest's page table
      entries.
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c919e881
    • KVM: x86/mmu: Expand and clean up page fault stats · 1075d41e
      Committed by Sean Christopherson
      Expand and clean up the page fault stats.  The current stats are at best
      incomplete, and at worst misleading.  Differentiate between faults that
      are actually fixed vs those that result in an MMIO SPTE being created,
      track faults that are spurious, faults that trigger emulation, faults
      that are fixed in the fast path, and last but not least, track the
      number of faults that are taken.
      
      Note, the number of faults that require emulation for write-protected
      shadow pages can roughly be calculated by subtracting the number of MMIO
      SPTEs created from the overall number of faults that trigger emulation.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1075d41e
    • KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults · 8d5265b1
      Committed by Sean Christopherson
      Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast
      path for TDP page faults.  The generated code is identical, and the #ifdef
      makes it dangerously difficult to extend the logic (guess who forgot to
      add an "else" inside the #ifdef and ran through the page fault handler
      twice).
      
      No functional or binary change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8d5265b1
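
      A sketch of the before/after shape of the fast path, assuming the indirect
      page-fault handler call in kvm_mmu_do_page_fault(); variable names are
      illustrative.  The direct call avoids a RETPOLINE-laden indirect call when
      the handler is known to be kvm_tdp_page_fault().

          /* Before: fast path hidden behind an #ifdef. */
          #ifdef CONFIG_RETPOLINE
                  if (fault.is_tdp)
                          return kvm_tdp_page_fault(vcpu, &fault);
          #endif
                  return vcpu->arch.mmu->page_fault(vcpu, &fault);

          /* After: same behavior, but ordinary C that is always visible to the compiler. */
          if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
                  return kvm_tdp_page_fault(vcpu, &fault);

          return vcpu->arch.mmu->page_fault(vcpu, &fault);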
    • KVM: x86/mmu: Make all page fault handlers internal to the MMU · 8a009d5b
      Committed by Sean Christopherson
      Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
      of the page fault handling collateral that was in mmu.h purely for the
      async #PF handler into mmu_internal.h, where it belongs.  This will allow
      kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
      expose those enums outside of the MMU.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8a009d5b
    • KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" · 5276c616
      Committed by Sean Christopherson
      Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
      kvm_faultin_pfn() to signal that the page fault handler should continue
      doing its thing.  Aside from being gross and inefficient, using a boolean
      return to signal continue vs. stop makes it extremely difficult to add
      more helpers and/or move existing code to a helper.
      
      E.g. hypothetically, if nested MMUs were to gain a separate page fault
      handler in the future, everything up to the "is self-modifying PTE" check
      can be shared by all shadow MMUs, but communicating up the stack whether
      to continue on or stop becomes a nightmare.
      
      More concretely, proposed support for private guest memory ran into a
      similar issue, where it'll be forced to forego a helper in order to yield
      sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.
      
      No functional change intended.
      
      Cc: David Matlack <dmatlack@google.com>
      Cc: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5276c616
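
      A sketch of the new return-code scheme; the enumerators follow the
      description above, while the helper body and the caller are reduced to the
      bare pattern that replaces the old bool + int* out-parameter.

          enum {
                  RET_PF_CONTINUE = 0,  /* caller should keep handling the fault */
                  RET_PF_RETRY,         /* let the vCPU retry the faulting access */
                  RET_PF_EMULATE,       /* emulate the faulting instruction */
                  RET_PF_INVALID,       /* the SPTE is invalid, fault must be fixed */
                  RET_PF_FIXED,         /* the fault was fixed */
                  RET_PF_SPURIOUS,      /* the fault was already fixed elsewhere */
          };

          static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
          {
                  /* ... resolve fault->pfn, returning a terminal RET_PF_* on error ... */
                  return RET_PF_CONTINUE;
          }

          /* Caller: */
                  r = kvm_faultin_pfn(vcpu, fault);
                  if (r != RET_PF_CONTINUE)
                          return r;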
    • KVM: x86/mmu: Drop exec/NX check from "page fault can be fast" · 5c64aba5
      Committed by Sean Christopherson
      Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
      faults in the access tracking case, and drop the exec/NX check that
      becomes redundant as a result.  No sane hardware will generate an access
      that is both an instruction fetch and a write, i.e. it's a waste of cycles.
      If hardware goes off the rails, or KVM runs under a misguided hypervisor,
      spuriously running through the fast path is benign (KVM has been unknowingly
      doing exactly that for years).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5c64aba5
    • KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use · 54275f74
      Committed by Sean Christopherson
      Check for A/D bits being disabled instead of the access tracking mask
      being non-zero when deciding whether or not to attempt to fix a page
      fault via the fast path.  Originally, the access tracking mask was
      non-zero if and only if A/D bits were disabled by _KVM_ (including not
      being supported by hardware), but that hasn't been true since nVMX was
      fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
      KVM to not use A/D bits while running L2 despite KVM using them while
      running L1.
      
      In other words, don't attempt the fast path just because EPT is enabled.
      
      Note, attempting the fast path for all !PRESENT faults can "fix" a very,
      _VERY_ tiny percentage of faults out of mmu_lock by detecting that the
      fault is spurious, i.e. has been fixed by a different vCPU, but again the
      odds of that happening are vanishingly small.  E.g. booting an 8-vCPU VM
      gets less than 10 successes out of 30k+ faults, and that's likely one of
      the more favorable scenarios.  Disabling dirty logging can likely lead to
      a rash of collisions between vCPUs for some workloads that operate on a
      common set of pages, but penalizing _all_ !PRESENT faults for that one
      case is unlikely to be a net positive, not to mention that that problem
      is best solved by not zapping in the first place.
      
      The number of spurious faults does scale with the number of vCPUs, e.g. a
      255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
      path (again out of 30k), but that's all of 0.2% of faults.  Using legacy
      shadow paging does get more spurious faults, and a few more detected out
      of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
      faults that are reflected into the guest), i.e. the extra detections are
      purely due to the sheer number of faults observed.
      
      On the other hand, getting a "negative" in the fast path takes in the
      neighborhood of 150-250 cycles.  So while it is tempting to keep/extend
      the current behavior, such a change needs to come with hard numbers
      showing that it's actually a win in the grand scheme, or any scheme for
      that matter.
      
      Fixes: 995f00a6 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54275f74
    • KVM: VMX: clean up pi_wakeup_handler · 91ab933f
      Committed by Li RongQing
      Passing per_cpu() to list_for_each_entry() causes the macro to be
      evaluated N+1 times for N sleeping vCPUs.  This is a very small
      inefficiency, and the code is cleaner if the address of the per-CPU
      variable is loaded earlier.  Do this for both the list and the spinlock.
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Message-Id: <1649244302-6777-1-git-send-email-lirongqing@baidu.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      91ab933f
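
      A sketch of the cleaned-up handler, with per-CPU variable names assumed from
      the posted-interrupt wakeup code; the per-CPU addresses are computed once
      instead of re-evaluating per_cpu() on every iteration of the list walk.

          void pi_wakeup_handler(void)
          {
                  int cpu = smp_processor_id();
                  struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
                  raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
                  struct vcpu_vmx *vmx;

                  raw_spin_lock(spinlock);
                  list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
                          if (pi_test_on(&vmx->pi_desc))
                                  kvm_vcpu_wake_up(&vmx->vcpu);
                  }
                  raw_spin_unlock(spinlock);
          }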
    • KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicness · 33fbe6be
      Committed by Maxim Levitsky
      This shows up as a TDP MMU leak when running nested.  A non-working cmpxchg on
      L0 makes L1 install two different shadow pages under the same spte, and one of
      them is leaked.
      
      Fixes: 1c2361f6 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220512101420.306759-1-mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      33fbe6be
  3. 07 May 2022, 2 commits
    • KVM: VMX: Exit to userspace if vCPU has injected exception and invalid state · 053d2290
      Committed by Sean Christopherson
      Exit to userspace with an emulation error if KVM encounters an injected
      exception with invalid guest state, in addition to the existing check of
      bailing if there's a pending exception (KVM doesn't support emulating
      exceptions except when emulating real mode via vm86).
      
      In theory, KVM should never get to such a situation as KVM is supposed to
      exit to userspace before injecting an exception with invalid guest state.
      But in practice, userspace can intervene and manually inject an exception
      and/or stuff registers to force invalid guest state while a previously
      injected exception is awaiting reinjection.
      
      Fixes: fc4fad79 ("KVM: VMX: Reject KVM_RUN if emulation is required with pending exception")
      Reported-by: syzbot+cfafed3bb76d3e37581b@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220502221850.131873-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      053d2290
    • KVM: SEV: Mark nested locking of vcpu->lock · 0c2c7c06
      Committed by Peter Gonda
      svm_vm_migrate_from() uses sev_lock_vcpus_for_migration() to lock all
      source and target vcpu->locks. Unfortunately there is an 8-subclass
      limit, so a new subclass cannot be used for each vCPU. Instead, maintain
      ownership of the first vcpu's mutex.dep_map using a role-specific
      subclass: source vs target. Release the other vcpus' mutex.dep_maps.
      
      Fixes: b5663931 ("KVM: SEV: Add support for SEV intra host migration")
      Reported-by: John Sperbeck <jsperbeck@google.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Peter Gonda <pgonda@google.com>
      
      Message-Id: <20220502165807.529624-1-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0c2c7c06
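
      A sketch of the locking scheme, assuming a small role enum and the lockdep
      helpers named in the commit; only the first vCPU keeps its dep_map (switched
      to a non-colliding subclass), and the others release theirs after locking.

          enum sev_migration_role {
                  SEV_MIGRATION_SOURCE = 0,
                  SEV_MIGRATION_TARGET,
                  SEV_NR_MIGRATION_ROLES,
          };

          static int sev_lock_vcpus_for_migration(struct kvm *kvm,
                                                  enum sev_migration_role role)
          {
                  struct kvm_vcpu *vcpu;
                  unsigned long i;

                  kvm_for_each_vcpu(i, vcpu, kvm) {
                          if (mutex_lock_killable_nested(&vcpu->mutex, role))
                                  goto out_unlock;

          #ifdef CONFIG_PROVE_LOCKING
                          if (!i)
                                  /* Keep the first vcpu's dep_map under a spare subclass. */
                                  role = SEV_NR_MIGRATION_ROLES;
                          else
                                  mutex_release(&vcpu->mutex.dep_map, _THIS_IP_);
          #endif
                  }
                  return 0;

          out_unlock:
                  /* ... unlock the vCPUs locked so far (omitted) ... */
                  return -EINTR;
          }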
  4. 03 May 2022, 5 commits
    • kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU · 5a1bde46
      Committed by Sandipan Das
      On some x86 processors, CPUID leaf 0xA provides information
      on Architectural Performance Monitoring features. It
      advertises a PMU version which Qemu uses to determine the
      availability of additional MSRs to manage the PMCs.
      
      Upon receiving a KVM_GET_SUPPORTED_CPUID ioctl request for
      the same, the kernel constructs return values based on the
      x86_pmu_capability irrespective of the vendor.
      
      This leaf and the additional MSRs are not supported on AMD
      and Hygon processors. If AMD PerfMonV2 is detected, the PMU
      version is set to 2 and guest startup breaks because of an
      attempt to access a non-existent MSR. Return zeros to avoid
      this.
      
      Fixes: a6c06ed1 ("KVM: Expose the architectural performance monitoring CPUID leaf")
      Reported-by: Vasant Hegde <vasant.hegde@amd.com>
      Signed-off-by: Sandipan Das <sandipan.das@amd.com>
      Message-Id: <3fef83d9c2b2f7516e8ff50d60851f29a4bcb716.1651058600.git.sandipan.das@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5a1bde46
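
      A sketch of the guard in the KVM_GET_SUPPORTED_CPUID path; the host feature
      check is the part described above, and the rest of the 0xA handling is
      elided.

          case 0xa: { /* Architectural Performance Monitoring */
                  /* No architectural PMU (e.g. AMD/Hygon): report an all-zero leaf. */
                  if (!static_cpu_has(X86_FEATURE_ARCH_PERFMON)) {
                          entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
                          break;
                  }

                  /* ... otherwise fill the leaf from x86_pmu_capability ... */
                  break;
          }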
    • KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id · 5eb84932
      Committed by Kyle Huey
      Zen renumbered some of the performance counters that correspond to the
      well-known events in perf_hw_id. This code in KVM was never updated for
      that, so guests that attempt to use counters on Zen that correspond to the
      pre-Zen perf_hw_id values will silently receive the wrong values.
      
      This has been observed in the wild with rr[0] when running in Zen 3
      guests. rr uses the retired conditional branch counter 00d1 which is
      incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND.
      
      [0] https://rr-project.org/
      Signed-off-by: Kyle Huey <me@kylehuey.com>
      Message-Id: <20220503050136.86298-1-khuey@kylehuey.com>
      Cc: stable@vger.kernel.org
      [Check guest family, not host. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5eb84932
    • KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits · ba3a6120
      Committed by Sean Christopherson
      Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
      if mmu_lock is held for write, as volatile SPTEs can be written by other
      tasks/vCPUs outside of mmu_lock.  If a vCPU uses the to-be-modified SPTE
      to write a page, the CPU can cache the translation as WRITABLE in the TLB
      despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
      Accessed/Dirty bits and not properly tag the backing page.
      
      Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
      non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
      KVM doesn't consume the Accessed bit of non-leaf SPTEs.
      
      Dropping the Dirty and/or Writable bits is most problematic for dirty
      logging, as doing so can result in a missed TLB flush and eventually a
      missed dirty page.  In the unlikely event that the only dirty page(s) is
      a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
      (based on the Dirty or Writable bit depending on the method) and so not
      update the SPTE and ultimately not flush.  If the SPTE is cached in the
      TLB as writable before it is clobbered, the guest can continue writing
      the associated page without ever taking a write-protect fault.
      
      For most (all?) file-backed memory, dropping the Dirty bit is a non-issue.
      The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
      bit is effectively ignored because the primary MMU will mark that page
      dirty when the write-protection is lifted, e.g. when KVM faults the page
      back in for write.
      
      The Accessed bit is a complete non-issue.  Aside from being unused for
      non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
      Accessed bit may be dropped anyways.
      
      Lastly, the Writable bit is also problematic as an extension of the Dirty
      bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
      !DIRTY && WRITABLE.  If KVM fixes an MMU-writable, but !WRITABLE, SPTE
      out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
      the SPTE being !WRITABLE when it is checked by KVM.  But that all depends
      on the Dirty bit being problematic in the first place.
      
      Fixes: 2f2fad08 ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Venkatesh Srinivas <venkateshs@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ba3a6120
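
      A sketch of the write helpers implied by the change, with names assumed from
      the TDP MMU iterator code; volatile, shadow-present leaf SPTEs are swapped
      atomically so that racing hardware/vCPU updates to the Accessed/Dirty/Writable
      bits are folded into the returned old value instead of being lost.

          static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
          {
                  return xchg(rcu_dereference(sptep), new_spte);
          }

          static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
          {
                  WRITE_ONCE(*rcu_dereference(sptep), new_spte);
          }

          static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
                                                   u64 new_spte, int level)
          {
                  /*
                   * Non-leaf and non-volatile SPTEs can be written non-atomically:
                   * nothing modifies them outside of mmu_lock.
                   */
                  if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) &&
                      spte_has_volatile_bits(old_spte))
                          return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte);

                  __kvm_tdp_mmu_write_spte(sptep, new_spte);
                  return old_spte;
          }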
    • KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits() · 54eb3ef5
      Committed by Sean Christopherson
      Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
      and into its callers.  Well, caller, since only one of its two callers
      doesn't already do the shadow-present check.
      
      Opportunistically move the helper to spte.c/h so that it can be used by
      the TDP MMU, which is also the primary motivation for the shadow-present
      change.  Unlike the legacy MMU, the TDP MMU uses a single path for clearing
      leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
      MMU will need to check is_last_spte() prior to calling
      spte_has_volatile_bits(), and calling is_last_spte() without first
      calling is_shadow_present_pte() is at best odd, and at worst a violation
      of KVM's loosely defined SPTE rules.
      
      Note, mmu_spte_clear_track_bits() could likely skip the write entirely
      for SPTEs that are not shadow-present.  Leave that cleanup for a future
      patch to avoid introducing a functional change, and because the
      shadow-present check can likely be moved further up the stack, e.g.
      drop_large_spte() appears to be the only path that doesn't already
      explicitly check for a shadow-present SPTE.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54eb3ef5
    • KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D) · 706c9c55
      Committed by Sean Christopherson
      Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
      being volatile (unless they're volatile for other reasons, e.g. A/D bits).
      KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
      out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
      cleared just because the SPTE is MMU-writable.
      
      Rename the wrapper of MMU-writable to be more literal; the previous name,
      spte_can_locklessly_be_made_writable(), is wrong and misleading.
      
      Fixes: c7ba5b48 ("KVM: MMU: fast path of handling guest page fault")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      706c9c55
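
      A sketch of the key check in spte_has_volatile_bits() after the change; the
      renamed MMU-writable wrapper is shown under an assumed name.

          /*
           * Only a SPTE that is MMU-writable but not yet writable in hardware can
           * gain the Writable (and hence Dirty) bit behind KVM's back: KVM sets
           * the Writable bit outside of mmu_lock but only clears it while holding
           * mmu_lock, so an already-writable SPTE is not volatile on that account.
           */
          if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
                  return true;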
  5. 02 May 2022, 2 commits
  6. 30 April 2022, 15 commits
    • KVM: x86: work around QEMU issue with synthetic CPUID leaves · f751d8ea
      Committed by Paolo Bonzini
      Synthesizing AMD leaves up to 0x80000021 caused problems with QEMU,
      which assumes the *host* CPUID[0x80000000].EAX is higher or equal
      to what KVM_GET_SUPPORTED_CPUID reports.
      
      This causes QEMU to issue bogus host CPUIDs when preparing the input
      to KVM_SET_CPUID2.  It can even get into an infinite loop, which is
      only terminated by an abort():
      
         cpuid_data is full, no space for cpuid(eax:0x8000001d,ecx:0x3e)
      
      To work around this, only synthesize those leaves if 0x8000001d exists
      on the host.  The synthetic 0x80000021 leaf is mostly useful on Zen2,
      which satisfies the condition.
      
      Fixes: f144c49e ("KVM: x86: synthesize CPUID leaf 0x80000021h if useful")
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f751d8ea
    • KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guest · 84e5ffd0
      Committed by Lai Jiangshan
      When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is
      allocated with role.level = 5 and the guest pagetable's root gfn.
      
      And root_sp->spt[0] is also allocated with the same gfn and the same
      role except role.level = 4.  Luckily that they are different shadow
      pages, but only root_sp->spt[0] is the real translation of the guest
      pagetable.
      
      Here comes a problem:
      
      If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice versa)
      and uses the same gfn as the root page for nested NPT before and after
      switching gCR4_LA57, the host (hCR4_LA57=1) might use the same root_sp
      for the guest even though the guest has switched gCR4_LA57.  The guest
      will see unexpected pages mapped, and L2 may exploit the bug and hurt
      L1.  It is lucky that the problem can't hurt L0.
      
      And three special cases need to be handled:
      
      The root_sp should be like role.direct=1 sometimes: its contents are
      not backed by gptes, root_sp->gfns is meaningless.  (For a normal high
      level sp in shadow paging, sp->gfns is often unused and kept zero, but
      it could be relevant and meaningful if sp->gfns is used because they
      are backed by concrete gptes.)
      
      For such root_sp in the case, root_sp is just a portal to contribute
      root_sp->spt[0], and root_sp->gfns should not be used and
      root_sp->spt[0] should not be dropped if gpte[0] of the guest root
      pagetable is changed.
      
      Such root_sp should not be accounted too.
      
      So add role.passthrough to distinguish the shadow pages in the hash
      when gCR4_LA57 is toggled and fix above special cases by using it in
      kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      84e5ffd0
    • KVM: X86/MMU: Add sp_has_gptes() · 767d8d8d
      Committed by Lai Jiangshan
      Add sp_has_gptes(), which is currently equal to !sp->role.direct.
      
      Shadow pages that have gptes need to be write-protected, accounted, and
      responded to by kvm_mmu_pte_write().
      
      Use it in these places to replace !sp->role.direct and rename
      for_each_gfn_indirect_valid_sp.
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220420131204.2850-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      767d8d8d
    • KVM: SVM: Introduce trace point for the slow-path of avic_kic_target_vcpus · 9f084f7c
      Committed by Suravee Suthikulpanit
      This can help identify potential performance issues when handling
      AVIC incomplete IPI #vmexits due to the target vCPU not running.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220420154954.19305-3-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9f084f7c
    • KVM: SVM: Use target APIC ID to complete AVIC IRQs when possible · 7223fd2d
      Committed by Suravee Suthikulpanit
      Currently, an AVIC-enabled VM suffers from performance bottleneck
      when scaling to large number of vCPUs for I/O intensive workloads.
      
      In such cases, a vCPU often executes the HLT instruction to get into an idle
      state waiting for interrupts, at which point KVM de-schedules the vCPU from
      the physical CPU.
      
      When AVIC HW tries to deliver interrupt to the halting vCPU, it would
      result in AVIC incomplete IPI #vmexit to notify KVM to reschedule
      the target vCPU into running state.
      
      Investigation has shown the main hotspot is in the kvm_apic_match_dest()
      in the following call stack where it tries to find target vCPUs
      corresponding to the information in the ICRH/ICRL registers.
      
        - handle_exit
          - svm_invoke_exit_handler
            - avic_incomplete_ipi_interception
              - kvm_apic_match_dest
      
      However, AVIC provides hints in the #vmexit info, which can be used to
      retrieve the destination guest physical APIC ID.
      
      In addition, since QEMU defines guest physical APIC ID to be the same as
      vCPU ID, it can be used to quickly identify the target vCPU to deliver IPI,
      and avoid the overhead from searching through all vCPUs to match the target
      vCPU.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220420154954.19305-2-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7223fd2d
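
      A heavily simplified sketch of the fast path; the helper name, the xAPIC-only
      destination decoding, and the fallback conditions are assumptions, and the
      real code also handles logical destination mode rather than bailing out.

          static int avic_kick_target_vcpus_fast(struct kvm *kvm, u32 icrl, u32 icrh)
          {
                  u32 dest = icrh >> 24;  /* xAPIC physical destination field */
                  struct kvm_vcpu *vcpu;

                  /* Logical destinations and broadcast fall back to the slow path. */
                  if ((icrl & APIC_DEST_LOGICAL) || dest == 0xff)
                          return -EINVAL;

                  /* QEMU defines guest physical APIC ID == vCPU ID, so look it up directly. */
                  vcpu = kvm_get_vcpu_by_id(kvm, dest);
                  if (!vcpu)
                          return -EINVAL;

                  /* Wake the de-scheduled vCPU so it picks up the pending interrupt. */
                  kvm_vcpu_wake_up(vcpu);
                  return 0;
          }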
    • KVM: x86/mmu: replace direct_map with root_role.direct · 347a0d0d
      Committed by Paolo Bonzini
      direct_map is always equal to the direct field of the root page's role:
      
      - for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is
      copied from cpu_role.base.direct
      
      - for TDP, it is always true and root_role.direct is also always true
      
      - for shadow TDP, it is always false and root_role.direct is also always
      false
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      347a0d0d
    • KVM: x86/mmu: replace root_level with cpu_role.base.level · 4d25502a
      Committed by Paolo Bonzini
      Remove another duplicate field of struct kvm_mmu.  This time it's
      the root level for page table walking; the separate field is
      always initialized as cpu_role.base.level, so its users can look
      up the CPU mode directly instead.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4d25502a
    • KVM: x86/mmu: replace shadow_root_level with root_role.level · a972e29c
      Committed by Paolo Bonzini
      root_role.level is always the same value as shadow_root_level:
      
      - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
      
      - it's the level argument when going through kvm_init_shadow_ept_mmu
      
      - it's assigned directly from new_role.base.level when going
        through shadow_mmu_init_context
      
      Remove the duplication and get the level directly from the role.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a972e29c
    • KVM: x86/mmu: pull CPU mode computation to kvm_init_mmu · a7f1de9b
      Committed by Paolo Bonzini
      Do not lead init_kvm_*mmu into the temptation of poking
      into struct kvm_mmu_role_regs, by passing to it directly
      the CPU mode.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a7f1de9b
    • KVM: x86/mmu: simplify and/or inline computation of shadow MMU roles · 56b321f9
      Committed by Paolo Bonzini
      Shadow MMUs compute their role from cpu_role.base, simply by adjusting
      the root level.  It's one line of code, so do not place it in a separate
      function.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      56b321f9
    • KVM: x86/mmu: remove redundant bits from extended role · faf72962
      Committed by Paolo Bonzini
      Before the separation of the CPU and the MMU role, CR0.PG was not
      available in the base MMU role, because two-dimensional paging always
      used direct=1 in the MMU role.  However, now that the raw role is
      snapshotted in mmu->cpu_role, the value of CR0.PG always matches both
      !cpu_role.base.direct and cpu_role.base.level > 0.  There is no need to
      store it again in union kvm_mmu_extended_role; instead, write an is_cr0_pg
      accessor by hand that takes care of the conversion.  Use cpu_role.base.level
      since the future of the direct field is unclear.
      
      Likewise, CR4.PAE is now always present in the CPU role as
      !cpu_role.base.has_4_byte_gpte.  The inversion makes certain tests on
      the MMU role easier, and is easily hidden by the is_cr4_pae accessor
      when operating on the CPU role.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      faf72962
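
      A sketch of the hand-written accessors described above; they derive CR0.PG
      and CR4.PAE from fields already present in the CPU role instead of storing
      extra copies in union kvm_mmu_extended_role.

          static inline bool is_cr0_pg(struct kvm_mmu *mmu)
          {
                  return mmu->cpu_role.base.level > 0;
          }

          static inline bool is_cr4_pae(struct kvm_mmu *mmu)
          {
                  return !mmu->cpu_role.base.has_4_byte_gpte;
          }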
    • KVM: x86/mmu: rename kvm_mmu_role union · 7a7ae829
      Committed by Paolo Bonzini
      It is quite confusing that the "full" union is called kvm_mmu_role
      but is used for the "cpu_role" field of struct kvm_mmu.  Rename it
      to kvm_cpu_role.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7a7ae829
    • KVM: x86/mmu: remove extended bits from mmu_role, rename field · 7a458f0e
      Committed by Paolo Bonzini
      mmu_role represents the role of the root of the page tables.
      It does not need any extended bits, as those govern only KVM's
      page table walking; the is_* functions used for page table
      walking always use the CPU role.
      
      ext.valid is not present anymore in the MMU role, but an
      all-zero MMU role is impossible because the level field is
      never zero in the MMU role.  So just zap the whole mmu_role
      in order to force invalidation after CPUID is updated.
      
      While making this change, which requires touching almost every
      occurrence of "mmu_role", rename it to "root_role".
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7a458f0e
    • KVM: x86/mmu: store shadow EFER.NX in the MMU role · 362505de
      Committed by Paolo Bonzini
      Now that the MMU role is separate from the CPU role, it can be a
      truthful description of the format of the shadow pages.  This includes
      whether the shadow pages use the NX bit; so force the efer_nx field
      of the MMU role when TDP is disabled, and remove the hardcoding of it in
      the callers of reset_shadow_zero_bits_mask.
      
      In fact, the initialization of reserved SPTE bits can now be made common
      to shadow paging and shadow NPT; move it to shadow_mmu_init_context.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      362505de
    • KVM: x86/mmu: cleanup computation of MMU roles for shadow paging · f417e145
      Committed by Paolo Bonzini
      Pass the already-computed CPU role, instead of redoing it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f417e145