提交 · 2ca3129e8045b18eb15431b485d40c028fe8fb00 · openeuler / Kernel

20 6月, 2022 14 次提交

KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs · 2ca3129e

由 Sean Christopherson 提交于 6月 14, 2022

Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs
(PT64).  SPTE and PT64 are _mostly_ the same, but the few differences are
quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and
guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is
very much specific to SPTEs.

Opportunistically (and temporarily) move most guest macros into paging.h
to clearly associate them with shadow paging, and to ensure that they're
not used as of this commit.  A future patch will eliminate them entirely.

Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's
needed for the quadrant calculation in kvm_mmu_get_page().  The quadrant
calculation is hot enough (when using shadow paging with 32-bit guests)
that adding a per-context helper is undesirable, and burying the
computation in paging_tmpl.h with a forward declaration isn't exactly an
improvement.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-6-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2ca3129e

KVM: x86/mmu: Dedup macros for computing various page table masks · 42c88ff8

由 Sean Christopherson 提交于 6月 14, 2022

Provide common helper macros to generate various masks, shifts, etc...
for 32-bit vs. 64-bit page tables.  Only the inputs differ, the actual
calculations are identical.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

42c88ff8

KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h · b3fcdb04

由 Sean Christopherson 提交于 6月 14, 2022

Move a handful of one-off macros and helpers for 32-bit PSE paging into
paging_tmpl.h and hide them behind "PTTYPE == 32".  Under no circumstance
should anything but 32-bit shadow paging care about PSE paging.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b3fcdb04

KVM: VMX: Refactor 32-bit PSE PT creation to avoid using MMU macro · 1ae20e0b

由 Sean Christopherson 提交于 6月 14, 2022

Compute the number of PTEs to be filled for the 32-bit PSE page tables
using the page size and the size of each entry.  While using the MMU's
PT32_ENT_PER_PAGE macro is arguably better in isolation, removing VMX's
usage will allow a future namespacing cleanup to move the guest page
table macros into paging_tmpl.h, out of the reach of code that isn't
directly related to shadow paging.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1ae20e0b

KVM: x86: Use lapic_in_kernel() to query in-kernel APIC in APICv helper · b8e1b962

由 Sean Christopherson 提交于 6月 14, 2022

Use lapic_in_kernel() in kvm_vcpu_apicv_active() to take advantage of the
kvm_has_noapic_vcpu static branch.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-6-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b8e1b962

KVM: x86: Move "apicv_active" into "struct kvm_lapic" · ce0a58f4

由 Sean Christopherson 提交于 6月 14, 2022

Move the per-vCPU apicv_active flag into KVM's local APIC instance.
APICv is fully dependent on an in-kernel local APIC, but that's not at
all clear when reading the current code due to the flag being stored in
the generic kvm_vcpu_arch struct.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ce0a58f4

KVM: x86: Check for in-kernel xAPIC when querying APICv for directed yield · ae801e13

由 Sean Christopherson 提交于 6月 14, 2022

Use kvm_vcpu_apicv_active() to check if APICv is active when seeing if a
vCPU is a candidate for directed yield due to a pending ACPIv interrupt.
This will allow moving apicv_active into kvm_lapic without introducing a
potential NULL pointer deref (kvm_vcpu_apicv_active() effectively adds a
pre-check on the vCPU having an in-kernel APIC).

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ae801e13

KVM: x86: Drop @vcpu parameter from kvm_x86_ops.hwapic_isr_update() · d39850f5

由 Sean Christopherson 提交于 6月 14, 2022

Drop the unused @vcpu parameter from hwapic_isr_update().  AMD/AVIC is
unlikely to implement the helper, and VMX/APICv doesn't need the vCPU as
it operates on the current VMCS.  The result is somewhat odd, but allows
for a decent amount of (future) cleanup in the APIC code.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d39850f5

KVM: SVM: Drop unused AVIC / kvm_x86_ops declarations · ec1d7e6a

由 Sean Christopherson 提交于 6月 14, 2022

Drop a handful of unused AVIC function declarations whose implementations
were removed during the conversion to optional static calls.

No functional change intended.

Fixes: abb6d479 ("KVM: x86: make several APIC virtualization callbacks optional")
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ec1d7e6a

KVM: nVMX: Update vmcs12 on BNDCFGS write, not at vmcs02=>vmcs12 sync · 913d6c9b

由 Sean Christopherson 提交于 6月 14, 2022

Update vmcs12->guest_bndcfgs on intercepted writes to BNDCFGS from L2
instead of waiting until vmcs02 is synchronized to vmcs12.  KVM always
intercepts BNDCFGS accesses, so the only way the value in vmcs02 can
change is via KVM's explicit VMWRITE during emulation.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-6-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

913d6c9b

KVM: nVMX: Save BNDCFGS to vmcs12 iff relevant controls are exposed to L1 · 308a4fff

由 Sean Christopherson 提交于 6月 14, 2022

Save BNDCFGS to vmcs12 (from vmcs02) if and only if at least of one of
the load-on-entry or clear-on-exit fields for BNDCFGS is enumerated as an
allowed-1 bit in vmcs12.  Skipping the field avoids an unnecessary VMREAD
when MPX is supported but not exposed to L1.

Per Intel's SDM:

  If the processor supports either the 1-setting of the "load IA32_BNDCFGS"
  VM-entry control or that of the "clear IA32_BNDCFGS" VM-exit control, the
  contents of the IA32_BNDCFGS MSR are saved into the corresponding field.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

308a4fff

KVM: nVMX: Rename nested.vmcs01_* fields to nested.pre_vmenter_* · 5d76b1f8

由 Sean Christopherson 提交于 6月 14, 2022

Rename the fields in struct nested_vmx used to snapshot pre-VM-Enter
values to reflect that they can hold L2's values when restoring nested
state, e.g. if userspace restores MSRs before nested state. As crazy as
it seems, restoring MSRs before nested state actually works (because KVM
goes out if it's way to make it work), even though the initial MSR writes
will hit vmcs01 despite holding L2 values.

Add a related comment to vmx_enter_smm() to call out that using the
common VM-Exit and VM-Enter helpers to emulate SMI and RSM is wrong and
broken. The few MSRs that have snapshots _could_ be fixed by taking a
snapshot prior to the forced VM-Exit instead of at forced VM-Enter, but
that's just the tip of the iceberg as the rather long list of MSRs that
aren't snapshotted (hello, VM-Exit MSR load list) can't be handled this
way.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5d76b1f8

KVM: nVMX: Snapshot pre-VM-Enter DEBUGCTL for !nested_run_pending case · 764643a6

由 Sean Christopherson 提交于 6月 14, 2022

If a nested run isn't pending, snapshot vmcs01.GUEST_IA32_DEBUGCTL
irrespective of whether or not VM_ENTRY_LOAD_DEBUG_CONTROLS is set in
vmcs12. When restoring nested state, e.g. after migration, without a
nested run pending, prepare_vmcs02() will propagate
nested.vmcs01_debugctl to vmcs02, i.e. will load garbage/zeros into
vmcs02.GUEST_IA32_DEBUGCTL.

If userspace restores nested state before MSRs, then loading garbage is a
non-issue as loading DEBUGCTL will also update vmcs02. But if usersepace
restores MSRs first, then KVM is responsible for propagating L2's value,
which is actually thrown into vmcs01, into vmcs02.

Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
is all kinds of bizarre and ideally would not be supported. Sadly, some
VMMs do exactly that and rely on KVM to make things work.

Note, there's still a lurking SMM bug, as propagating vmcs01's DEBUGCTL
to vmcs02 across RSM may corrupt L2's DEBUGCTL. But KVM's entire VMX+SMM
emulation is flawed as SMI+RSM should not toouch _any_ VMCS when use the
"default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.

Link: https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com
Fixes: 8fcc4b59 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
Cc: stable@vger.kernel.org
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

764643a6

KVM: nVMX: Snapshot pre-VM-Enter BNDCFGS for !nested_run_pending case · fa578398

由 Sean Christopherson 提交于 6月 14, 2022

If a nested run isn't pending, snapshot vmcs01.GUEST_BNDCFGS irrespective
of whether or not VM_ENTRY_LOAD_BNDCFGS is set in vmcs12. When restoring
nested state, e.g. after migration, without a nested run pending,
prepare_vmcs02() will propagate nested.vmcs01_guest_bndcfgs to vmcs02,
i.e. will load garbage/zeros into vmcs02.GUEST_BNDCFGS.

If userspace restores nested state before MSRs, then loading garbage is a
non-issue as loading BNDCFGS will also update vmcs02. But if usersepace
restores MSRs first, then KVM is responsible for propagating L2's value,
which is actually thrown into vmcs01, into vmcs02.

Note, there's still a lurking SMM bug, as propagating vmcs01.GUEST_BNDFGS
to vmcs02 across RSM may corrupt L2's BNDCFGS. But KVM's entire VMX+SMM
emulation is flawed as SMI+RSM should not toouch _any_ VMCS when use the
"default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.

Link: https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com
Fixes: 62cf9bd8 ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS")
Cc: stable@vger.kernel.org
Cc: Lei Wang <lei4.wang@intel.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fa578398

15 6月, 2022 10 次提交

KVM: x86/mmu: Use try_cmpxchg64 in fast_pf_fix_direct_spte · 2db2f46f

由 Uros Bizjak 提交于 5月 20, 2022

Use try_cmpxchg64 instead of cmpxchg64 (*ptr, old, new) != old in
fast_pf_fix_direct_spte. cmpxchg returns success in ZF flag, so this
change saves a compare after cmpxchg (and related move instruction
in front of cmpxchg).
Signed-off-by: NUros Bizjak <ubizjak@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Message-Id: <20220520144635.63134-1-ubizjak@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2db2f46f

KVM: VMX: Use try_cmpxchg64 in pi_try_set_control · 0ac304de

由 Uros Bizjak 提交于 5月 20, 2022

Use try_cmpxchg64 instead of cmpxchg64 (*ptr, old, new) != old
in pi_try_set_control.  cmpxchg returns success in ZF flag, so this
change saves a compare after cmpxchg (and related move instruction
in front of cmpxchg):

  b9:   88 44 24 60             mov    %al,0x60(%rsp)
  bd:   48 89 c8                mov    %rcx,%rax
  c0:   c6 44 24 62 f2          movb   $0xf2,0x62(%rsp)
  c5:   48 8b 74 24 60          mov    0x60(%rsp),%rsi
  ca:   f0 49 0f b1 34 24       lock cmpxchg %rsi,(%r12)
  d0:   48 39 c1                cmp    %rax,%rcx
  d3:   75 cf                   jne    a4 <vmx_vcpu_pi_load+0xa4>

patched:

  c1:   88 54 24 60             mov    %dl,0x60(%rsp)
  c5:   c6 44 24 62 f2          movb   $0xf2,0x62(%rsp)
  ca:   48 8b 54 24 60          mov    0x60(%rsp),%rdx
  cf:   f0 48 0f b1 13          lock cmpxchg %rdx,(%rbx)
  d4:   75 d5                   jne    ab <vmx_vcpu_pi_load+0xab>
Signed-off-by: NUros Bizjak <ubizjak@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Reported-by: Nkernel test robot <lkp@intel.com>
Message-Id: <20220520143737.62513-1-ubizjak@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0ac304de

KVM: x86/mmu: Use try_cmpxchg64 in tdp_mmu_set_spte_atomic · aee98a68

由 Uros Bizjak 提交于 5月 18, 2022

Use try_cmpxchg64 instead of cmpxchg64 (*ptr, old, new) != old in
tdp_mmu_set_spte_atomic.  cmpxchg returns success in ZF flag, so this
change saves a compare after cmpxchg (and related move instruction
in front of cmpxchg). Also, remove explicit assignment to iter->old_spte
when cmpxchg fails, this is what try_cmpxchg does implicitly.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: NUros Bizjak <ubizjak@gmail.com>
Reviewed-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220518135111.3535-1-ubizjak@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

aee98a68

KVM: VMX: Skip filter updates for MSRs that KVM is already intercepting · d895f28e

由 Sean Christopherson 提交于 6月 10, 2022

When handling userspace MSR filter updates, recompute interception for
possible passthrough MSRs if and only if KVM wants to disabled
interception.  If KVM wants to intercept accesses, i.e. the associated
bit is set in vmx->shadow_msr_intercept, then there's no need to set the
intercept again as KVM will intercept the MSR regardless of userspace's
wants.

No functional change intended, the call to vmx_enable_intercept_for_msr()
really is just a gigantic nop.
Suggested-by: NAaron Lewis <aaronlewis@google.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220610214140.612025-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d895f28e

KVM: x86/mmu: Drop unused CMPXCHG macro from paging_tmpl.h · 007a369f

由 Sean Christopherson 提交于 6月 13, 2022

Drop the CMPXCHG macro from paging_tmpl.h, it's no longer used now that
KVM uses a common uaccess helper to do 8-byte CMPXCHG.

Fixes: f122dfe4 ("KVM: x86: Use __try_cmpxchg_user() to update guest PTE A/D bits")
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220613225723.2734132-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

007a369f

KVM: X86/SVM: Use root_level in svm_load_mmu_pgd() · 78c7d900

由 Lai Jiangshan 提交于 6月 05, 2022

Use root_level in svm_load_mmu_pg() rather that looking up the root
level in vcpu->arch.mmu->root_role.level. svm_load_mmu_pgd() has only
one caller, kvm_mmu_load_pgd(), which always passes
vcpu->arch.mmu->root_role.level as root_level.
Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220605063417.308311-7-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

78c7d900

KVM: X86/MMU: Remove useless mmu_topup_memory_caches() in kvm_mmu_pte_write() · 024c3c33

由 Lai Jiangshan 提交于 6月 05, 2022

Since the commit c5e2184d("KVM: x86/mmu: Remove the defunct
update_pte() paging hook"), kvm_mmu_pte_write() no longer uses the rmap
cache.

So remove mmu_topup_memory_caches() in it.

Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220605063417.308311-6-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

024c3c33

KVM: X86/MMU: Remove unused PT32_DIR_BASE_ADDR_MASK from mmu.c · fc10020a

由 Lai Jiangshan 提交于 6月 05, 2022

It is unused.
Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220605063417.308311-3-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fc10020a

KVM: SVM: Hide SEV migration lockdep goo behind CONFIG_PROVE_LOCKING · e5380f6d

由 Sean Christopherson 提交于 6月 13, 2022

Wrap the manipulation of @role and the manual mutex_{release,acquire}()
invocations in CONFIG_PROVE_LOCKING=y to squash a clang-15 warning. When
building with -Wunused-but-set-parameter and CONFIG_DEBUG_LOCK_ALLOC=n,
clang-15 seees there's no usage of @role in mutex_lock_killable_nested()
and yells. PROVE_LOCKING selects DEBUG_LOCK_ALLOC, and the only reason
KVM manipulates @role is to make PROVE_LOCKING happy.

To avoid true ugliness, use "i" and "j" to detect the first pass in the
loops; the "idx" field that's used by kvm_for_each_vcpu() is guaranteed
to be '0' on the first pass as it's simply the first entry in the vCPUs
XArray, which is fully KVM controlled. kvm_for_each_vcpu() passes '0'
for xa_for_each_range()'s "start", and xa_for_each_range() will not enter
the loop if there's no entry at '0'.

Fixes: 0c2c7c06 ("KVM: SEV: Mark nested locking of vcpu->lock")
Reported-by: Nkernel test robot <lkp@intel.com>
Cc: Peter Gonda <pgonda@google.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220613214237.2538266-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e5380f6d

KVM: SEV: fix misplaced closing parenthesis · 5bdae49f

由 Paolo Bonzini 提交于 6月 15, 2022

This caused a warning on 32-bit systems, but undoubtedly would have acted
funny on 64-bit as well.

The fix was applied directly on merge in 5.19, see commit 24625f7d ("Merge
tag for-linus of git://git.kernel.org/pub/scm/virt/kvm/kvm").

Fixes: 3743c2f0 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base")
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5bdae49f

10 6月, 2022 8 次提交

KVM: x86: Bug the VM on an out-of-bounds data read · d38ea957

由 Sean Christopherson 提交于 5月 26, 2022

Bug the VM and terminate emulation if an out-of-bounds read into the
emulator's data cache occurs.  Knowingly contuining on all but guarantees
that KVM will overwrite random kernel data, which is far, far worse than
killing the VM.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220526210817.3428868-9-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d38ea957

KVM: x86: Bug the VM if the emulator generates a bogus exception vector · 49a1431d

由 Sean Christopherson 提交于 5月 26, 2022

Bug the VM if KVM's emulator attempts to inject a bogus exception vector.
The guest is likely doomed even if KVM continues on, and propagating a
bad vector to the rest of KVM runs the risk of breaking other assumptions
in KVM and thus triggering a more egregious bug.

All existing users of emulate_exception() have hardcoded vector numbers
(__load_segment_descriptor() uses a few different vectors, but they're
all hardcoded), and future users are likely to follow suit, i.e. the
change to emulate_exception() is a glorified nop.

As for the ctxt->exception.vector check in x86_emulate_insn(), the few
known times the WARN has been triggered in the past is when the field was
not set when synthesizing a fault, i.e. for all intents and purposes the
check protects against consumption of uninitialized data.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220526210817.3428868-8-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

49a1431d

KVM: x86: Bug the VM if the emulator accesses a non-existent GPR · 1cca2f8c

由 Sean Christopherson 提交于 5月 26, 2022

Bug the VM, i.e. kill it, if the emulator accesses a non-existent GPR,
i.e. generates an out-of-bounds GPR index.  Continuing on all but
gaurantees some form of data corruption in the guest, e.g. even if KVM
were to redirect to a dummy register, KVM would be incorrectly read zeros
and drop writes.

Note, bugging the VM doesn't completely prevent data corruption, e.g. the
current round of emulation will complete before the vCPU bails out to
userspace.  But, the very act of killing the guest can also cause data
corruption, e.g. due to lack of file writeback before termination, so
taking on additional complexity to cleanly bail out of the emulator isn't
justified, the goal is purely to stem the bleeding and alert userspace
that something has gone horribly wrong, i.e. to avoid _silent_ data
corruption.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220526210817.3428868-7-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1cca2f8c

KVM: x86: Reduce the number of emulator GPRs to '8' for 32-bit KVM · b443183a

由 Sean Christopherson 提交于 5月 26, 2022

Reduce the number of GPRs emulated by 32-bit KVM from 16 to 8.  KVM does
not support emulating 64-bit mode on 32-bit host kernels, and so should
never generate accesses to R8-15.

Opportunistically use NR_EMULATOR_GPRS in rsm_load_state_{32,64}() now
that it is precise and accurate for both flavors.

Wrap the definition with full #ifdef ugliness; sadly, IS_ENABLED()
doesn't guarantee a compile-time constant as far as BUILD_BUG_ON() is
concerned.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Message-Id: <20220526210817.3428868-6-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b443183a

KVM: x86: Use 16-bit fields to track dirty/valid emulator GPRs · 0cbc60d4

由 Sean Christopherson 提交于 5月 26, 2022

Use a u16 instead of a u32 to track the dirty/valid status of GPRs in the
emulator.  Unlike struct kvm_vcpu_arch, x86_emulate_ctxt tracks only the
"true" GPRs, i.e. doesn't include RIP in its array, and so only needs to
track 16 registers.

Note, maxing out at 16 GPRs is a fundamental property of x86-64 and will
not change barring a massive architecture update.  Legacy x86 ModRM and
SIB encodings use 3 bits for GPRs, i.e. support 8 registers.  x86-64 uses
a single bit in the REX prefix for each possible reference type to double
the number of supported GPRs to 16 registers (4 bits).
Reviewed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220526210817.3428868-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0cbc60d4

KVM: x86: Omit VCPU_REGS_RIP from emulator's _regs array · a5ba67b4

由 Sean Christopherson 提交于 5月 26, 2022

Omit RIP from the emulator's _regs array, which is used only for GPRs,
i.e. registers that can be referenced via ModRM and/or SIB bytes.  The
emulator uses the dedicated _eip field for RIP, and manually reads from
_eip to handle RIP-relative addressing.

To avoid an even bigger, slightly more dangerous change, hardcode the
number of GPRs to 16 for the time being even though 32-bit KVM's emulator
technically should only have 8 GPRs.  Add a TODO to address that in a
future commit.

See also the comments above the read_gpr() and write_gpr() declarations,
and obviously the handling in writeback_registers().

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Message-Id: <20220526210817.3428868-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a5ba67b4

KVM: x86: Harden _regs accesses to guard against buggy input · dfe21e6b

由 Sean Christopherson 提交于 5月 26, 2022

WARN and truncate the incoming GPR number/index when reading/writing GPRs
in the emulator to guard against KVM bugs, e.g. to avoid out-of-bounds
accesses to ctxt->_regs[] if KVM generates a bogus index.  Truncate the
index instead of returning e.g. zero, as reg_write() returns a pointer
to the register, i.e. returning zero would result in a NULL pointer
dereference.  KVM could also force the index to any arbitrary GPR, but
that's no better or worse, just different.

Open code the restriction to 16 registers; RIP is handled via _eip and
should never be accessed through reg_read() or reg_write().  See the
comments above the declarations of reg_read() and reg_write(), and the
behavior of writeback_registers().  The horrific open coded mess will be
cleaned up in a future commit.

There are no such bugs known to exist in the emulator, but determining
that KVM is bug-free is not at all simple and requires a deep dive into
the emulator.  The code is so convoluted that GCC-12 with the recently
enable -Warray-bounds spits out a false-positive due to a GCC bug:

  arch/x86/kvm/emulate.c:254:27: warning: array subscript 32 is above array
                                 bounds of 'long unsigned int[17]' [-Warray-bounds]
    254 |         return ctxt->_regs[nr];
        |                ~~~~~~~~~~~^~~~
  In file included from arch/x86/kvm/emulate.c:23:
  arch/x86/kvm/kvm_emulate.h: In function 'reg_rmw':
  arch/x86/kvm/kvm_emulate.h:366:23: note: while referencing '_regs'
    366 |         unsigned long _regs[NR_VCPU_REGS];
        |                       ^~~~~

Link: https://lore.kernel.org/all/YofQlBrlx18J7h9Y@google.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216026
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105679Reported-and-tested-by: NRobert Dinse <nanook@eskimo.com>
Reported-by: NKees Cook <keescook@chromium.org>
Reviewed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220526210817.3428868-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

dfe21e6b

KVM: x86: Grab regs_dirty in local 'unsigned long' · 61d9c412

由 Sean Christopherson 提交于 5月 26, 2022

Capture ctxt->regs_dirty in a local 'unsigned long' instead of casting it
to an 'unsigned long *' for use in for_each_set_bit(). The bitops helpers
really do read the entire 'unsigned long', even though the walking of the
read value is capped at the specified size. I.e. 64-bit KVM is reading
memory beyond ctxt->regs_dirty, which is a u32 and thus 4 bytes, whereas
an unsigned long is 8 bytes. Functionally it's not an issue because
regs_dirty is in the middle of x86_emulate_ctxt, i.e. KVM is just reading
its own memory, but relying on that coincidence is gross and unsafe.
Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220526210817.3428868-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

61d9c412

09 6月, 2022 8 次提交

KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE · e3cdaab5

由 Paolo Bonzini 提交于 5月 31, 2022

Commit 74fd41ed ("KVM: x86: nSVM: support PAUSE filtering when L0
doesn't intercept PAUSE") introduced passthrough support for nested pause
filtering, (when the host doesn't intercept PAUSE) (either disabled with
kvm module param, or disabled with '-overcommit cpu-pm=on')

Before this commit, L1 KVM didn't intercept PAUSE at all; afterwards,
the feature was exposed as supported by KVM cpuid unconditionally, thus
if L1 could try to use it even when the L0 KVM can't really support it.

In this case the fallback caused KVM to intercept each PAUSE instruction;
in some cases, such intercept can slow down the nested guest so much
that it can fail to boot. Instead, before the problematic commit KVM
was already setting both thresholds to 0 in vmcb02, but after the first
userspace VM exit shrink_ple_window was called and would reset the
pause_filter_count to the default value.

To fix this, change the fallback strategy - ignore the guest threshold
values, but use/update the host threshold values unless the guest
specifically requests disabling PAUSE filtering (either simple or
advanced).

Also fix a minor bug: on nested VM exit, when PAUSE filter counter
were copied back to vmcb01, a dirty bit was not set.

Thanks a lot to Suravee Suthikulpanit for debugging this!

Fixes: 74fd41ed ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE")
Reported-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Co-developed-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220518072709.730031-1-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e3cdaab5

KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put · ba8ec273

由 Maxim Levitsky 提交于 6月 06, 2022

Now that these functions are always called with preemption disabled,
remove the preempt_disable()/preempt_enable() pair inside them.

No functional change intended.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-8-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ba8ec273

KVM: x86: disable preemption while updating apicv inhibition · 66c768d3

由 Maxim Levitsky 提交于 6月 06, 2022

Currently nothing prevents preemption in kvm_vcpu_update_apicv.

On SVM, If the preemption happens after we update the
vcpu->arch.apicv_active, the preemption itself will
'update' the inhibition since the AVIC will be first disabled
on vCPU unload and then enabled, when the current task
is loaded again.

Then we will try to update it again, which will lead to a warning
in __avic_vcpu_load, that the AVIC is already enabled.

Fix this by disabling preemption in this code.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-6-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

66c768d3

KVM: x86: SVM: fix avic_kick_target_vcpus_fast · 603ccef4

由 Maxim Levitsky 提交于 6月 06, 2022

There are two issues in avic_kick_target_vcpus_fast

1. It is legal to issue an IPI request with APIC_DEST_NOSHORT
   and a physical destination of 0xFF (or 0xFFFFFFFF in case of x2apic),
   which must be treated as a broadcast destination.

   Fix this by explicitly checking for it.
   Also donâ€™t use â€˜indexâ€™ in this case as it gives no new information.

2. It is legal to issue a logical IPI request to more than one target.
   Index field only provides index in physical id table of first
   such target and therefore can't be used before we are sure
   that only a single target was addressed.

   Instead, parse the ICRL/ICRH, double check that a unicast interrupt
   was requested, and use that info to figure out the physical id
   of the target vCPU.
   At that point there is no need to use the index field as well.

In addition to fixing the above	issues,	also skip the call to
kvm_apic_match_dest.

It is possible to do this now, because now as long as AVIC is not
inhibited, it is guaranteed that none of the vCPUs changed their
apic id from its default value.

This fixes boot of windows guest with AVIC enabled because it uses
IPI with 0xFF destination and no destination shorthand.

Fixes: 7223fd2d ("KVM: SVM: Use target APIC ID to complete AVIC IRQs when possible")
Cc: stable@vger.kernel.org
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-5-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

603ccef4

KVM: x86: SVM: remove avic's broken code that updated APIC ID · f5f9089f

由 Maxim Levitsky 提交于 6月 06, 2022

AVIC is now inhibited if the guest changes the apic id,
and therefore this code is no longer needed.

There are several ways this code was broken, including:

1. a vCPU was only allowed to change its apic id to an apic id
of an existing vCPU.

2. After such change, the vCPU whose apic id entry was overwritten,
could not correctly change its own apic id, because its own
entry is already overwritten.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-4-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f5f9089f

KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base · 3743c2f0

由 Maxim Levitsky 提交于 6月 06, 2022

Neither of these settings should be changed by the guest and it is
a burden to support it in the acceleration code, so just inhibit
this code instead.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-3-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3743c2f0

KVM: x86: document AVIC/APICv inhibit reasons · a9603ae0

由 Maxim Levitsky 提交于 6月 06, 2022

These days there are too many AVIC/APICv inhibit
reasons, and it doesn't hurt to have some documentation
for them.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-2-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a9603ae0

KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs · d2263de1

由 Yuan Yao 提交于 6月 08, 2022

Assign shadow_me_value, not shadow_me_mask, to PAE root entries,
a.k.a. shadow PDPTRs, when host memory encryption is supported.  The
"mask" is the set of all possible memory encryption bits, e.g. MKTME
KeyIDs, whereas "value" holds the actual value that needs to be
stuffed into host page tables.

Using shadow_me_mask results in a failed VM-Entry due to setting
reserved PA bits in the PDPTRs, and ultimately causes an OOPS due to
physical addresses with non-zero MKTME bits sending to_shadow_page()
into the weeds:

set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
BUG: unable to handle page fault for address: ffd43f00063049e8
PGD 86dfd8067 P4D 0
Oops: 0000 [#1] PREEMPT SMP
RIP: 0010:mmu_free_root_page+0x3c/0x90 [kvm]
 kvm_mmu_free_roots+0xd1/0x200 [kvm]
 __kvm_mmu_unload+0x29/0x70 [kvm]
 kvm_mmu_unload+0x13/0x20 [kvm]
 kvm_arch_destroy_vm+0x8a/0x190 [kvm]
 kvm_put_kvm+0x197/0x2d0 [kvm]
 kvm_vm_release+0x21/0x30 [kvm]
 __fput+0x8e/0x260
 ____fput+0xe/0x10
 task_work_run+0x6f/0xb0
 do_exit+0x327/0xa90
 do_group_exit+0x35/0xa0
 get_signal+0x911/0x930
 arch_do_signal_or_restart+0x37/0x720
 exit_to_user_mode_prepare+0xb2/0x140
 syscall_exit_to_user_mode+0x16/0x30
 do_syscall_64+0x4e/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: e54f1ff2 ("KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask")
Signed-off-by: NYuan Yao <yuan.yao@intel.com>
Reviewed-by: NKai Huang <kai.huang@intel.com>
Message-Id: <20220608012015.19566-1-yuan.yao@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d2263de1

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功