提交 · 4f117ce4aefca0e90cd44680219d4c261c1381b9 · openeuler / Kernel

02 8月, 2021 16 次提交

KVM: SVM: Zero out GDTR.base and IDTR.base on INIT · 4f117ce4

由 Sean Christopherson 提交于 7月 13, 2021

Explicitly set GDTR.base and IDTR.base to zero when intializing the VMCB.
Functionally this only affects INIT, as the bases are implicitly set to
zero on RESET by virtue of the VMCB being zero allocated.

Per AMD's APM, GDTR.base and IDTR.base are zeroed after RESET and INIT.

Fixes: 04d2cc77 ("KVM: Move main vcpu loop into subarch independent code")
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4f117ce4

KVM: nVMX: Set LDTR to its architecturally defined value on nested VM-Exit · afc8de01

由 Sean Christopherson 提交于 7月 13, 2021

Set L1's LDTR on VM-Exit per the Intel SDM:

  The host-state area does not contain a selector field for LDTR. LDTR is
  established as follows on all VM exits: the selector is cleared to
  0000H, the segment is marked unusable and is otherwise undefined
  (although the base address is always canonical).

This is likely a benign bug since the LDTR is unusable, as it means the
L1 VMM is conditioned to reload its LDTR in order to function properly on
bare metal.

Fixes: 4704d0be ("KVM: nVMX: Exiting from L2 to L1")
Reviewed-by: NReiji Watanabe <reijiw@google.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

afc8de01

KVM: x86: Flush the guest's TLB on INIT · df37ed38

由 Sean Christopherson 提交于 7月 13, 2021

Flush the guest's TLB on INIT, as required by Intel's SDM.  Although
AMD's APM states that the TLBs are unchanged by INIT, it's not clear that
that's correct as the APM also states that the TLB is flush on "External
initialization of the processor."  Regardless, relying on the guest to be
paranoid is unnecessarily risky, while an unnecessary flush is benign
from a functional perspective and likely has no measurable impact on
guest performance.

Note, as of the April 2021 version of Intels' SDM, it also contradicts
itself with respect to TLB flushing.  The overview of INIT explicitly
calls out the TLBs as being invalidated, while a table later in the same
section says they are unchanged.

  9.1 INITIALIZATION OVERVIEW:
    The major difference is that during an INIT, the internal caches, MSRs,
    MTRRs, and x87 FPU state are left unchanged (although, the TLBs and BTB
    are invalidated as with a hardware reset)

  Table 9-1:

  Register                    Power up    Reset      INIT
  Data and Code Cache, TLBs:  Invalid[6]  Invalid[6] Unchanged

Given Core2's erratum[*] about global TLB entries not being flush on INIT,
it's safe to assume that the table is simply wrong.

  AZ28. INIT Does Not Clear Global Entries in the TLB
  Problem: INIT may not flush a TLB entry when:
    • The processor is in protected mode with paging enabled and the page global enable
      flag is set (PGE bit of CR4 register)
    • G bit for the page table entry is set
    • TLB entry is present in TLB when INIT occurs
    • Software may encounter unexpected page fault or incorrect address translation due
      to a TLB entry erroneously left in TLB after INIT.

  Workaround: Write to CR3, CR4 (setting bits PSE, PGE or PAE) or CR0 (setting
              bits PG or PE) registers before writing to memory early in BIOS
              code to clear all the global entries from TLB.

  Status: For the steppings affected, see the Summary Tables of Changes.

[*] https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf

Fixes: 6aa8b732 ("[PATCH] kvm: userspace interface")
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

df37ed38

KVM: x86: APICv: drop immediate APICv disablement on current vCPU · df63202f

由 Maxim Levitsky 提交于 7月 13, 2021

Special case of disabling the APICv on the current vCPU right away in
kvm_request_apicv_update doesn't bring much benefit vs raising
KVM_REQ_APICV_UPDATE on it instead, since this request will be processed
on the next entry to the guest.
(the comment about having another #VMEXIT is wrong).

It also hides various assumptions that APIVc enable state matches
the APICv inhibit state, as this special case only makes those states
match on the current vCPU.

Previous patches fixed few such assumptions so now it should be safe
to drop this special case.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210713142023.106183-5-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

df63202f

KVM: x86: enable TDP MMU by default · 71ba3f31

由 Paolo Bonzini 提交于 7月 26, 2021

With the addition of fast page fault support, the TDP-specific MMU has reached
feature parity with the original MMU. All my testing in the last few months
has been done with the TDP MMU; switch the default on 64-bit machines.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

71ba3f31

KVM: x86/mmu: fast_page_fault support for the TDP MMU · 6e8eb206

由 David Matlack 提交于 7月 13, 2021

Make fast_page_fault interoperate with the TDP MMU by leveraging
walk_shadow_page_lockless_{begin,end} to acquire the RCU read lock and
introducing a new helper function kvm_tdp_mmu_fast_pf_get_last_sptep to
grab the lowest level sptep.
Suggested-by: NBen Gardon <bgardon@google.com>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20210713220957.3493520-5-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6e8eb206

KVM: x86/mmu: Make walk_shadow_page_lockless_{begin,end} interoperate with the TDP MMU · c5c8c7c5

由 David Matlack 提交于 7月 13, 2021

Acquire the RCU read lock in walk_shadow_page_lockless_begin and release
it in walk_shadow_page_lockless_end when the TDP MMU is enabled.  This
should not introduce any functional changes but is used in the following
commit to make fast_page_fault interoperate with the TDP MMU.
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20210713220957.3493520-4-dmatlack@google.com>
[Use if...else instead of if(){return;}]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c5c8c7c5

KVM: x86/mmu: Fix use of enums in trace_fast_page_fault · 61bcd360

由 David Matlack 提交于 7月 13, 2021

Enum values have to be exported to userspace since the formatting is not
done in the kernel. Without doing this perf maps RET_PF_FIXED and
RET_PF_SPURIOUS to 0, which results in incorrect output:

  $ perf record -a -e kvmmmu:fast_page_fault --filter "ret==3" -- ./access_tracking_perf_test
  $ perf script | head -1
   [...] new 610006048d25877 spurious 0 fixed 0  <------ should be 1

Fix this by exporting the enum values to userspace with TRACE_DEFINE_ENUM.

Fixes: c4371c2a ("KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed")
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20210713220957.3493520-3-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

61bcd360

KVM: x86/mmu: Rename cr2_or_gpa to gpa in fast_page_fault · 76cd325e

由 David Matlack 提交于 7月 13, 2021

fast_page_fault is only called from direct_page_fault where we know the
address is a gpa.

Fixes: 736c291c ("KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM")
Reviewed-by: NBen Gardon <bgardon@google.com>
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20210713220957.3493520-2-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

76cd325e

x86/kvm: remove non-x86 stuff from arch/x86/kvm/ioapic.h · 1694caef

由 Juergen Gross 提交于 7月 01, 2021

The file has been moved to arch/x86 long time ago. Time to get rid of
non-x86 stuff.
Signed-off-by: NJuergen Gross <jgross@suse.com>
Message-Id: <20210701154105.23215-3-jgross@suse.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1694caef

KVM: X86: Add per-vm stat for max rmap list size · ec1cf69c

由 Peter Xu 提交于 6月 25, 2021

Add a new statistic max_mmu_rmap_size, which stores the maximum size of rmap
for the vm.
Signed-off-by: NPeter Xu <peterx@redhat.com>
Message-Id: <20210625153214.43106-2-peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ec1cf69c

KVM: x86/mmu: Return old SPTE from mmu_spte_clear_track_bits() · 7fa2a347

由 Sean Christopherson 提交于 7月 02, 2021

Return the old SPTE when clearing a SPTE and push the "old SPTE present"
check to the caller.  Private shadow page support will use the old SPTE
in rmap_remove() to determine whether or not there is a linked private
shadow page.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NIsaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <b16bac1fd1357aaf39e425aab2177d3f89ee8318.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7fa2a347

KVM: x86/mmu: Refactor shadow walk in __direct_map() to reduce indentation · 03fffc54

由 Sean Christopherson 提交于 7月 02, 2021

Employ a 'continue' to reduce the indentation for linking a new shadow
page during __direct_map() in preparation for linking private pages.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NIsaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <702419686d5700373123f6ea84e7a946c2cad8b4.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

03fffc54

KVM: x86: Hoist kvm_dirty_regs check out of sync_regs() · e489a4a6

由 Sean Christopherson 提交于 7月 02, 2021

Move the kvm_dirty_regs vs. KVM_SYNC_X86_VALID_FIELDS check out of
sync_regs() and into its sole caller, kvm_arch_vcpu_ioctl_run().  This
allows a future patch to allow synchronizing select state for protected
VMs.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NIsaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <889017a8d31cea46472e0c64b234ef5919278ed9.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e489a4a6

KVM: x86/mmu: Mark VM as bugged if page fault returns RET_PF_INVALID · 19025e7b

由 Sean Christopherson 提交于 7月 02, 2021

Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NIsaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <298980aa5fc5707184ac082287d13a800cd9c25f.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

19025e7b

KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the VM · 67369273

由 Sean Christopherson 提交于 7月 02, 2021

Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NIsaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <0e8760a26151f47dc47052b25ca8b84fffe0641e.1625186503.git.isaku.yamahata@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

67369273

30 7月, 2021 1 次提交

KVM: x86: accept userspace interrupt only if no event is injected · fa7a549d

由 Paolo Bonzini 提交于 7月 14, 2021

Once an exception has been injected, any side effects related to
the exception (such as setting CR2 or DR6) have been taked place.
Therefore, once KVM sets the VM-entry interruption information
field or the AMD EVENTINJ field, the next VM-entry must deliver that
exception.

Pending interrupts are processed after injected exceptions, so
in theory it would not be a problem to use KVM_INTERRUPT when
an injected exception is present.  However, DOSEMU is using
run->ready_for_interrupt_injection to detect interrupt windows
and then using KVM_SET_SREGS/KVM_SET_REGS to inject the
interrupt manually.  For this to work, the interrupt window
must be delayed after the completion of the previous event
injection.

Cc: stable@vger.kernel.org
Reported-by: NStas Sergeev <stsp2@yandex.ru>
Tested-by: NStas Sergeev <stsp2@yandex.ru>
Fixes: 71cc849b ("KVM: x86: Fix split-irqchip vs interrupt injection window request")
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fa7a549d

28 7月, 2021 5 次提交

KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl · 5868b822

由 Maxim Levitsky 提交于 7月 13, 2021

Currently when SVM is enabled in guest CPUID, AVIC is inhibited as soon
as the guest CPUID is set.

AVIC happens to be fully disabled on all vCPUs by the time any guest
entry starts (if after migration the entry can be nested).

The reason is that currently we disable avic right away on vCPU from which
the kvm_request_apicv_update was called and for this case, it happens to be
called on all vCPUs (by svm_vcpu_after_set_cpuid).

After we stop doing this, AVIC will end up being disabled only when
KVM_REQ_APICV_UPDATE is processed which is after we done switching to the
nested guest.

Fix this by just using vmcb01 in svm_refresh_apicv_exec_ctrl for avic
(which is a right thing to do anyway).
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210713142023.106183-4-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5868b822

KVM: SVM: tweak warning about enabled AVIC on nested entry · feea0136

由 Maxim Levitsky 提交于 7月 13, 2021

It is possible that AVIC was requested to be disabled but
not yet disabled, e.g if the nested entry is done right
after svm_vcpu_after_set_cpuid.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210713142023.106183-3-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

feea0136

KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to be deactivated · f1577ab2

由 Maxim Levitsky 提交于 7月 13, 2021

It is possible for AVIC inhibit and AVIC active state to be mismatched.
Currently we disable AVIC right away on vCPU which started the AVIC inhibit
request thus this warning doesn't trigger but at least in theory,
if svm_set_vintr is called at the same time on multiple vCPUs,
the warning can happen.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210713142023.106183-2-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f1577ab2

KVM: SVM: delay svm_vcpu_init_msrpm after svm->vmcb is initialized · 3fa5e8fd

由 Paolo Bonzini 提交于 7月 26, 2021

Right now, svm_hv_vmcb_dirty_nested_enlightenments has an incorrect
dereference of vmcb->control.reserved_sw before the vmcb is checked
for being non-NULL.  The compiler is usually sinking the dereference
after the check; instead of doing this ourselves in the source,
ensure that svm_hv_vmcb_dirty_nested_enlightenments is only called
with a non-NULL VMCB.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Cc: Vineeth Pillai <viremana@linux.microsoft.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
[Untested for now due to issues with my AMD machine. - Paolo]

3fa5e8fd

x86/kvm: fix vcpu-id indexed array sizes · 76b4f357

由 Juergen Gross 提交于 7月 01, 2021

KVM_MAX_VCPU_ID is the maximum vcpu-id of a guest, and not the number
of vcpu-ids. Fix array indexed by vcpu-id to have KVM_MAX_VCPU_ID+1
elements.

Note that this is currently no real problem, as KVM_MAX_VCPU_ID is
an odd number, resulting in always enough padding being available at
the end of those arrays.

Nevertheless this should be fixed in order to avoid rare problems in
case someone is using an even number for KVM_MAX_VCPU_ID.
Signed-off-by: NJuergen Gross <jgross@suse.com>
Message-Id: <20210701154105.23215-2-jgross@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

76b4f357

26 7月, 2021 3 次提交

KVM: x86: Check the right feature bit for MSR_KVM_ASYNC_PF_ACK access · 0a31df68

由 Vitaly Kuznetsov 提交于 7月 22, 2021

MSR_KVM_ASYNC_PF_ACK MSR is part of interrupt based asynchronous page fault
interface and not the original (deprecated) KVM_FEATURE_ASYNC_PF. This is
stated in Documentation/virt/kvm/msr.rst.

Fixes: 66570e96 ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: NOliver Upton <oupton@google.com>
Message-Id: <20210722123018.260035-1-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0a31df68

KVM: nSVM: Swap the parameter order for svm_copy_vmrun_state()/svm_copy_vmloadsave_state() · 2bb16bea

由 Vitaly Kuznetsov 提交于 7月 19, 2021

Make svm_copy_vmrun_state()/svm_copy_vmloadsave_state() interface match
'memcpy(dest, src)' to avoid any confusion.

No functional change intended.
Suggested-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210719090322.625277-1-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2bb16bea

KVM: nSVM: Rename nested_svm_vmloadsave() to svm_copy_vmloadsave_state() · 9a9e7481

由 Vitaly Kuznetsov 提交于 7月 16, 2021

To match svm_copy_vmrun_state(), rename nested_svm_vmloadsave() to
svm_copy_vmloadsave_state().

Opportunistically add missing braces to 'else' branch in
vmload_vmsave_interception().

No functional change intended.
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210716144104.465269-1-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9a9e7481

15 7月, 2021 15 次提交

KVM: nSVM: Restore nested control upon leaving SMM · bb00bd9c

由 Vitaly Kuznetsov 提交于 6月 28, 2021

If the VM was migrated while in SMM, no nested state was saved/restored,
and therefore svm_leave_smm has to load both save and control area
of the vmcb12. Save area is already loaded from HSAVE area,
so now load the control area as well from the vmcb12.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210628104425.391276-6-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bb00bd9c

KVM: nSVM: Fix L1 state corruption upon return from SMM · 37be407b

由 Vitaly Kuznetsov 提交于 6月 28, 2021

VMCB split commit 4995a368 ("KVM: SVM: Use a separate vmcb for the
nested L2 guest") broke return from SMM when we entered there from guest
(L2) mode. Gen2 WS2016/Hyper-V is known to do this on boot. The problem
manifests itself like this:

  kvm_exit:             reason EXIT_RSM rip 0x7ffbb280 info 0 0
  kvm_emulate_insn:     0:7ffbb280: 0f aa
  kvm_smm_transition:   vcpu 0: leaving SMM, smbase 0x7ffb3000
  kvm_nested_vmrun:     rip: 0x000000007ffbb280 vmcb: 0x0000000008224000
    nrip: 0xffffffffffbbe119 int_ctl: 0x01020000 event_inj: 0x00000000
    npt: on
  kvm_nested_intercepts: cr_read: 0000 cr_write: 0010 excp: 40060002
    intercepts: fd44bfeb 0000217f 00000000
  kvm_entry:            vcpu 0, rip 0xffffffffffbbe119
  kvm_exit:             reason EXIT_NPF rip 0xffffffffffbbe119 info
    200000006 1ab000
  kvm_nested_vmexit:    vcpu 0 reason npf rip 0xffffffffffbbe119 info1
    0x0000000200000006 info2 0x00000000001ab000 intr_info 0x00000000
    error_code 0x00000000
  kvm_page_fault:       address 1ab000 error_code 6
  kvm_nested_vmexit_inject: reason EXIT_NPF info1 200000006 info2 1ab000
    int_info 0 int_info_err 0
  kvm_entry:            vcpu 0, rip 0x7ffbb280
  kvm_exit:             reason EXIT_EXCP_GP rip 0x7ffbb280 info 0 0
  kvm_emulate_insn:     0:7ffbb280: 0f aa
  kvm_inj_exception:    #GP (0x0)

Note: return to L2 succeeded but upon first exit to L1 its RIP points to
'RSM' instruction but we're not in SMM.

The problem appears to be that VMCB01 gets irreversibly destroyed during
SMM execution. Previously, we used to have 'hsave' VMCB where regular
(pre-SMM) L1's state was saved upon nested_svm_vmexit() but now we just
switch to VMCB01 from VMCB02.

Pre-split (working) flow looked like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() restores L1's state from 'hsave'
- SMM -> RSM
- enter_svm_guest_mode() switches to L2 but keeps 'hsave' intact so we have
  pre-SMM (and pre L2 VMRUN) L1's state there
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from L1.

This was always broken with regards to svm_get_nested_state()/
svm_set_nested_state(): 'hsave' was never a part of what's being
save and restored so migration happening during SMM triggered from L2 would
never restore L1's state correctly.

Post-split flow (broken) looks like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() switches to VMCB01 from VMCB02
- SMM -> RSM
- enter_svm_guest_mode() switches from VMCB01 to VMCB02 but pre-SMM VMCB01
  is already lost.
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from VMCB01 but it is corrupted
 (reflects the state during 'RSM' execution).

VMX doesn't have this problem because unlike VMCB, VMCS keeps both guest
and host state so when we switch back to VMCS02 L1's state is intact there.

To resolve the issue we need to save L1's state somewhere. We could've
created a third VMCB for SMM but that would require us to modify saved
state format. L1's architectural HSAVE area (pointed by MSR_VM_HSAVE_PA)
seems appropriate: L0 is free to save any (or none) of L1's state there.
Currently, KVM does 'none'.

Note, for nested state migration to succeed, both source and destination
hypervisors must have the fix. We, however, don't need to create a new
flag indicating the fact that HSAVE area is now populated as migration
during SMM triggered from L2 was always broken.

Fixes: 4995a368 ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

37be407b

KVM: nSVM: Introduce svm_copy_vmrun_state() · 0a758290

由 Vitaly Kuznetsov 提交于 6月 28, 2021

Separate the code setting non-VMLOAD-VMSAVE state from
svm_set_nested_state() into its own function. This is going to be
re-used from svm_enter_smm()/svm_leave_smm().
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210628104425.391276-4-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0a758290

KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN · fb79f566

由 Vitaly Kuznetsov 提交于 6月 28, 2021

APM states that "The address written to the VM_HSAVE_PA MSR, which holds
the address of the page used to save the host state on a VMRUN, must point
to a hypervisor-owned page. If this check fails, the WRMSR will fail with
a #GP(0) exception. Note that a value of 0 is not considered valid for the
VM_HSAVE_PA MSR and a VMRUN that is attempted while the HSAVE_PA is 0 will
fail with a #GP(0) exception."

svm_set_msr() already checks that the supplied address is valid, so only
check for '0' is missing. Add it to nested_svm_vmrun().
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210628104425.391276-3-vkuznets@redhat.com>
Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fb79f566

KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA · fce7e152

由 Vitaly Kuznetsov 提交于 6月 28, 2021

APM states that #GP is raised upon write to MSR_VM_HSAVE_PA when
the supplied address is not page-aligned or is outside of "maximum
supported physical address for this implementation".
page_address_valid() check seems suitable. Also, forcefully page-align
the address when it's written from VMM.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210628104425.391276-2-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
[Add comment about behavior for host-provided values. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fce7e152

KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities · c7a1b2b6

由 Sean Christopherson 提交于 5月 06, 2021

Use IS_ERR() instead of checking for a NULL pointer when querying for
sev_pin_memory() failures.  sev_pin_memory() always returns an error code
cast to a pointer, or a valid pointer; it never returns NULL.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Cc: Steve Rutherford <srutherford@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Fixes: d3d1af85 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Fixes: 15fb7de1 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command")
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20210506175826.2166383-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c7a1b2b6

KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails · b4a69392

由 Sean Christopherson 提交于 5月 06, 2021

Return -EFAULT if copy_to_user() fails; if accessing user memory faults,
copy_to_user() returns the number of bytes remaining, not an error code.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Cc: Steve Rutherford <srutherford@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Fixes: d3d1af85 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20210506175826.2166383-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b4a69392

KVM: SVM: add module param to control the #SMI interception · 4b639a9f

由 Maxim Levitsky 提交于 7月 07, 2021

In theory there are no side effects of not intercepting #SMI,
because then #SMI becomes transparent to the OS and the KVM.

Plus an observation on recent Zen2 CPUs reveals that these
CPUs ignore #SMI interception and never deliver #SMI VMexits.

This is also useful to test nested KVM to see that L1
handles #SMIs correctly in case when L1 doesn't intercept #SMI.

Finally the default remains the same, the SMI are intercepted
by default thus this patch doesn't have any effect unless
non default module param value is used.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210707125100.677203-4-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4b639a9f

KVM: SVM: remove INIT intercept handler · 896707c2

由 Maxim Levitsky 提交于 7月 07, 2021

Kernel never sends real INIT even to CPUs, other than on boot.

Thus INIT interception is an error which should be caught
by a check for an unknown VMexit reason.

On top of that, the current INIT VM exit handler skips
the current instruction which is wrong.
That was added in commit 5ff3a351 ("KVM: x86: Move trivial
instruction-based exit handlers to common code").

Fixes: 5ff3a351 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210707125100.677203-3-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

896707c2

KVM: SVM: #SMI interception must not skip the instruction · 991afbbe

由 Maxim Levitsky 提交于 7月 07, 2021

Commit 5ff3a351 ("KVM: x86: Move trivial instruction-based
exit handlers to common code"), unfortunately made a mistake of
treating nop_on_interception and nop_interception in the same way.

Former does truly nothing while the latter skips the instruction.

SMI VM exit handler should do nothing.
(SMI itself is handled by the host when we do STGI)

Fixes: 5ff3a351 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210707125100.677203-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

991afbbe

KVM: VMX: Remove vmx_msr_index from vmx.h · c0e1303e

由 Yu Zhang 提交于 7月 08, 2021

vmx_msr_index was used to record the list of MSRs which can be lazily
restored when kvm returns to userspace. It is now reimplemented as
kvm_uret_msrs_list, a common x86 list which is only used inside x86.c.
So just remove the obsolete declaration in vmx.h.
Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210707235702.31595-1-yu.c.zhang@linux.intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c0e1303e

KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run() · f85d4016

由 Lai Jiangshan 提交于 6月 29, 2021

When the host is using debug registers but the guest is not using them
nor is the guest in guest-debug state, the kvm code does not reset
the host debug registers before kvm_x86->run().  Rather, it relies on
the hardware vmentry instruction to automatically reset the dr7 registers
which ensures that the host breakpoints do not affect the guest.

This however violates the non-instrumentable nature around VM entry
and exit; for example, when a host breakpoint is set on vcpu->arch.cr2,

Another issue is consistency.  When the guest debug registers are active,
the host breakpoints are reset before kvm_x86->run(). But when the
guest debug registers are inactive, the host breakpoints are delayed to
be disabled.  The host tracing tools may see different results depending
on what the guest is doing.

To fix the problems, we clear %db7 unconditionally before kvm_x86->run()
if the host has set any breakpoints, no matter if the guest is using
them or not.
Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
[Only clear %db7 instead of reloading all debug registers. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f85d4016

KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM · 7234c362

由 Like Xu 提交于 6月 28, 2021

The AMD platform does not support the functions Ah CPUID leaf. The returned
results for this entry should all remain zero just like the native does:

AMD host:
   0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
(uncanny) AMD guest:
   0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00008000

Fixes: cadbaa03 ("perf/x86/intel: Make anythread filter support conditional")
Signed-off-by: NLike Xu <likexu@tencent.com>
Message-Id: <20210628074354.33848-1-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7234c362

KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler · 76ff371b

由 Sean Christopherson 提交于 6月 24, 2021

Don't clear the C-bit in the #NPF handler, as it is a legal GPA bit for
non-SEV guests, and for SEV guests the C-bit is dropped before the GPA
hits the NPT in hardware.  Clearing the bit for non-SEV guests causes KVM
to mishandle #NPFs with that collide with the host's C-bit.

Although the APM doesn't explicitly state that the C-bit is not reserved
for non-SEV, Tom Lendacky confirmed that the following snippet about the
effective reduction due to the C-bit does indeed apply only to SEV guests.

  Note that because guest physical addresses are always translated
  through the nested page tables, the size of the guest physical address
  space is not impacted by any physical address space reduction indicated
  in CPUID 8000_001F[EBX]. If the C-bit is a physical address bit however,
  the guest physical address space is effectively reduced by 1 bit.

And for SEV guests, the APM clearly states that the bit is dropped before
walking the nested page tables.

  If the C-bit is an address bit, this bit is masked from the guest
  physical address when it is translated through the nested page tables.
  Consequently, the hypervisor does not need to be aware of which pages
  the guest has chosen to mark private.

Note, the bogus C-bit clearing was removed from legacy #PF handler in
commit 6d1b867d ("KVM: SVM: Don't strip the C-bit from CR2 on #PF
interception").

Fixes: 0ede79e1 ("KVM: SVM: Clear C-bit from the page fault address")
Cc: Peter Gonda <pgonda@google.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20210625020354.431829-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

76ff371b

KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs · fc9bf2e0

由 Sean Christopherson 提交于 6月 23, 2021

Ignore "dynamic" host adjustments to the physical address mask when
generating the masks for guest PTEs, i.e. the guest PA masks. The host
physical address space and guest physical address space are two different
beasts, e.g. even though SEV's C-bit is the same bit location for both
host and guest, disabling SME in the host (which clears shadow_me_mask)
does not affect the guest PTE->GPA "translation".

For non-SEV guests, not dropping bits is the correct behavior. Assuming
KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits
that are lost as collateral damage from memory encryption are treated as
reserved bits, i.e. KVM will never get to the point where it attempts to
generate a gfn using the affected bits. And if userspace wants to create
a bogus vCPU, then userspace gets to deal with the fallout of hardware
doing odd things with bad GPAs.

For SEV guests, not dropping the C-bit is technically wrong, but it's a
moot point because KVM can't read SEV guest's page tables in any case
since they're always encrypted. Not to mention that the current KVM code
is also broken since sme_me_mask does not have to be non-zero for SEV to
be supported by KVM. The proper fix would be to teach all of KVM to
correctly handle guest private memory, but that's a task for the future.

Fixes: d0ec49d4 ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20210623230552.4027702-5-seanjc@google.com>
[Use a new header instead of adding header guards to paging_tmpl.h. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fc9bf2e0

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功