提交 · a8ebfcd33caf29592957229c8350f67b48b8efce · openeuler / Kernel

14 7月, 2022 1 次提交

KVM: nVMX: Always enable TSC scaling for L2 when it was enabled for L1 · 99482726

由 Vitaly Kuznetsov 提交于 7月 12, 2022

Windows 10/11 guests with Hyper-V role (WSL2) enabled are observed to
hang upon boot or shortly after when a non-default TSC frequency was
set for L1. The issue is observed on a host where TSC scaling is
supported. The problem appears to be that Windows doesn't use TSC
frequency for its guests even when the feature is advertised and KVM
filters SECONDARY_EXEC_TSC_SCALING out when creating L2 controls from
L1's. This leads to L2 running with the default frequency (matching
host's) while L1 is running with an altered one.

Keep SECONDARY_EXEC_TSC_SCALING in secondary exec controls for L2 when
it was set for L1. TSC_MULTIPLIER is already correctly computed and
written by prepare_vmcs02().
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220712135009.952805-1-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

99482726

27 6月, 2022 1 次提交

KVM: VMX: Convert launched argument to flags · bb066506

由 Josh Poimboeuf 提交于 6月 14, 2022

Convert __vmx_vcpu_run()'s 'launched' argument to 'flags', in
preparation for doing SPEC_CTRL handling immediately after vmexit, which
will need another flag.

This is much easier than adding a fourth argument, because this code
supports both 32-bit and 64-bit, and the fourth argument on 32-bit would
have to be pushed on the stack.

Note that __vmx_vcpu_run_flags() is called outside of the noinstr
critical section because it will soon start calling potentially
traceable functions.
Signed-off-by: NJosh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NBorislav Petkov <bp@suse.de>

bb066506

30 4月, 2022 1 次提交

KVM: x86: Clean up and document nested #PF workaround · 6819af75

由 Sean Christopherson 提交于 3月 03, 2022

Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround
with an explicit, common workaround in kvm_inject_emulated_page_fault().
Aside from being a hack, the current approach is brittle and incomplete,
e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(),
and nVMX fails to apply the workaround when VMX is intercepting #PF due
to allow_smaller_maxphyaddr=1.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6819af75

22 4月, 2022 1 次提交

KVM: nVMX: Defer APICv updates while L2 is active until L1 is active · 7c69661e

由 Sean Christopherson 提交于 4月 20, 2022

Defer APICv updates that occur while L2 is active until nested VM-Exit,
i.e. until L1 regains control. vmx_refresh_apicv_exec_ctrl() assumes L1
is active and (a) stomps all over vmcs02 and (b) neglects to ever updated
vmcs01. E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
becomes unhibited while L2 is active, KVM will set various APICv controls
in vmcs02 and trigger a failed VM-Entry. The kicker is that, unless
running with nested_early_check=1, KVM blames L1 and chaos ensues.

In all cases, ignoring vmcs02 and always deferring the inhibition change
to vmcs01 is correct (or at least acceptable). The ABSENT and DISABLE
inhibitions cannot truly change while L2 is active (see below).

IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
Furthermore, only L2's APIC is accelerated/virtualized to the full extent
possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
interception will apply to the virtual APIC managed by KVM.
The exception is the SELF_IPI register when x2APIC is enabled, but that's
an acceptable hole.

Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
MSRs to L2, but for that to work in any sane capacity, L1 would need to
pass through IRQs to L2 as well, and IRQs must be intercepted to enable
virtual interrupt delivery. I.e. exposing Auto EOI to L2 and enabling
VID for L2 are, for all intents and purposes, mutually exclusive.

Lack of dynamic toggling is also why this scenario is all but impossible
to encounter in KVM's current form. But a future patch will pend an
APICv update request _during_ vCPU creation to plug a race where a vCPU
that's being created doesn't get included in the "all vCPUs request"
because it's not yet visible to other vCPUs. If userspaces restores L2
after VM creation (hello, KVM selftests), the first KVM_RUN will occur
while L2 is active and thus service the APICv update request made during
VM creation.

Cc: stable@vger.kernel.org
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220420013732.3308816-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7c69661e

14 4月, 2022 3 次提交

KVM: nVMX: Clear IDT vectoring on nested VM-Exit for double/triple fault · 9bd1f0ef

由 Sean Christopherson 提交于 4月 07, 2022

Clear the IDT vectoring field in vmcs12 on next VM-Exit due to a double
or triple fault.  Per the SDM, a VM-Exit isn't considered to occur during
event delivery if the exit is due to an intercepted double fault or a
triple fault.  Opportunistically move the default clearing (no event
"pending") into the helper so that it's more obvious that KVM does indeed
handle this case.

Note, the double fault case is worded rather wierdly in the SDM:

  The original event results in a double-fault exception that causes the
  VM exit directly.

Temporarily ignoring injected events, double faults can _only_ occur if
an exception occurs while attempting to deliver a different exception,
i.e. there's _always_ an original event.  And for injected double fault,
while there's no original event, injected events are never subject to
interception.

Presumably the SDM is calling out that a the vectoring info will be valid
if a different exit occurs after a double fault, e.g. if a #PF occurs and
is intercepted while vectoring #DF, then the vectoring info will show the
double fault.  In other words, the clause can simply be read as:

  The VM exit is caused by a double-fault exception.

Fixes: 4704d0be ("KVM: nVMX: Exiting from L2 to L1")
Cc: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220407002315.78092-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9bd1f0ef

KVM: nVMX: Leave most VM-Exit info fields unmodified on failed VM-Entry · c3634d25

由 Sean Christopherson 提交于 4月 07, 2022

Don't modify vmcs12 exit fields except EXIT_REASON and EXIT_QUALIFICATION
when performing a nested VM-Exit due to failed VM-Entry.  Per the SDM,
only the two aformentioned fields are filled and "All other VM-exit
information fields are unmodified".

Fixes: 4704d0be ("KVM: nVMX: Exiting from L2 to L1")
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220407002315.78092-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c3634d25

KVM: x86: Drop WARNs that assert a triple fault never "escapes" from L2 · 45846661

由 Sean Christopherson 提交于 4月 07, 2022

Remove WARNs that sanity check that KVM never lets a triple fault for L2
escape and incorrectly end up in L1.  In normal operation, the sanity
check is perfectly valid, but it incorrectly assumes that it's impossible
for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through
KVM_RUN (which guarantees kvm_check_nested_state() will see and handle
the triple fault).

The WARN can currently be triggered if userspace injects a machine check
while L2 is active and CR4.MCE=0.  And a future fix to allow save/restore
of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't
lost on migration, will make it trivially easy for userspace to trigger
the WARN.

Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is
tempting, but wrong, especially if/when the request is saved/restored,
e.g. if userspace restores events (including a triple fault) and then
restores nested state (which may forcibly leave guest mode).  Ignoring
the fact that KVM doesn't currently provide the necessary APIs, it's
userspace's responsibility to manage pending events during save/restore.

  ------------[ cut here ]------------
  WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
  Call Trace:
   <TASK>
   vmx_leave_nested+0x30/0x40 [kvm_intel]
   vmx_set_nested_state+0xca/0x3e0 [kvm_intel]
   kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm]
   kvm_vcpu_ioctl+0x4b9/0x660 [kvm]
   __x64_sys_ioctl+0x83/0xb0
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  ---[ end trace 0000000000000000 ]---

Fixes: cb6a32c2 ("KVM: x86: Handle triple fault in L2 without killing L1")
Cc: stable@vger.kernel.org
Cc: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220407002315.78092-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

45846661

25 2月, 2022 5 次提交

KVM: x86/mmu: load new PGD after the shadow MMU is initialized · 3cffc89d

由 Paolo Bonzini 提交于 2月 04, 2022

Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
shadow_root_level anymore, pull the PGD load after the initialization of
the shadow MMUs.

Besides being more intuitive, this enables future simplifications
and optimizations because it's not necessary anymore to compute the
role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
attempting to use a cached PGD to avoid having to figure out the new role.
With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing,
and avoid unloading all the cached roots.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3cffc89d

KVM: x86/mmu: do not pass vcpu to root freeing functions · 0c1c92f1

由 Paolo Bonzini 提交于 2月 21, 2022

These functions only operate on a given MMU, of which there is more
than one in a vCPU (we care about two, because the third does not have
any roots and is only used to walk guest page tables). They do need a
struct kvm in order to lock the mmu_lock, but they do not needed anything
else in the struct kvm_vcpu. So, pass the vcpu->kvm directly to them.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0c1c92f1

KVM: x86: use struct kvm_mmu_root_info for mmu->root · b9e5603c

由 Paolo Bonzini 提交于 2月 21, 2022

The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
Use the struct to have more consistency between mmu->root and
mmu->prev_roots.

The patch is entirely search and replace except for cached_root_available,
which does not need a temporary struct kvm_mmu_root_info anymore.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b9e5603c

Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()" · 1a715810

由 Sean Christopherson 提交于 2月 24, 2022

Revert back to refreshing vmcs.HOST_CR3 immediately prior to VM-Enter.
The PCID (ASID) part of CR3 can be bumped without KVM being scheduled
out, as the kernel will switch CR3 during __text_poke(), e.g. in response
to a static key toggling.  If switch_mm_irqs_off() chooses a new ASID for
the mm associate with KVM, KVM will do VM-Enter => VM-Exit with a stale
vmcs.HOST_CR3.

Add a comment to explain why KVM must wait until VM-Enter is imminent to
refresh vmcs.HOST_CR3.

The following splat was captured by stashing vmcs.HOST_CR3 in kvm_vcpu
and adding a WARN in load_new_mm_cr3() to fire if a new ASID is being
loaded for the KVM-associated mm while KVM has a "running" vCPU:

  static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
  {
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	...

	WARN(vcpu && (vcpu->cr3 & GENMASK(11, 0)) != (new_mm_cr3 & GENMASK(11, 0)) &&
	     (vcpu->cr3 & PHYSICAL_PAGE_MASK) == (new_mm_cr3 & PHYSICAL_PAGE_MASK),
	     "KVM is hosed, loading CR3 = %lx, vmcs.HOST_CR3 = %lx", new_mm_cr3, vcpu->cr3);
  }

  ------------[ cut here ]------------
  KVM is hosed, loading CR3 = 8000000105393004, vmcs.HOST_CR3 = 105393003
  WARNING: CPU: 4 PID: 20717 at arch/x86/mm/tlb.c:291 load_new_mm_cr3+0x82/0xe0
  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel
  CPU: 4 PID: 20717 Comm: stable Tainted: G        W         5.17.0-rc3+ #747
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:load_new_mm_cr3+0x82/0xe0
  RSP: 0018:ffffc9000489fa98 EFLAGS: 00010082
  RAX: 0000000000000000 RBX: 8000000105393004 RCX: 0000000000000027
  RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277d1b788
  RBP: 0000000000000004 R08: ffff888277d1b780 R09: ffffc9000489f8b8
  R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
  R13: ffff88810678a800 R14: 0000000000000004 R15: 0000000000000c33
  FS:  00007fa9f0e72700(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 00000001001b5003 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   switch_mm_irqs_off+0x1cb/0x460
   __text_poke+0x308/0x3e0
   text_poke_bp_batch+0x168/0x220
   text_poke_finish+0x1b/0x30
   arch_jump_label_transform_apply+0x18/0x30
   static_key_slow_inc_cpuslocked+0x7c/0x90
   static_key_slow_inc+0x16/0x20
   kvm_lapic_set_base+0x116/0x190
   kvm_set_apic_base+0xa5/0xe0
   kvm_set_msr_common+0x2f4/0xf60
   vmx_set_msr+0x355/0xe70 [kvm_intel]
   kvm_set_msr_ignored_check+0x91/0x230
   kvm_emulate_wrmsr+0x36/0x120
   vmx_handle_exit+0x609/0x6c0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x146f/0x1b80
   kvm_vcpu_ioctl+0x279/0x690
   __x64_sys_ioctl+0x83/0xb0
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  ---[ end trace 0000000000000000 ]---

This reverts commit 15ad9762.

Fixes: 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
Reported-by: NWanpeng Li <kernellwp@gmail.com>
Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Acked-by: NLai Jiangshan <jiangshanlai@gmail.com>
Message-Id: <20220224191917.3508476-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1a715810

Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()" · bca06b85

由 Sean Christopherson 提交于 2月 24, 2022

Undo a nested VMX fix as a step toward reverting the commit it fixed,
15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"),
as the underlying premise that "host CR3 in the vcpu thread can only be
changed when scheduling" is wrong.

This reverts commit a9f2705e.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220224191917.3508476-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bca06b85

11 2月, 2022 1 次提交

KVM: nVMX: Refactor PMU refresh to avoid referencing kvm_x86_ops.pmu_ops · 0bcd556e

由 Sean Christopherson 提交于 1月 28, 2022

Refactor the nested VMX PMU refresh helper to pass it a flag stating
whether or not the vCPU has PERF_GLOBAL_CTRL instead of having the nVMX
helper query the information by bouncing through kvm_x86_ops.pmu_ops.
This will allow a future patch to use static_call() for the PMU ops
without having to export any static call definitions from common x86, and
it is also a step toward unexported kvm_x86_ops.

Alternatively, nVMX could call kvm_pmu_is_valid_msr() to indirectly use
kvm_x86_ops.pmu_ops, but that would incur an extra layer of indirection
and would require exporting kvm_pmu_is_valid_msr().

Opportunistically rename the helper to keep line lengths somewhat
reasonable, and to better capture its high-level role.

No functional change intended.

Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-9-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0bcd556e

28 1月, 2022 2 次提交

KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use · 6cbbaab6

由 Vitaly Kuznetsov 提交于 1月 12, 2022

Hyper-V TLFS explicitly forbids VMREAD and VMWRITE instructions when
Enlightened VMCS interface is in use:

"Any VMREAD or VMWRITE instructions while an enlightened VMCS is
active is unsupported and can result in unexpected behavior.""

Windows 11 + WSL2 seems to ignore this, attempts to VMREAD VMCS field
0x4404 ("VM-exit interruption information") are observed. Failing
these attempts with nested_vmx_failInvalid() makes such guests
unbootable.

Microsoft confirms this is a Hyper-V bug and claims that it'll get fixed
eventually but for the time being we need a workaround. (Temporary) allow
VMREAD to get data from the currently loaded Enlightened VMCS.

Note: VMWRITE instructions remain forbidden, it is not clear how to
handle them properly and hopefully won't ever be needed.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-6-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6cbbaab6

KVM: nVMX: Rename vmcs_to_field_offset{,_table} · 2423a4c0

由 Vitaly Kuznetsov 提交于 1月 12, 2022

vmcs_to_field_offset{,_table} may sound misleading as VMCS is an opaque
blob which is not supposed to be accessed directly. In fact,
vmcs_to_field_offset{,_table} are related to KVM defined VMCS12 structure.

Rename vmcs_field_to_offset() to get_vmcs12_field_offset() for clarity.

No functional change intended.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-4-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2423a4c0

27 1月, 2022 2 次提交

KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02 · d6e656cd

由 Sean Christopherson 提交于 1月 25, 2022

WARN if KVM attempts to allocate a shadow VMCS for vmcs02.  KVM emulates
VMCS shadowing but doesn't virtualize it, i.e. KVM should never allocate
a "real" shadow VMCS for L2.

The previous code WARNed but continued anyway with the allocation,
presumably in an attempt to avoid NULL pointer dereference.
However, alloc_vmcs (and hence alloc_shadow_vmcs) can fail, and
indeed the sole caller does:

	if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu))
		goto out_shadow_vmcs;

which makes it not a useful attempt.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220125220527.2093146-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d6e656cd

KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078

由 Sean Christopherson 提交于 1月 25, 2022

Forcibly leave nested virtualization operation if userspace toggles SMM
state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
vmxon=false and smm.vmxon=false, but all other nVMX state allocated.

Don't attempt to gracefully handle the transition as (a) most transitions
are nonsencial, e.g. forcing SMM while L2 is running, (b) there isn't
sufficient information to handle all transitions, e.g. SVM wants access
to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
KVM_SET_NESTED_STATE during state restore as the latter disallows putting
the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
being post-VMXON in SMM if SMM is not active.

Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
in an architecturally impossible state.

  WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
  WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
  Modules linked in:
  CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
  RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
  Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
  Call Trace:
   <TASK>
   kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
   kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
   kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
   kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
   kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
   kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
   kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
   kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
   __fput+0x286/0x9f0 fs/file_table.c:311
   task_work_run+0xdd/0x1a0 kernel/task_work.c:164
   exit_task_work include/linux/task_work.h:32 [inline]
   do_exit+0xb29/0x2a30 kernel/exit.c:806
   do_group_exit+0xd2/0x2f0 kernel/exit.c:935
   get_signal+0x4b0/0x28c0 kernel/signal.c:2862
   arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
   handle_signal_work kernel/entry/common.c:148 [inline]
   exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
   exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
   __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
   syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
   do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>

Cc: stable@vger.kernel.org
Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220125220358.2091737-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f7e57078

07 1月, 2022 2 次提交

KVM: x86: Update vPMCs when retiring branch instructions · 018d70ff

由 Eric Hankland 提交于 11月 30, 2021

When KVM retires a guest branch instruction through emulation,
increment any vPMCs that are configured to monitor "branch
instructions retired," and update the sample period of those counters
so that they will overflow at the right time.
Signed-off-by: NEric Hankland <ehankland@google.com>
[jmattson:
  - Split the code to increment "branch instructions retired" into a
    separate commit.
  - Moved/consolidated the calls to kvm_pmu_trigger_event() in the
    emulation of VMLAUNCH/VMRESUME to accommodate the evolution of
    that code.
]
Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
Signed-off-by: NJim Mattson <jmattson@google.com>
Message-Id: <20211130074221.93635-7-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

018d70ff

KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs() · a9f2705e

由 Lai Jiangshan 提交于 12月 16, 2021

The host CR3 in the vcpu thread can only be changed when scheduling,
so commit 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
changed vmx.c to only save it in vmx_prepare_switch_to_guest().

However, it also has to be synced in vmx_sync_vmcs_host_state() when switching VMCS.
vmx_set_host_fs_gs() is called in both places, so rename it to
vmx_set_vmcs_host_state() and make it update HOST_CR3.

Fixes: 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-2-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a9f2705e

08 12月, 2021 8 次提交

KVM: nVMX: Implement Enlightened MSR Bitmap feature · 502d2bf5

由 Vitaly Kuznetsov 提交于 11月 29, 2021

Updating MSR bitmap for L2 is not cheap and rearly needed. TLFS for Hyper-V
offers 'Enlightened MSR Bitmap' feature which allows L1 hypervisor to
inform L0 when it changes MSR bitmap, this eliminates the need to examine
L1's MSR bitmap for L2 every time when 'real' MSR bitmap for L2 gets
constructed.

Use 'vmx->nested.msr_bitmap_changed' flag to implement the feature.

Note, KVM already uses 'Enlightened MSR bitmap' feature when it runs as a
nested hypervisor on top of Hyper-V. The newly introduced feature is going
to be used by Hyper-V guests on KVM.

When the feature is enabled for Win10+WSL2, it shaves off around 700 CPU
cycles from a nested vmexit cost (tight cpuid loop test).
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-5-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

502d2bf5

KVM: nVMX: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt · ed2a4800

由 Vitaly Kuznetsov 提交于 11月 29, 2021

Introduce a flag to keep track of whether MSR bitmap for L2 needs to be
rebuilt due to changes in MSR bitmap for L1 or switching to a different
L2. This information will be used for Enlightened MSR Bitmap feature for
Hyper-V guests.

Note, setting msr_bitmap_changed to 'true' from set_current_vmptr() is
not really needed for Enlightened MSR Bitmap as the feature can only
be used in conjunction with Enlightened VMCS but let's keep tracking
information complete, it's cheap and in the future similar PV feature can
easily be implemented for KVM on KVM too.

No functional change intended.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-4-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ed2a4800

KVM: X86: Remove mmu parameter from load_pdptrs() · 2df4a5eb

由 Lai Jiangshan 提交于 11月 24, 2021

It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs,
and nested NPT treats them like all other non-leaf page table levels
instead of caching them.
Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2df4a5eb

KVM: X86: Add parameter huge_page_level to kvm_init_shadow_ept_mmu() · cc022ae1

由 Lai Jiangshan 提交于 11月 24, 2021

The level of supported large page on nEPT affects the rsvds_bits_mask.
Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-8-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cc022ae1

KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest() · 15ad9762

由 Lai Jiangshan 提交于 11月 18, 2021

The host CR3 in the vcpu thread can only be changed when scheduling.
Moving the code in vmx_prepare_switch_to_guest() makes the code
simpler.
Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-5-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

15ad9762

KVM: vmx, svm: clean up mass updates to regs_avail/regs_dirty bits · 41e68b69

由 Paolo Bonzini 提交于 11月 26, 2021

Document the meaning of the three combinations of regs_avail and
regs_dirty.  Update regs_dirty just after writeback instead of
doing it later after vmexit.  After vmexit, instead, we clear the
regs_avail bits corresponding to lazily-loaded registers.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

41e68b69

KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changed · 3883bc9d

由 Lai Jiangshan 提交于 11月 08, 2021

When vcpu->arch.cr3 is changed, it should be marked dirty unless it
is being updated to the value of the architecture guest CR3 (i.e.
VMX.GUEST_CR3 or vmcb->save.cr3 when tdp is enabled).

This patch has no functionality changed because
kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3) is superset of
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3) with additional
change to vcpu->arch.regs_dirty, but no code uses regs_dirty for
VCPU_EXREG_CR3.  (vmx_load_mmu_pgd() uses vcpu->arch.regs_avail instead
to test if VCPU_EXREG_CR3 dirty which means current code (ab)uses
regs_avail for VCPU_EXREG_CR3 dirty information.)
Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-11-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3883bc9d

KVM: x86: Tweak halt emulation helper names to free up kvm_vcpu_halt() · 1460179d

由 Sean Christopherson 提交于 10月 08, 2021

Rename a variety of HLT-related helpers to free up the function name
"kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate
between "block" and "halt".

No functional change intended.
Reviewed-by: NDavid Matlack <dmatlack@google.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-13-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1460179d

02 12月, 2021 1 次提交

KVM: VMX: Set failure code in prepare_vmcs02() · bfbb307c

由 Dan Carpenter 提交于 11月 30, 2021

The error paths in the prepare_vmcs02() function are supposed to set
*entry_failure_code but this path does not.  It leads to using an
uninitialized variable in the caller.

Fixes: 71f73470 ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry")
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Message-Id: <20211130125337.GB24578@kili>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bfbb307c

26 11月, 2021 3 次提交

KVM: nVMX: Emulate guest TLB flush on nested VM-Enter with new vpid12 · 712494de

由 Sean Christopherson 提交于 11月 25, 2021

Fully emulate a guest TLB flush on nested VM-Enter which changes vpid12,
i.e. L2's VPID, instead of simply doing INVVPID to flush real hardware's
TLB entries for vpid02. From L1's perspective, changing L2's VPID is
effectively a TLB flush unless "hardware" has previously cached entries
for the new vpid12. Because KVM tracks only a single vpid12, KVM doesn't
know if the new vpid12 has been used in the past and so must treat it as
a brand new, never been used VPID, i.e. must assume that the new vpid12
represents a TLB flush from L1's perspective.

For example, if L1 and L2 share a CR3, the first VM-Enter to L2 (with a
VPID) is effectively a TLB flush as hardware/KVM has never seen vpid12
and thus can't have cached entries in the TLB for vpid12.
Reported-by: NLai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 5c614b35 ("KVM: nVMX: nested VPID emulation")
Cc: stable@vger.kernel.org
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

712494de

KVM: nVMX: Abide to KVM_REQ_TLB_FLUSH_GUEST request on nested vmentry/vmexit · 40e5f908

由 Sean Christopherson 提交于 11月 25, 2021

Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
nested transitions, as KVM doesn't track requests for L1 vs L2. E.g. if
there's a pending flush when a nested VM-Exit occurs, then the flush was
requested in the context of L2 and needs to be handled before switching
to L1, otherwise the flush for L2 would effectiely be lost.

Opportunistically add a helper to handle CURRENT and GUEST as a pair, the
logic for when they need to be serviced is identical as both requests are
tied to L1 vs. L2, the only difference is the scope of the flush.
Reported-by: NLai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf34 ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

40e5f908

KVM: VMX: do not use uninitialized gfn_to_hva_cache · 8503fea6

由 Paolo Bonzini 提交于 11月 22, 2021

An uninitialized gfn_to_hva_cache has ghc->len == 0, which causes
the accessors to croak very loudly.  While a BUG_ON is definitely
_too_ loud and a bug on its own, there is indeed an issue of using
the caches in such a way that they could not have been initialized,
because ghc->gpa == 0 might match and thus kvm_gfn_to_hva_cache_init
would not be called.

For the vmcs12_cache, the solution is simply to invoke
kvm_gfn_to_hva_cache_init unconditionally: we already know
that the cache does not match the current VMCS pointer.
For the shadow_vmcs12_cache, there is no similar condition
that checks the VMCS link pointer, so invalidate the cache
on VMXON.

Fixes: cee66664 ("KVM: nVMX: Use a gfn_to_hva_cache for vmptrld")
Acked-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Reported-by: syzbot+7b7db8bb4db6fd5e157b@syzkaller.appspotmail.com
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8503fea6

18 11月, 2021 4 次提交

KVM: nVMX: Use a gfn_to_hva_cache for vmptrld · cee66664

由 David Woodhouse 提交于 11月 15, 2021

And thus another call to kvm_vcpu_map() can die.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211115165030.7422-7-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cee66664

KVM: nVMX: Use kvm_read_guest_offset_cached() for nested VMCS check · 7d0172b3

由 David Woodhouse 提交于 11月 15, 2021

Kill another mostly gratuitous kvm_vcpu_map() which could just use the
userspace HVA for it.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211115165030.7422-6-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7d0172b3

KVM: nVMX: Use kvm_{read,write}_guest_cached() for shadow_vmcs12 · 297d597a

由 David Woodhouse 提交于 11月 15, 2021

Using kvm_vcpu_map() for reading from the guest is entirely gratuitous,
when all we do is a single memcpy and unmap it again. Fix it up to use
kvm_read_guest()... but in fact I couldn't bring myself to do that
without also making it use a gfn_to_hva_cache for both that *and* the
copy in the other direction.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211115165030.7422-5-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

297d597a

KVM: nVMX: don't use vcpu->arch.efer when checking host state on nested state load · af957eeb

由 Maxim Levitsky 提交于 11月 15, 2021

When loading nested state, don't use check vcpu->arch.efer to get the
L1 host's 64-bit vs. 32-bit state and don't check it for consistency
with respect to VM_EXIT_HOST_ADDR_SPACE_SIZE, as register state in vCPU
may be stale when KVM_SET_NESTED_STATE is called---and architecturally
does not exist. When restoring L2 state in KVM, the CPU is placed in
non-root where nested VMX code has no snapshot of L1 host state: VMX
(conditionally) loads host state fields loaded on VM-exit, but they need
not correspond to the state before entry. A simple case occurs in KVM
itself, where the host RIP field points to vmx_vmexit rather than the
instruction following vmlaunch/vmresume.

However, for the particular case of L1 being in 32- or 64-bit mode
on entry, the exit controls can be treated instead as the source of
truth regarding the state of L1 on entry, and can be used to check
that vmcs12.VM_EXIT_HOST_ADDR_SPACE_SIZE matches vmcs12.HOST_EFER if
vmcs12.VM_EXIT_LOAD_IA32_EFER is set. The consistency check on CPU
EFER vs. vmcs12.VM_EXIT_HOST_ADDR_SPACE_SIZE, instead, happens only
on VM-Enter. That's because, again, there's conceptually no "current"
L1 EFER to check on KVM_SET_NESTED_STATE.
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211115131837.195527-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

af957eeb

11 11月, 2021 3 次提交

KVM: VMX: Add a helper function to retrieve the GPR index for INVPCID, INVVPID, and INVEPT · 329bd56c

由 Vipin Sharma 提交于 11月 09, 2021

handle_invept(), handle_invvpid(), handle_invpcid() read the same reg2
field in vmcs.VMX_INSTRUCTION_INFO to get the index of the GPR that
holds the invalidation type. Add a helper to retrieve reg2 from VMX
instruction info to consolidate and document the shift+mask magic.
Signed-off-by: NVipin Sharma <vipinsh@google.com>
Reviewed-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20211109174426.2350547-2-vipinsh@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

329bd56c

KVM: nVMX: Clean up x2APIC MSR handling for L2 · a5e0c252

由 Sean Christopherson 提交于 11月 09, 2021

Clean up the x2APIC MSR bitmap intereption code for L2, which is the last
holdout of open coded bitmap manipulations.  Freshen up the SDM/PRM
comment, rename the function to make it abundantly clear the funky
behavior is x2APIC specific, and explain _why_ vmcs01's bitmap is ignored
(the previous comment was flat out wrong for x2APIC behavior).

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20211109013047.2041518-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a5e0c252

KVM: nVMX: Handle dynamic MSR intercept toggling · 67f4b996

由 Sean Christopherson 提交于 11月 09, 2021

Always check vmcs01's MSR bitmap when merging L0 and L1 bitmaps for L2,
and always update the relevant bits in vmcs02.  This fixes two distinct,
but intertwined bugs related to dynamic MSR bitmap modifications.

The first issue is that KVM fails to enable MSR interception in vmcs02
for the FS/GS base MSRs if L1 first runs L2 with interception disabled,
and later enables interception.

The second issue is that KVM fails to honor userspace MSR filtering when
preparing vmcs02.

Fix both issues simultaneous as fixing only one of the issues (doesn't
matter which) would create a mess that no one should have to bisect.
Fixing only the first bug would exacerbate the MSR filtering issue as
userspace would see inconsistent behavior depending on the whims of L1.
Fixing only the second bug (MSR filtering) effectively requires fixing
the first, as the nVMX code only knows how to transition vmcs02's
bitmap from 1->0.

Move the various accessor/mutators that are currently buried in vmx.c
into vmx.h so that they can be shared by the nested code.

Fixes: 1a155254 ("KVM: x86: Introduce MSR filtering")
Fixes: d69129b4 ("KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible")
Cc: stable@vger.kernel.org
Cc: Alexander Graf <graf@amazon.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20211109013047.2041518-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

67f4b996

25 10月, 2021 1 次提交

KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info · 0a62a031

由 David Edmondson 提交于 9月 20, 2021

Extend the get_exit_info static call to provide the reason for the VM
exit. Modify relevant trace points to use this rather than extracting
the reason in the caller.
Signed-off-by: NDavid Edmondson <david.edmondson@oracle.com>
Reviewed-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20210920103737.2696756-3-david.edmondson@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0a62a031

30 9月, 2021 1 次提交

KVM: nVMX: Reset vmxon_ptr upon VMXOFF emulation. · feb3162f

由 Vitaly Kuznetsov 提交于 9月 30, 2021

Currently, 'vmx->nested.vmxon_ptr' is not reset upon VMXOFF
emulation. This is not a problem per se as we never access
it when !vmx->nested.vmxon. But this should be done to avoid
any issue in the future.

Also, initialize the vmxon_ptr when vcpu is created.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210929175154.11396-3-yu.c.zhang@linux.intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

feb3162f

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功