1. 23 Sep 2020 (1 commit)
  2. 12 Sep 2020 (1 commit)
  3. 21 Aug 2020 (1 commit)
  4. 18 Aug 2020 (3 commits)
    • kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode · cb957adb
      Jim Mattson authored
      See the SDM, volume 3, section 4.4.1:
      
      If PAE paging would be in use following an execution of MOV to CR0 or
      MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
      CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
      the PDPTEs are loaded from the address in CR3.
      
      Fixes: b9baba86 ("KVM, pkeys: expose CPUID/CR4 to guest")
      Cc: Huaitong Han <huaitong.han@intel.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Message-Id: <20200817181655.3716509-1-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode · 427890af
      Jim Mattson authored
      See the SDM, volume 3, section 4.4.1:
      
      If PAE paging would be in use following an execution of MOV to CR0 or
      MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
      CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
      the PDPTEs are loaded from the address in CR3.
      
      Fixes: 0be0226f ("KVM: MMU: fix SMAP virtualization")
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Message-Id: <20200817181655.3716509-2-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
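      For illustration, a minimal sketch (not the exact diff) of the idea behind both commits above: keep the PDPTE-reload trigger limited to the CR4 bits the SDM actually lists, so that toggling CR4.PKE or CR4.SMAP no longer takes the load_pdptrs() path. The pdptr_bits mask and the surrounding kvm_set_cr4()-style check are simplified assumptions, not the real patch.

          /*
           * Sketch only: the CR4 bits that, per SDM vol. 3 sec. 4.4.1, require a
           * PDPTE reload under PAE paging.  CR4.PKE and CR4.SMAP are deliberately
           * not part of this set.
           */
          unsigned long pdptr_bits = X86_CR4_PAE | X86_CR4_PGE | X86_CR4_PSE |
                                     X86_CR4_SMEP;

          /* Inside a kvm_set_cr4()-style handler (simplified): */
          if (!is_long_mode(vcpu) && is_paging(vcpu) && (cr4 & X86_CR4_PAE) &&
              ((cr4 ^ old_cr4) & pdptr_bits) &&
              !load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)))
                  return 1;   /* PDPTEs could not be loaded; fail the MOV to CR4 */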
    • KVM: x86: fix access code passed to gva_to_gpa · 19cf4b7e
      Paolo Bonzini authored
      The PK bit of the error code is computed dynamically in permission_fault
      and therefore need not be passed to gva_to_gpa: only the access bits
      (fetch, user, write) need to be passed down.
      
      Not doing so causes a splat in the pku test:
      
         WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm]
         Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020
         RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm]
         Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89
         RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202
         RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007
         RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000
         RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7
         R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007
         R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000
         FS:  00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0
         DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         PKRU: 55555554
         Call Trace:
          paging64_gva_to_gpa+0x3f/0xb0 [kvm]
          kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm]
          handle_exception_nmi+0x4fc/0x5b0 [kvm_intel]
          kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm]
          kvm_vcpu_ioctl+0x23e/0x5d0 [kvm]
          ksys_ioctl+0x92/0xb0
          __x64_sys_ioctl+0x16/0x20
          do_syscall_64+0x3e/0xb0
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
         ---[ end trace d17eb998aee991da ]---
      Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Fixes: 89786147 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
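      For context, a hedged sketch of the shape of the fix in kvm_fixup_and_inject_pf_error(): mask the error code down to the access bits before calling the MMU's gva_to_gpa callback, since PFERR_PK_MASK is recomputed inside permission_fault(). Variable names below are illustrative.

          /* Sketch: pass only fetch/user/write down; the PK bit is derived later. */
          u32 access = error_code & (PFERR_WRITE_MASK | PFERR_FETCH_MASK |
                                     PFERR_USER_MASK);
          struct x86_exception fault;

          gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, &fault);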
  5. 10 Aug 2020 (1 commit)
    • KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled · 05487215
      Sean Christopherson authored
      Don't attempt to load PDPTRs if EFER.LME=1, i.e. if 64-bit mode is
      enabled.  A recent change to reload the PDPTRs when CR0.CD or CR0.NW is
      toggled botched the EFER.LME handling and sends KVM down the PDPTR path
      when is_paging() is true, i.e. when the guest toggles CD/NW in 64-bit
      mode.
      
      Split the CR0 checks for 64-bit vs. 32-bit PAE into separate paths.  The
      64-bit path is specifically checking state when paging is toggled on,
      i.e. CR0.PG transitions from 0->1.  The PDPTR path now needs to run if
      the new CR0 state has paging enabled, irrespective of whether paging was
      already enabled.  Trying to shave a few cycles to make the PDPTR path an
      "else if" case is a mess.
      
      Fixes: d42e3fae ("kvm: x86: Read PDPTEs on CR0.CD and CR0.NW changes")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Peter Shier <pshier@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20200714015732.32426-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
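      A rough sketch of the split described above, inside a kvm_set_cr0()-style handler: the 64-bit path only cares about the CR0.PG 0->1 transition, while the PDPTR path runs only when EFER.LME is clear and the new CR0 value has paging enabled. This is condensed from the description; enter_long_mode() is an illustrative placeholder, not a real helper.

      #ifdef CONFIG_X86_64
          /* 64-bit: handle only the 0->1 CR0.PG transition (entering long mode). */
          if ((vcpu->arch.efer & EFER_LME) && !is_paging(vcpu) &&
              (cr0 & X86_CR0_PG))
                  enter_long_mode(vcpu);               /* illustrative placeholder */
      #endif
          /* 32-bit PAE: reload PDPTRs whenever the new CR0 has paging enabled. */
          if (!(vcpu->arch.efer & EFER_LME) && (cr0 & X86_CR0_PG) &&
              is_pae(vcpu) &&
              !load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)))
                  return 1;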
  6. 05 Aug 2020 (1 commit)
  7. 31 Jul 2020 (2 commits)
  8. 30 Jul 2020 (1 commit)
    • x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu() · f3020b88
      Thomas Gleixner authored
      The comments explicitly explain that the work flags check and handling in
      kvm_run_vcpu() is done with preemption and interrupts enabled as KVM
      invokes the check again right before entering guest mode with interrupts
      disabled which guarantees that the work flags are observed and handled
      before VMENTER.
      
      Nevertheless, the pending-work check in kvm_run_vcpu() uses the helper
      variant which requires interrupts to be disabled, triggering an instant
      lockdep splat. This was caught in testing earlier but was not fixed up
      before the patch was applied. :(
      
      Use the relaxed and intentionally racy __xfer_to_guest_mode_work_pending()
      instead.
      
      Fixes: 72c3c0fe ("x86/kvm: Use generic xfer to guest work function")
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/87bljxa2sa.fsf@nanos.tec.linutronix.de
      
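      A hedged sketch of the corrected outer-loop check in the vcpu run loop: the relaxed helper may be used with preemption and interrupts enabled, because the strict check is repeated with interrupts disabled right before VMENTER. The surrounding srcu handling is assumed context, not the exact diff.

          /* Outer loop, preemption and interrupts enabled (sketch). */
          if (__xfer_to_guest_mode_work_pending()) {
                  srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
                  r = xfer_to_guest_mode_handle_work(vcpu);
                  vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
                  if (r)
                          return r;
          }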
  9. 24 Jul 2020 (1 commit)
  10. 17 Jul 2020 (1 commit)
  11. 11 Jul 2020 (4 commits)
    • KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support · 3edd6839
      Mohammed Gamal authored
      This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which
      allows userspace to query whether the underlying architecture would
      support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly
      (e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X).
      
      The complications in this patch are due to unexpected (but documented)
      behaviour we see with NPF vmexit handling on AMD processors.  If
      SVM is modified to add guest physical address checks in the NPF
      and guest #PF paths, we see the following error multiple times in
      the 'access' test in kvm-unit-tests:
      
                  test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001
                  Dump mapping: address: 0x123400000000
                  ------L4: 24c3027
                  ------L3: 24c4027
                  ------L2: 24c5021
                  ------L1: 1002000021
      
      This is because the PTE's accessed bit is set by the CPU hardware before
      the NPF vmexit. This is handled completely by hardware and cannot be fixed
      in software.
      
      Therefore, availability of the new capability depends on a boolean variable
      allow_smaller_maxphyaddr which is set individually by VMX and SVM init
      routines. On VMX it's always set to true, on SVM it's only set to true
      when NPT is not enabled.
      
      CC: Tom Lendacky <thomas.lendacky@amd.com>
      CC: Babu Moger <babu.moger@amd.com>
      Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
      Message-Id: <20200710154811.418214-10-mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
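      From userspace, the new capability is queried like any other KVM extension. A minimal example, assuming a uapi header new enough to define KVM_CAP_SMALLER_MAXPHYADDR (error handling omitted):

          #include <fcntl.h>
          #include <stdio.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          int main(void)
          {
                  int kvm = open("/dev/kvm", O_RDWR);
                  int r = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_SMALLER_MAXPHYADDR);

                  /* KVM_CHECK_EXTENSION returns > 0 when the capability is present. */
                  printf("GUEST_MAXPHYADDR < HOST_MAXPHYADDR: %s\n",
                         r > 0 ? "supported" : "not supported");
                  return 0;
          }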
    • KVM: x86: rename update_bp_intercept to update_exception_bitmap · 6986982f
      Paolo Bonzini authored
      We would like to introduce a callback to update the #PF intercept
      when CPUID changes.  Just reuse update_bp_intercept since VMX is
      already using update_exception_bitmap instead of a bespoke function.
      
      While at it, remove an unnecessary assignment in the SVM version,
      which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug)
      and has nothing to do with the exception bitmap.
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add helper functions for illegal GPA checking and page fault injection · 89786147
      Mohammed Gamal authored
      This patch adds two helper functions that will be used to support virtualizing
      MAXPHYADDR in both kvm-intel.ko and kvm.ko.
      
      kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA,
      while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits.
      Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20200710154811.418214-2-mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
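      One plausible shape of the GPA check, going only by the description above (a sketch, not the exact helper as merged): a GPA is illegal if it sets bits at or above the vCPU's reported MAXPHYADDR.

          /* Sketch: illegal if the GPA uses bits above the vCPU's MAXPHYADDR. */
          static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
          {
                  return gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu));
          }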
    • KVM: x86: move MSR_IA32_PERF_CAPABILITIES emulation to common x86 code · d574c539
      Vitaly Kuznetsov authored
      state_test/smm_test selftests are failing on AMD with:
      "Unexpected result from KVM_GET_MSRS, r: 51 (failed MSR was 0x345)"
      
      MSR_IA32_PERF_CAPABILITIES is an emulated MSR on Intel but it is not
      known to the AMD code; we can move the emulation to common x86 code. For
      AMD, we basically just allow the host to read and write zero to the MSR.
      
      Fixes: 27461da3 ("KVM: x86/pmu: Support full width counting")
      Suggested-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200710152559.1645827-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
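      In rough terms, the common-code emulation amounts to an extra case in the MSR get/set paths. A simplified sketch follows; the field name vcpu->arch.perf_capabilities and the exact guards (host_initiated handling, the PDCM CPUID bit) are assumptions, not the verbatim patch.

          /* Read path, kvm_get_msr_common()-style (sketch): */
          case MSR_IA32_PERF_CAPABILITIES:
                  msr_info->data = vcpu->arch.perf_capabilities;   /* 0 on AMD */
                  return 0;

          /* Write path, kvm_set_msr_common()-style (sketch): */
          case MSR_IA32_PERF_CAPABILITIES:
                  if (data & ~vcpu->arch.perf_capabilities)        /* only 0 on AMD */
                          return 1;
                  return 0;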
  12. 09 Jul 2020 (12 commits)
  13. 04 Jul 2020 (1 commit)
  14. 29 Jun 2020 (1 commit)
    • KVM: X86: Fix async pf caused null-ptr-deref · 9d3c447c
      Wanpeng Li authored
      Syzbot reported that:
      
        CPU: 1 PID: 6780 Comm: syz-executor153 Not tainted 5.7.0-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:__apic_accept_irq+0x46/0xb80
        Call Trace:
         kvm_arch_async_page_present+0x7de/0x9e0
         kvm_check_async_pf_completion+0x18d/0x400
         kvm_arch_vcpu_ioctl_run+0x18bf/0x69f0
         kvm_vcpu_ioctl+0x46a/0xe20
         ksys_ioctl+0x11a/0x180
         __x64_sys_ioctl+0x6f/0xb0
         do_syscall_64+0xf6/0x7d0
         entry_SYSCALL_64_after_hwframe+0x49/0xb3
      
      The testcase enables the APF mechanism in MSR_KVM_ASYNC_PF_EN with
      ASYNC_PF_INT enabled, without setting MSR_KVM_ASYNC_PF_INT first.  What's
      worse, interrupt-based APF 'page ready' event delivery depends on an
      in-kernel lapic, but we did not bail out when the lapic is not in the
      kernel while the guest sets MSR_KVM_ASYNC_PF_EN, which causes the
      null-ptr-deref in the host later.  This patch fixes it.
      
      Reported-by: syzbot+1bf777dfdde86d64b89b@syzkaller.appspotmail.com
      Fixes: 2635b5c4 (KVM: x86: interrupt based APF 'page ready' event delivery)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1593426391-8231-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
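      The shape of the guard described above, as a hedged sketch inside a kvm_pv_enable_async_pf()-style handler: refuse to enable APF when there is no in-kernel lapic to deliver the 'page ready' interrupt.

          /* Sketch: interrupt-based 'page ready' delivery needs an in-kernel lapic. */
          if (!lapic_in_kernel(vcpu))
                  return data ? 1 : 0;   /* only allow disabling APF in that case */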
  15. 23 Jun 2020 (2 commits)
    • KVM: x86: allow TSC to differ by NTP correction bounds without TSC scaling · 26769f96
      Marcelo Tosatti authored
      The Linux TSC calibration procedure is subject to small variations
      (it is common to see a +-1 kHz difference between reboots on a given CPU, for example).

      So migrating a guest between two hosts with identical processors can fail, in case
      of a small variation in the calibrated TSC between them.
      
      Without TSC scaling, the current kernel interface will either return an error
      (if user_tsc_khz <= tsc_khz) or enable TSC catchup mode.
      
      This change enables the following TSC tolerance check to
      accept KVM_SET_TSC_KHZ within tsc_tolerance_ppm (which is 250ppm by default).
      
              /*
               * Compute the variation in TSC rate which is acceptable
               * within the range of tolerance and decide if the
               * rate being applied is within that bounds of the hardware
               * rate.  If so, no scaling or compensation need be done.
               */
              thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm);
              thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm);
              if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) {
                      pr_debug("kvm: requested TSC rate %u falls outside tolerance [%u,%u]\n", user_tsc_khz, thresh_lo, thresh_hi);
                      use_scaling = 1;
              }
      
      The NTP daemon in the guest can correct this difference (NTP can correct up to 500 ppm).
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Message-Id: <20200616114741.GA298183@fuller.cnet>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
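      adjust_tsc_khz() in the snippet above scales a rate by a parts-per-million offset. A hedged sketch of what such a helper looks like is below (name and use of do_div assumed); as a worked number, with tsc_khz = 2,500,000 kHz, 250 ppm gives roughly +-625 kHz of slack, comfortably covering the +-1 kHz calibration jitter mentioned above.

          /* Sketch of a ppm-adjustment helper. */
          static u32 adjust_tsc_khz(u32 khz, s32 ppm)
          {
                  u64 v = (u64)khz * (1000000 + ppm);

                  do_div(v, 1000000);   /* scale back down to kHz */
                  return v;
          }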
    • KVM: X86: Fix MSR range of APIC registers in X2APIC mode · bf10bd0b
      Xiaoyao Li authored
      Only MSR address range 0x800 through 0x8ff is architecturally reserved
      and dedicated for accessing APIC registers in x2APIC mode.
      
      Fixes: 0105d1a5 ("KVM: x2apic interface to lapic")
      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Message-Id: <20200616073307.16440-1-xiaoyao.li@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
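      The fix boils down to narrowing the MSR range that is routed to the x2APIC handlers. A hedged sketch of the intended check in the common MSR read path (APIC_BASE_MSR is 0x800; the surrounding switch is assumed context):

          /* Sketch: only MSRs 0x800-0x8ff are reserved for x2APIC registers. */
          case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
                  return kvm_x2apic_msr_read(vcpu, msr, &msr_info->data);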
  16. 12 Jun 2020 (3 commits)
  17. 09 Jun 2020 (1 commit)
  18. 08 Jun 2020 (2 commits)
    • KVM: x86: Fix APIC page invalidation race · e649b3f0
      Eiichi Tsukata authored
      Commit b1394e74 ("KVM: x86: fix APIC page invalidation") tried
      to fix inappropriate APIC page invalidation by re-introducing arch
      specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
      kvm_mmu_notifier_invalidate_range_start. However, the patch left a
      possible race where the VMCS APIC address cache is updated *before*
      it is unmapped:
      
        (Invalidator) kvm_mmu_notifier_invalidate_range_start()
        (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
        (KVM VCPU) vcpu_enter_guest()
        (KVM VCPU) kvm_vcpu_reload_apic_access_page()
        (Invalidator) actually unmap page
      
      Because of the above race, there can be a mismatch between the
      host physical address stored in the APIC_ACCESS_PAGE VMCS field and
      the host physical address stored in the EPT entry for the APIC GPA
      (0xfee00000).  When this happens, the processor will not trap APIC
      accesses, and will instead show the raw contents of the APIC-access page.
      Because Windows OS periodically checks for unexpected modifications to
      the LAPIC register, this will show up as a BSOD crash with BugCheck
      CRITICAL_STRUCTURE_CORRUPTION (109), which we are currently seeing in
      https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
      
      The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
      cannot guarantee that no additional references are taken to the pages in
      the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
      this case is supported by the MMU notifier API, as documented in
      include/linux/mmu_notifier.h:
      
      	 * If the subsystem
               * can't guarantee that no additional references are taken to
               * the pages in the range, it has to implement the
               * invalidate_range() notifier to remove any references taken
               * after invalidate_range_start().
      
      The fix therefore is to reload the APIC-access page field in the VMCS
      from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
      
      Cc: stable@vger.kernel.org
      Fixes: b1394e74 ("KVM: x86: fix APIC page invalidation")
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
      Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
      Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
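      A hedged sketch of where the reload request ends up, per the description above: the arch invalidate_range() hook checks whether the invalidated HVA range covers the APIC-access page and, if so, asks all vCPUs to reload the cached VMCS field. This is a sketch of the fix's shape, not the verbatim patch.

          void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
                                                      unsigned long start,
                                                      unsigned long end)
          {
                  unsigned long apic_address;

                  /* The APIC-access page's address is cached in the VMCS; request
                   * a reload once the backing page is actually invalidated. */
                  apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
                  if (start <= apic_address && apic_address < end)
                          kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
          }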
    • Revert "KVM: x86: work around leak of uninitialized stack contents" · 25597f64
      Vitaly Kuznetsov authored
      handle_vmptrst()/handle_vmread() stopped injecting #PF unconditionally
      and switched to nested_vmx_handle_memory_failure(), which just kills the
      guest with KVM_EXIT_INTERNAL_ERROR in case of MMIO access, so zeroing
      'exception' in kvm_write_guest_virt_system() is not needed anymore.
      
      This reverts commit 541ab2ae.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200605115906.532682-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  19. 05 Jun 2020 (1 commit)