1. 01 Aug 2016, 2 commits
    • KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD · b80c76ec
      Authored by Jim Mattson
      Kexec needs to know the addresses of all VMCSs that are active on
      each CPU, so that it can flush them from the VMCS caches. It is
      safe to record superfluous addresses that are not associated with
      an active VMCS, but it is not safe to omit an address associated
      with an active VMCS.
      
      After a call to vmcs_load, the VMCS that was loaded is active on
      the CPU. The VMCS should be added to the CPU's list of active
      VMCSs before it is loaded.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • kvm: x86: nVMX: maintain internal copy of current VMCS · 4f2777bc
      Authored by David Matlack
      KVM maintains L1's current VMCS in guest memory, at the guest physical
      page identified by the argument to VMPTRLD. This makes hairy
      time-of-check to time-of-use bugs possible, as VCPUs can be writing
      the VMCS page in memory while KVM is emulating VMLAUNCH and
      VMRESUME.
      
      The spec documents that writing to the VMCS page while it is loaded is
      "undefined". Therefore it is reasonable to load the entire VMCS into
      an internal cache during VMPTRLD and ignore writes to the VMCS page
      -- the guest should be using VMREAD and VMWRITE to access the current
      VMCS.
      
      To adhere to the spec, KVM should flush the current VMCS during VMPTRLD,
      and the target VMCS during VMCLEAR (as given by the operand to VMCLEAR).
      Since this implementation of VMCS caching only maintains the current
      VMCS, VMCLEAR will only do a flush if the operand to VMCLEAR is the
      current VMCS pointer.
      
      KVM will also flush during VMXOFF, which is not mandated by the spec,
      but also not in conflict with the spec.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 14 Jul 2016, 5 commits
  3. 11 Jul 2016, 4 commits
    • KVM: VMX: introduce vm_{entry,exit}_control_reset_shadow · 8391ce44
      Authored by Paolo Bonzini
      There is no reason to read the entry/exit control fields of the
      VMCS and immediately write back the same value.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: keep preemption timer enabled during L2 execution · 9314006d
      Authored by Paolo Bonzini
      Because the vmcs12 preemption timer is emulated through a separate hrtimer,
      we can keep on using the preemption timer in the vmcs02 to emulate L1's
      TSC deadline timer.
      
      However, the corresponding bit in the pin-based execution control field
      must be kept consistent between vmcs01 and vmcs02.  On vmentry we copy
      it into the vmcs02; on vmexit the preemption timer must be disabled in
      the vmcs01 if a preemption timer vmexit happened while in guest mode.
      
      The preemption timer value in the vmcs02 is set by vmx_vcpu_run, so it
      need not be considered in prepare_vmcs02.
      
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Tested-by: Wanpeng Li <kernellwp@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: avoid incorrect preemption timer vmexit in nested guest · 55123e3c
      Authored by Wanpeng Li
      The preemption timer for nested VMX is emulated by an hrtimer which is
      started on L2 entry, stopped on L2 exit, and evaluated via the
      check_nested_events hook. However, nested_vmx_exit_handled always
      returns true for a preemption timer vmexit. The L1 preemption timer
      vmexit is then captured and treated as an L2 preemption timer vmexit,
      causing NULL pointer dereferences or worse in the L1 guest's vmexit
      handler:
      
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<          (null)>]           (null)
          PGD 0
          Oops: 0010 [#1] SMP
          Call Trace:
           ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm]
           handle_preemption_timer+0xe/0x20 [kvm_intel]
           vmx_handle_exit+0x169/0x15a0 [kvm_intel]
           ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
           kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm]
           ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
           ? vcpu_load+0x1c/0x60 [kvm]
           ? kvm_arch_vcpu_load+0x57/0x260 [kvm]
           kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
           do_vfs_ioctl+0x96/0x6a0
           ? __fget_light+0x2a/0x90
           SyS_ioctl+0x79/0x90
           do_syscall_64+0x68/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          Code:  Bad RIP value.
          RIP  [<          (null)>]           (null)
           RSP <ffff8800b5263c48>
          CR2: 0000000000000000
          ---[ end trace 9c70c48b1a2bc66e ]---
      
      This can be readily reproduced with the preemption timer enabled on L0
      and disabled on L1.
      
      Return false since preemption timer vmexits must never be reflected to L2.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Cc: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: reflect broken preemption timer in vmcs_config · 1c17c3e6
      Authored by Paolo Bonzini
      Simplify cpu_has_vmx_preemption_timer.  This is consistent with the
      rest of setup_vmcs_config and preparatory for the next patch.
      Tested-by: Wanpeng Li <kernellwp@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 05 Jul 2016, 1 commit
  5. 01 Jul 2016, 3 commits
  6. 24 Jun 2016, 3 commits
  7. 16 Jun 2016, 2 commits
  8. 14 Jun 2016, 1 commit
  9. 25 May 2016, 1 commit
    • kvm:vmx: more complete state update on APICv on/off · 3ce424e4
      Authored by Roman Kagan
      The function to update APICv on/off state (in particular, to deactivate
      it when enabling Hyper-V SynIC) is incomplete: it doesn't adjust
      APICv-related fields among secondary processor-based VM-execution
      controls.  As a result, Windows 2012 guests get stuck when SynIC-based
      auto-EOI interrupt intersected with e.g. an IPI in the guest.
      
      In addition, the MSR intercept bitmap isn't updated every time "virtualize
      x2APIC mode" is toggled.  This path can only be triggered by a malicious
      guest, because Windows doesn't use x2APIC but rather its own synthetic
      APIC access MSRs; however a guest running in a SynIC-enabled VM could
      switch to x2APIC and thus obtain direct access to host APIC MSRs
      (CVE-2016-4440).
      
      The patch fixes those omissions.
      Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
      Reported-by: Steve Rutherford <srutherford@google.com>
      Reported-by: Yang Zhang <yang.zhang.wz@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. 19 May 2016, 1 commit
  11. 29 Apr 2016, 1 commit
    • KVM: x86: fix ordering of cr0 initialization code in vmx_cpu_reset · f2463247
      Authored by Bruce Rogers
      Commit d28bc9dd reversed the order of two lines which initialize cr0,
      allowing the current (old) cr0 value to mess up vcpu initialization.
      This was observed in the checks for cr0 X86_CR0_WP bit in the context of
      kvm_mmu_reset_context(). Besides, setting vcpu->arch.cr0 after vmx_set_cr0()
      is completely redundant. Change the order back to ensure proper vcpu
      initialization.
      
      The combination of booting with ovmf firmware when guest vcpus > 1 and kvm's
      ept=N option being set results in a VM-entry failure. This patch fixes that.
      
      Fixes: d28bc9dd ("KVM: x86: INIT and reset sequences are different")
      Cc: stable@vger.kernel.org
      Signed-off-by: Bruce Rogers <brogers@suse.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  12. 28 Apr 2016, 1 commit
  13. 13 Apr 2016, 1 commit
  14. 22 Mar 2016, 6 commits
  15. 10 Mar 2016, 1 commit
    • KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo · 844a5fe2
      Authored by Paolo Bonzini
      Yes, all of these are needed. :) This is admittedly a bit odd, but
      kvm-unit-tests access.flat tests this if you run it with "-cpu host"
      and of course ept=0.
      
      KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
      specially when pte.u=1/pte.w=0/CR0.WP=0.  Such writes cause a fault
      when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
      When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
      restarts execution.  This will still cause a user write to fault, while
      supervisor writes will succeed.  User reads will fault spuriously now,
      and KVM will then flip U and W again in the SPTE (U=1, W=0).  User reads
      will be enabled and supervisor writes disabled, going back to the
      original situation where supervisor writes fault spuriously.
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0.  If the guest has not enabled NX, the result is a continuous
      stream of page faults due to the NX bit being reserved.
      
      The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
      switch.  (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry
      control, so they do not use user-return notifiers for EFER---if they did,
      EFER.NX would be forced to the same value as the host).
      
      There is another bug in the reserved bit check, which I've split to a
      separate patch for easier application to stable kernels.
      
      Cc: stable@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Fixes: f6577a5f
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  16. 09 Mar 2016, 1 commit
    • KVM: x86: disable MPX if host did not enable MPX XSAVE features · a87036ad
      Authored by Paolo Bonzini
      When eager FPU is disabled, KVM will still see the MPX bit in CPUID and
      presumably the MPX vmentry and vmexit controls.  However, it will not
      be able to expose the MPX XSAVE features to the guest, because the guest's
      accessible XSAVE features are always a subset of host_xcr0.
      
      In this case, we should disable the MPX CPUID bit, the BNDCFGS MSR,
      and the MPX vmentry and vmexit controls for nested virtualization.
      It is then unnecessary to enable guest eager FPU if the guest has the
      MPX CPUID bit set.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  17. 08 Mar 2016, 1 commit
    • KVM: VMX: disable PEBS before a guest entry · 7099e2e1
      Authored by Radim Krčmář
      Linux guests on Haswell (and also SandyBridge and Broadwell, at least)
      would crash if you decided to run a host command that uses PEBS, like
        perf record -e 'cpu/mem-stores/pp' -a
      
      This happens because KVM is using VMX MSR switching to disable PEBS, but
      SDM [2015-12] 18.4.4.4 Re-configuring PEBS Facilities explains why it
      isn't safe:
        When software needs to reconfigure PEBS facilities, it should allow a
        quiescent period between stopping the prior event counting and setting
        up a new PEBS event. The quiescent period is to allow any latent
        residual PEBS records to complete its capture at their previously
        specified buffer address (provided by IA32_DS_AREA).
      
      There might not be a quiescent period after the MSR switch, so a CPU
      ends up using host's MSR_IA32_DS_AREA to access an area in guest's
      memory.  (Or MSR switching is just buggy on some models.)
      
      The guest can learn something about the host this way:
      If the guest doesn't map address pointed by MSR_IA32_DS_AREA, it results
      in #PF where we leak host's MSR_IA32_DS_AREA through CR2.
      
      After that, a malicious guest can map and configure memory where
      MSR_IA32_DS_AREA is pointing and can therefore get an output from
      host's tracing.
      
      This is not a critical leak as the host must initiate PEBS tracing
      and I have not been able to get a record from more than one instruction
      before vmentry in vmx_vcpu_run() (that place has most registers already
      overwritten with guest's).
      
      We could disable PEBS just a few instructions before vmentry, but
      disabling it earlier shouldn't affect host tracing too much.
      We also don't need to switch MSR_IA32_PEBS_ENABLE on VMENTRY, but that
      optimization isn't worth its code, IMO.
      
      (If you are implementing PEBS for guests, be sure to handle the case
       where both host and guest enable PEBS, because this patch doesn't.)
      
      Fixes: 26a4f3c0 ("perf/x86: disable PEBS on a guest entry.")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jiří Olša <jolsa@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  18. 04 Mar 2016, 1 commit
  19. 02 Mar 2016, 1 commit
  20. 24 Feb 2016, 2 commits
  21. 23 Feb 2016, 1 commit