提交 · 4a58038b9e420276157785afa0a0bbb4b9bc2265 · openeuler / Kernel

16 4月, 2019 4 次提交

Revert "svm: Fix AVIC incomplete IPI emulation" · 4a58038b

由 Suthikulpanit, Suravee 提交于 3月 20, 2019

This reverts commit bb218fbc.

As Oren Twaig pointed out the old discussion:

  https://patchwork.kernel.org/patch/8292231/

that the change coud potentially cause an extra IPI to be sent to
the destination vcpu because the AVIC hardware already set the IRR bit
before the incomplete IPI #VMEXIT with id=1 (target vcpu is not running).
Since writting to ICR and ICR2 will also set the IRR. If something triggers
the destination vcpu to get scheduled before the emulation finishes, then
this could result in an additional IPI.

Also, the issue mentioned in the commit bb218fbc was misdiagnosed.

Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: NOren Twaig <oren@scalemp.com>
Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4a58038b

kvm: mmu: Fix overflow on kvm mmu page limit calculation · bc8a3d89

由 Ben Gardon 提交于 4月 08, 2019

KVM bases its memory usage limits on the total number of guest pages
across all memslots. However, those limits, and the calculations to
produce them, use 32 bit unsigned integers. This can result in overflow
if a VM has more guest pages that can be represented by a u32. As a
result of this overflow, KVM can use a low limit on the number of MMU
pages it will allocate. This makes KVM unable to map all of guest memory
at once, prompting spurious faults.

Tested: Ran all kvm-unit-tests on an Intel Haswell machine. This patch
	introduced no new failures.
Signed-off-by: NBen Gardon <bgardon@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bc8a3d89

KVM: nVMX: always use early vmcs check when EPT is disabled · 2b27924b

由 Paolo Bonzini 提交于 4月 15, 2019

The remaining failures of vmx.flat when EPT is disabled are caused by
incorrectly reflecting VMfails to the L1 hypervisor.  What happens is
that nested_vmx_restore_host_state corrupts the guest CR3, reloading it
with the host's shadow CR3 instead, because it blindly loads GUEST_CR3
from the vmcs01.

For simplicity let's just always use hardware VMCS checks when EPT is
disabled.  This way, nested_vmx_restore_host_state is not reached at
all (or at least shouldn't be reached).
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2b27924b

KVM: nVMX: allow tests to use bad virtual-APIC page address · 69090810

由 Paolo Bonzini 提交于 4月 15, 2019

As mentioned in the comment, there are some special cases where we can simply
clear the TPR shadow bit from the CPU-based execution controls in the vmcs02.
Handle them so that we can remove some XFAILs from vmx.flat.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

69090810

15 4月, 2019 1 次提交

KVM: x86/mmu: Fix an inverted list_empty() check when zapping sptes · cfd32acf

由 Sean Christopherson 提交于 4月 12, 2019

A recently introduced helper for handling zap vs. remote flush
incorrectly bails early, effectively leaking defunct shadow pages.
Manifests as a slab BUG when exiting KVM due to the shadow pages
being alive when their associated cache is destroyed.

==========================================================================
BUG kvm_mmu_page_header: Objects remaining in kvm_mmu_page_header on ...
--------------------------------------------------------------------------
Disabling lock debugging due to kernel taint
INFO: Slab 0x00000000fc436387 objects=26 used=23 fp=0x00000000d023caee ...
CPU: 6 PID: 4315 Comm: rmmod Tainted: G    B             5.1.0-rc2+ #19
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
 dump_stack+0x46/0x5b
 slab_err+0xad/0xd0
 ? on_each_cpu_mask+0x3c/0x50
 ? ksm_migrate_page+0x60/0x60
 ? on_each_cpu_cond_mask+0x7c/0xa0
 ? __kmalloc+0x1ca/0x1e0
 __kmem_cache_shutdown+0x13a/0x310
 shutdown_cache+0xf/0x130
 kmem_cache_destroy+0x1d5/0x200
 kvm_mmu_module_exit+0xa/0x30 [kvm]
 kvm_arch_exit+0x45/0x60 [kvm]
 kvm_exit+0x6f/0x80 [kvm]
 vmx_exit+0x1a/0x50 [kvm_intel]
 __x64_sys_delete_module+0x153/0x1f0
 ? exit_to_usermode_loop+0x88/0xc0
 do_syscall_64+0x4f/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: a2113634 ("KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap()")
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cfd32acf

06 4月, 2019 4 次提交

KVM: x86: nVMX: fix x2APIC VTPR read intercept · c73f4c99

由 Marc Orr 提交于 4月 01, 2019

Referring to the "VIRTUALIZING MSR-BASED APIC ACCESSES" chapter of the
SDM, when "virtualize x2APIC mode" is 1 and "APIC-register
virtualization" is 0, a RDMSR of 808H should return the VTPR from the
virtual APIC page.

However, for nested, KVM currently fails to disable the read intercept
for this MSR. This means that a RDMSR exit takes precedence over
"virtualize x2APIC mode", and KVM passes through L1's TPR to L2,
instead of sourcing the value from L2's virtual APIC page.

This patch fixes the issue by disabling the read intercept, in VMCS02,
for the VTPR when "APIC-register virtualization" is 0.

The issue described above and fix prescribed here, were verified with
a related patch in kvm-unit-tests titled "Test VMX's virtualize x2APIC
mode w/ nested".
Signed-off-by: NMarc Orr <marcorr@google.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Fixes: c992384b ("KVM: vmx: speed up MSR bitmap merge")
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c73f4c99

KVM: x86: nVMX: close leak of L0's x2APIC MSRs (CVE-2019-3887) · acff7847

由 Marc Orr 提交于 4月 01, 2019

The nested_vmx_prepare_msr_bitmap() function doesn't directly guard the
x2APIC MSR intercepts with the "virtualize x2APIC mode" MSR. As a
result, we discovered the potential for a buggy or malicious L1 to get
access to L0's x2APIC MSRs, via an L2, as follows.

1. L1 executes WRMSR(IA32_SPEC_CTRL, 1). This causes the spec_ctrl
variable, in nested_vmx_prepare_msr_bitmap() to become true.
2. L1 disables "virtualize x2APIC mode" in VMCS12.
3. L1 enables "APIC-register virtualization" in VMCS12.

Now, KVM will set VMCS02's x2APIC MSR intercepts from VMCS12, and then
set "virtualize x2APIC mode" to 0 in VMCS02. Oops.

This patch closes the leak by explicitly guarding VMCS02's x2APIC MSR
intercepts with VMCS12's "virtualize x2APIC mode" control.

The scenario outlined above and fix prescribed here, were verified with
a related patch in kvm-unit-tests titled "Add leak scenario to
virt_x2apic_mode_test".

Note, it looks like this issue may have been introduced inadvertently
during a merge---see 15303ba5.
Signed-off-by: NMarc Orr <marcorr@google.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

acff7847

KVM: SVM: prevent DBG_DECRYPT and DBG_ENCRYPT overflow · b86bc285

由 David Rientjes 提交于 3月 25, 2019

This ensures that the address and length provided to DBG_DECRYPT and
DBG_ENCRYPT do not cause an overflow.

At the same time, pass the actual number of pages pinned in memory to
sev_unpin_memory() as a cleanup.
Reported-by: NCfir Cohen <cfir@google.com>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b86bc285

kvm: svm: fix potential get_num_contig_pages overflow · ede885ec

由 David Rientjes 提交于 3月 19, 2019

get_num_contig_pages() could potentially overflow int so make its type
consistent with its usage.
Reported-by: NCfir Cohen <cfir@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ede885ec

05 4月, 2019 3 次提交

syscalls: Remove start and number from syscall_set_arguments() args · 32d92586

由 Steven Rostedt (VMware) 提交于 3月 27, 2019

After removing the start and count arguments of syscall_get_arguments() it
seems reasonable to remove them from syscall_set_arguments(). Note, as of
today, there are no users of syscall_set_arguments(). But we are told that
there will be soon. But for now, at least make it consistent with
syscall_get_arguments().

Link: http://lkml.kernel.org/r/20190327222014.GA32540@altlinux.org

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Dave Martin <dave.martin@arm.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: x86@kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: nios2-dev@lists.rocketboards.org
Cc: openrisc@lists.librecores.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
Reviewed-by: NDmitry V. Levin <ldv@altlinux.org>
Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>

32d92586

syscalls: Remove start and number from syscall_get_arguments() args · b35f549d

由 Steven Rostedt (Red Hat) 提交于 11月 07, 2016

At Linux Plumbers, Andy Lutomirski approached me and pointed out that the
function call syscall_get_arguments() implemented in x86 was horribly
written and not optimized for the standard case of passing in 0 and 6 for
the starting index and the number of system calls to get. When looking at
all the users of this function, I discovered that all instances pass in only
0 and 6 for these arguments. Instead of having this function handle
different cases that are never used, simply rewrite it to return the first 6
arguments of a system call.

This should help out the performance of tracing system calls by ptrace,
ftrace and perf.

Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Dave Martin <dave.martin@arm.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: x86@kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: nios2-dev@lists.rocketboards.org
Cc: openrisc@lists.librecores.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
Reviewed-by: NDmitry V. Levin <ldv@altlinux.org>
Reported-by: NAndy Lutomirski <luto@amacapital.net>
Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>

b35f549d

xen: Prevent buffer overflow in privcmd ioctl · 42d8644b

由 Dan Carpenter 提交于 4月 04, 2019

The "call" variable comes from the user in privcmd_ioctl_hypercall().
It's an offset into the hypercall_page[] which has (PAGE_SIZE / 32)
elements.  We need to put an upper bound on it to prevent an out of
bounds access.

Cc: stable@vger.kernel.org
Fixes: 1246ae0b ("xen: add variable hypercall caller")
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: NJuergen Gross <jgross@suse.com>

42d8644b

29 3月, 2019 12 次提交

x86/realmode: Make set_real_mode_mem() static inline · f560bd19

由 Matteo Croce 提交于 3月 28, 2019

Remove the unused @size argument and move it into a header file, so it
can be inlined.

 [ bp: Massage. ]
Signed-off-by: NMatteo Croce <mcroce@redhat.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NMukesh Ojha <mojha@codeaurora.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-efi <linux-efi@vger.kernel.org>
Cc: platform-driver-x86@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190328114233.27835-1-mcroce@redhat.com

f560bd19

KVM: x86: update %rip after emulating IO · 45def77e

由 Sean Christopherson 提交于 3月 11, 2019

Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.

To avoid corruping %rip after such a reset, commit 0967b7bf ("KVM:
Skip pio instruction when it is emulated, not executed") changed the
behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
instruction prior to exiting to userspace.  Full emulation doesn't need
such tricks becase re-emulating the instruction will naturally handle
%rip being changed to point at the reset vector.

Updating %rip prior to executing to userspace has several drawbacks:

  - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
    fails it will likely yell about the wrong address.
  - Single step exits to userspace for are effectively dropped as
    KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
  - Behavior of PIO emulation is different depending on whether it
    goes down the fast path or the slow path.

Rather than skip the PIO instruction before exiting to userspace,
snapshot the linear %rip and cancel PIO completion if the current
value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
common scenario, the snapshot and comparison has negligible overhead
as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
VMREAD in this case.

All other alternatives to snapshotting the linear %rip that don't
rely on an explicit reset announcenment suffer from one corner case
or another.  For example, canceling PIO completion on any write to
%rip fails if userspace does a save/restore of %rip, and attempting to
avoid that issue by canceling PIO only if %rip changed then fails if PIO
collides with the reset %rip.  Attempting to zero in on the exact reset
vector won't work for APs, which means adding more hooks such as the
vCPU's MP_STATE, and so on and so forth.

Checking for a linear %rip match technically suffers from corner cases,
e.g. userspace could theoretically rewrite the underlying code page and
expect a different instruction to execute, or the guest hardcodes a PIO
reset at 0xfffffff0, but those are far, far outside of what can be
considered normal operation.

Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
Cc: <stable@vger.kernel.org>
Reported-by: NJim Mattson <jmattson@google.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

45def77e

x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init · 013cc6eb

由 Vitaly Kuznetsov 提交于 3月 13, 2019

When userspace initializes guest vCPUs it may want to zero all supported
MSRs including Hyper-V related ones including HV_X64_MSR_STIMERn_CONFIG/
HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5 ("kvm/x86: Update SynIC
timers on guest entry only") we began doing stimer_mark_pending()
unconditionally on every config change.

The issue I'm observing manifests itself as following:
- Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as
  pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER;
- kvm_hv_has_stimer_pending() starts returning true;
- kvm_vcpu_has_events() starts returning true;
- kvm_arch_vcpu_runnable() starts returning true;
- when kvm_arch_vcpu_ioctl_run() gets into
  (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case:
  - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns
    immediately, avoiding normal wait path;
  - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing
    userspace to retry.

So instead of normal wait path we get a busy loop on all secondary vCPUs
before they get INIT signal. This seems to be undesirable, especially given
that this happens even when Hyper-V extensions are not used.

Generally, it seems to be pointless to mark an stimer as pending in
stimer_pending_bitmap and arm KVM_REQ_HV_STIMER as the only thing
kvm_hv_process_stimers() will do is clear the corresponding bit. We may
just not mark disabled timers as pending instead.

Fixes: f3b138c5 ("kvm/x86: Update SynIC timers on guest entry only")
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

013cc6eb

kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs · 2bdb76c0

由 Xiaoyao Li 提交于 3月 08, 2019

Since MSR_IA32_ARCH_CAPABILITIES is emualted unconditionally even if
host doesn't suppot it. We should move it to array emulated_msrs from
arry msrs_to_save, to report to userspace that guest support this msr.
Signed-off-by: NXiaoyao Li <xiaoyao.li@linux.intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2bdb76c0

KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 0cf9135b

由 Sean Christopherson 提交于 3月 07, 2019

The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host
userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
regardless of hardware support under the pretense that KVM fully
emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).

Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
that it's emulated on AMD hosts.

Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
Cc: stable@vger.kernel.org
Reported-by: NXiaoyao Li <xiaoyao.li@linux.intel.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0cf9135b

kvm: mmu: Used range based flushing in slot_handle_level_range · f285c633

由 Ben Gardon 提交于 3月 12, 2019

Replace kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address
in slot_handle_level_range. When range based flushes are not enabled
kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs.

This changes the behavior of many functions that indirectly use
slot_handle_level_range, iff the range based flushes are enabled. The
only potential problem I see with this is that kvm->tlbs_dirty will be
cleared less often, however the only caller of slot_handle_level_range that
checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which
checks it and does a kvm_flush_remote_tlbs after calling
kvm_unmap_hva_range anyway.

Tested: Ran all kvm-unit-tests on a Intel Haswell machine with and
without this patch. The patch introduced no new failures.
Signed-off-by: NBen Gardon <bgardon@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f285c633

KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region() · 4d66623c

由 Wei Yang 提交于 9月 27, 2018

* nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is
  non-zero.

* nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages()
  never return zero.

Based on these two reasons, we can merge the two *if* clause and use the
return value from kvm_mmu_calculate_mmu_pages() directly. This simplify
the code and also eliminate the possibility for reader to believe
nr_mmu_pages would be zero.
Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4d66623c

kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields · 711eff3a

由 Krish Sadhukhan 提交于 2月 07, 2019

According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
following check is performed on vmentry of L2 guests:

    On processors that support Intel 64 architecture, the IA32_SYSENTER_ESP
    field and the IA32_SYSENTER_EIP field must each contain a canonical
    address.
Signed-off-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: NMihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

711eff3a

KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation) · 05d5a486

由 Singh, Brijesh 提交于 2月 15, 2019

Errata#1096:

On a nested data page fault when CR.SMAP=1 and the guest data read
generates a SMAP violation, GuestInstrBytes field of the VMCB on a
VMEXIT will incorrectly return 0h instead the correct guest
instruction bytes .

Recommend Workaround:

To determine what instruction the guest was executing the hypervisor
will have to decode the instruction at the instruction pointer.

The recommended workaround can not be implemented for the SEV
guest because guest memory is encrypted with the guest specific key,
and instruction decoder will not be able to decode the instruction
bytes. If we hit this errata in the SEV guest then log the message
and request a guest shutdown.
Reported-by: NVenkatesh Srinivas <venkateshs@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

05d5a486

KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size' · 47c42e6b

由 Sean Christopherson 提交于 3月 07, 2019

The cr4_pae flag is a bit of a misnomer, its purpose is really to track
whether the guest PTE that is being shadowed is a 4-byte entry or an
8-byte entry. Prior to supporting nested EPT, the size of the gpte was
reflected purely by CR4.PAE. KVM fudged things a bit for direct sptes,
but it was mostly harmless since the size of the gpte never mattered.
Now that a spte may be tracking an indirect EPT entry, relying on
CR4.PAE is wrong and ill-named.

For direct shadow pages, force the gpte_size to '1' as they are always
8-byte entries; EPT entries can only be 8-bytes and KVM always uses
8-byte entries for NPT and its identity map (when running with EPT but
not unrestricted guest).

Likewise, nested EPT entries are always 8-bytes. Nested EPT presents a
unique scenario as the size of the entries are not dictated by CR4.PAE,
but neither is the shadow page a direct map. To handle this scenario,
set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination,
to denote a nested EPT shadow page. Use the information to avoid
incorrectly zapping an unsync'd indirect page in __kvm_sync_page().

Providing a consistent and accurate gpte_size fixes a bug reported by
Vitaly where fast_cr3_switch() always fails when switching from L2 to
L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages,
whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE.

Fixes: 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
Reported-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

47c42e6b

KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT · 552c69b1

由 Sean Christopherson 提交于 3月 07, 2019

Explicitly zero out quadrant and invalid instead of inheriting them from
the root_mmu. Functionally, this patch is a nop as we (should) never
set quadrant for a direct mapped (EPT) root_mmu and nested EPT is only
allowed if EPT is used for L1, and the root_mmu will never be invalid at
this point.

Explicitly setting flags sets the stage for repurposing the legacy
paging bits in role, e.g. nxe, cr0_wp, and sm{a,e}p_andnot_wp, at which
point 'smm' would be the only flag to be inherited from root_mmu.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

552c69b1

x86/cpufeature: Fix __percpu annotation in this_cpu_has() · f6027c81

由 Jann Horn 提交于 3月 28, 2019

&cpu_info.x86_capability is __percpu, and the second argument of
x86_this_cpu_test_bit() is expected to be __percpu. Don't cast the
__percpu away and then implicitly add it again. This gets rid of 106
lines of sparse warnings with the kernel config I'm using.
Signed-off-by: NJann Horn <jannh@google.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190328154948.152273-1-jannh@google.com

f6027c81

28 3月, 2019 3 次提交

x86/mm: Don't exceed the valid physical address space · 92c77f7c

由 Ralph Campbell 提交于 3月 25, 2019

valid_phys_addr_range() is used to sanity check the physical address range
of an operation, e.g., access to /dev/mem. It uses __pa(high_memory)
internally.

If memory is populated at the end of the physical address space, then
__pa(high_memory) is outside of the physical address space because:

   high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;

For the comparison in valid_phys_addr_range() this is not an issue, but if
CONFIG_DEBUG_VIRTUAL is enabled, __pa() maps to __phys_addr(), which
verifies that the resulting physical address is within the valid physical
address space of the CPU. So in the case that memory is populated at the
end of the physical address space, this is not true and triggers a
VIRTUAL_BUG_ON().

Use __pa(high_memory - 1) to prevent the conversion from going beyond
the end of valid physical addresses.

Fixes: be62a320 ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses")
Signed-off-by: NRalph Campbell <rcampbell@nvidia.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Craig Bergstrom <craigb@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Sean Young <sean@mess.org>

Link: https://lkml.kernel.org/r/20190326001817.15413-2-rcampbell@nvidia.com

92c77f7c

x86/retpolines: Disable switch jump tables when retpolines are enabled · a9d57ef1

由 Daniel Borkmann 提交于 3月 25, 2019

Commit ce02ef06 ("x86, retpolines: Raise limit for generating indirect
calls from switch-case") raised the limit under retpolines to 20 switch
cases where gcc would only then start to emit jump tables, and therefore
effectively disabling the emission of slow indirect calls in this area.

After this has been brought to attention to gcc folks [0], Martin Liska
has then fixed gcc to align with clang by avoiding to generate switch jump
tables entirely under retpolines. This is taking effect in gcc starting
from stable version 8.4.0. Given kernel supports compilation with older
versions of gcc where the fix is not being available or backported anymore,
we need to keep the extra KBUILD_CFLAGS around for some time and generally
set the -fno-jump-tables to align with what more recent gcc is doing
automatically today.

More than 20 switch cases are not expected to be fast-path critical, but
it would still be good to align with gcc behavior for versions < 8.4.0 in
order to have consistency across supported gcc versions. vmlinux size is
slightly growing by 0.27% for older gcc. This flag is only set to work
around affected gcc, no change for clang.

  [0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952Suggested-by: NMartin Liska <mliska@suse.cz>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Björn Töpel<bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Link: https://lkml.kernel.org/r/20190325135620.14882-1-daniel@iogearbox.net

a9d57ef1

x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y · bebd024e

由 Thomas Gleixner 提交于 3月 26, 2019

The SMT disable 'nosmt' command line argument is not working properly when
CONFIG_HOTPLUG_CPU is disabled. The teardown of the sibling CPUs which are
required to be brought up due to the MCE issues, cannot work. The CPUs are
then kept in a half dead state.

As the 'nosmt' functionality has become popular due to the speculative
hardware vulnerabilities, the half torn down state is not a proper solution
to the problem.

Enforce CONFIG_HOTPLUG_CPU=y when SMP is enabled so the full operation is
possible.
Reported-by: NTianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mukesh Ojha <mojha@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Micheal Kelley <michael.h.kelley@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190326163811.598166056@linutronix.de

bebd024e

27 3月, 2019 2 次提交

x86/realmode: Don't leak the trampoline kernel address · b929a500

由 Matteo Croce 提交于 3月 26, 2019

Since commit

  ad67b74d ("printk: hash addresses printed with %p")

at boot "____ptrval____" is printed instead of the trampoline addresses:

  Base memory trampoline at [(____ptrval____)] 99000 size 24576

Remove the print as we don't want to leak kernel addresses and this
statement is not needed anymore.

Fixes: ad67b74d ("printk: hash addresses printed with %p")
Signed-off-by: NMatteo Croce <mcroce@redhat.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190326203046.20787-1-mcroce@redhat.com

b929a500

x86/boot: Fix incorrect ifdeffery scope · 0f02daed

由 Baoquan He 提交于 3月 04, 2019

The declarations related to immovable memory handling are out of the
BOOT_COMPRESSED_MISC_H #ifdef scope, wrap them inside.
Signed-off-by: NBaoquan He <bhe@redhat.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Chao Fan <fanc.fnst@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190304055546.18566-1-bhe@redhat.com

0f02daed

25 3月, 2019 1 次提交

x86/resctrl: Remove unused variable · 7f2daa96

由 Peng Hao 提交于 3月 10, 2019

Variable "struct rdt_resource *r" is set but not used. So remove it.
Signed-off-by: NPeng Hao <peng.hao2@zte.com.cn>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1552152584-26087-1-git-send-email-peng.hao2@zte.com.cn

7f2daa96

23 3月, 2019 2 次提交

x86/gart: Exclude GART aperture from kcore · ffc8599a

由 Kairui Song 提交于 3月 08, 2019

On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range. Accessing the GART range via
/proc/kcore results in a kernel crash.

vmcore used to have the same issue, until it was fixed with commit
2a3e83c6 ("x86/gart: Exclude GART aperture from vmcore")', leveraging
existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
when attempting to read the aperture region, and so it won't read from the
actual memory.

Apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook functions introduced in the
previous vmcore fix. Just with some minor adjustment, rename some functions
for more general usage, and simplify the hook infrastructure a bit as there
is no module usage yet.
Suggested-by: NBaoquan He <bhe@redhat.com>
Signed-off-by: NKairui Song <kasong@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NJiri Bohac <jbohac@suse.cz>
Acked-by: NBaoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Dave Young <dyoung@redhat.com>
Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com

ffc8599a

x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error · e898e69d

由 Nathan Chancellor 提交于 3月 07, 2019

When building with -Wsometimes-uninitialized, Clang warns:

arch/x86/kernel/hw_breakpoint.c:355:2: warning: variable 'align' is used
uninitialized whenever switch default is taken
[-Wsometimes-uninitialized]

The default cannot be reached because arch_build_bp_info() initializes
hw->len to one of the specified cases. Nevertheless the warning is valid
and returning -EINVAL makes sure that this cannot be broken by future
modifications.
Suggested-by: NNick Desaulniers <ndesaulniers@google.com>
Signed-off-by: NNathan Chancellor <natechancellor@gmail.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: clang-built-linux@googlegroups.com
Link: https://github.com/ClangBuiltLinux/linux/issues/392
Link: https://lkml.kernel.org/r/20190307212756.4648-1-natechancellor@gmail.com

e898e69d

22 3月, 2019 1 次提交

x86/mm/pti: Make local symbols static · 4fe64a62

由 Valdis Kletnieks 提交于 3月 12, 2019

With 'make C=2 W=1', sparse and gcc both complain:

  CHECK   arch/x86/mm/pti.c
arch/x86/mm/pti.c:84:3: warning: symbol 'pti_mode' was not declared. Should it be static?
arch/x86/mm/pti.c:605:6: warning: symbol 'pti_set_kernel_image_nonglobal' was not declared. Should it be static?
  CC      arch/x86/mm/pti.o
arch/x86/mm/pti.c:605:6: warning: no previous prototype for 'pti_set_kernel_image_nonglobal' [-Wmissing-prototypes]
  605 | void pti_set_kernel_image_nonglobal(void)
      |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pti_set_kernel_image_nonglobal() is only used locally. 'pti_mode' exists in
drivers/hwtracing/intel_th/pti.c as well, but it's a completely unrelated
local (static) symbol.

Make both static.
Signed-off-by: NValdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/27680.1552376873@turing-police

4fe64a62

21 3月, 2019 7 次提交

x86/cpu/cyrix: Remove {get,set}Cx86_old macros used for Cyrix processors · 0f4d3aa7

由 Matthew Whitehead 提交于 3月 14, 2019

The getCx86_old() and setCx86_old() macros have been replaced with
correctly working getCx86() and setCx86(), so remove these unused macros.
Signed-off-by: NMatthew Whitehead <tedheadster@gmail.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: luto@kernel.org
Link: https://lkml.kernel.org/r/1552596361-8967-3-git-send-email-tedheadster@gmail.com

0f4d3aa7

x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors · 18fb053f

由 Matthew Whitehead 提交于 3月 14, 2019

There are comments in processor-cyrix.h advising you to _not_ make calls
using the deprecated macros in this style:

  setCx86_old(CX86_CCR4, getCx86_old(CX86_CCR4) | 0x80);

This is because it expands the macro into a non-functioning calling
sequence. The calling order must be:

  outb(CX86_CCR2, 0x22);
  inb(0x23);

From the comments:

 * When using the old macros a line like
 *   setCx86(CX86_CCR2, getCx86(CX86_CCR2) | 0x88);
 * gets expanded to:
 *  do {
 *    outb((CX86_CCR2), 0x22);
 *    outb((({
 *        outb((CX86_CCR2), 0x22);
 *        inb(0x23);
 *    }) | 0x88), 0x23);
 *  } while (0);

The new macros fix this problem, so use them instead. Tested on an
actual Geode processor.
Signed-off-by: NMatthew Whitehead <tedheadster@gmail.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: luto@kernel.org
Link: https://lkml.kernel.org/r/1552596361-8967-2-git-send-email-tedheadster@gmail.com

18fb053f

x86/microcode: Announce reload operation's completion · 9bd68125

由 Borislav Petkov 提交于 3月 13, 2019

By popular demand, issue a single line to dmesg after the reload
operation completes to let the user know that a reload has at least been
attempted.
Signed-off-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190313110022.8229-1-bp@alien8.de

9bd68125

x86/hyperv: Prevent potential NULL pointer dereference · 534c89c2

由 Kangjie Lu 提交于 3月 14, 2019

The page allocation in hv_cpu_init() can fail, but the code does not
have a check for that.

Add a check and return -ENOMEM when the allocation fails.

[ tglx: Massaged changelog ]
Signed-off-by: NKangjie Lu <kjlu@umn.edu>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NMukesh Ojha <mojha@codeaurora.org>
Acked-by: N"K. Y. Srinivasan" <kys@microsoft.com>
Cc: pakki001@umn.edu
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: linux-hyperv@vger.kernel.org
Link: https://lkml.kernel.org/r/20190314054651.1315-1-kjlu@umn.edu

534c89c2

x86/hpet: Prevent potential NULL pointer dereference · 2e84f116

由 Aditya Pakki 提交于 3月 18, 2019

hpet_virt_address may be NULL when ioremap_nocache fail, but the code lacks
a check.

Add a check to prevent NULL pointer dereference.
Signed-off-by: NAditya Pakki <pakki001@umn.edu>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: kjlu@umn.edu
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Roland Dreier <roland@purestorage.com>
Link: https://lkml.kernel.org/r/20190319021958.17275-1-pakki001@umn.edu

2e84f116

x86/lib: Fix indentation issue, remove extra tab · 725e29db

由 Colin Ian King 提交于 3月 14, 2019

The increment of buff is indented one level too deeply, clean
this up by removing a tab.
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: kernel-janitors@vger.kernel.org
Link: https://lkml.kernel.org/r/20190314230838.18256-1-colin.king@canonical.com

725e29db

x86/boot: Restrict header scope to make Clang happy · a9c640ac

由 Nick Desaulniers 提交于 3月 14, 2019

The inclusion of <linux/kernel.h> was causing issue as the definition of
__arch_hweight64 from arch/x86/include/asm/arch_hweight.h eventually gets
included. The definition is problematic when compiled with -m16 (all code
in arch/x86/boot/ is) as the "D" inline assembly constraint is rejected
by both compilers when passed an argument of type long long (regardless
of signedness, anything smaller is fine).

Because GCC performs inlining before semantic analysis, and
__arch_hweight64 is dead in this translation unit, GCC does not report
any issues at compile time.  Clang does the semantic analysis in the
front end, before inlining (run in the middle) can determine the code is
dead. I consider this another case of PR33587, which I think we can do
more work to solve.

It turns out that arch/x86/boot/string.c doesn't actually need
linux/kernel.h, simply linux/limits.h and linux/compiler.h.
Suggested-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NNick Desaulniers <ndesaulniers@google.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NNathan Chancellor <natechancellor@gmail.com>
Reviewed-by: NNathan Chancellor <natechancellor@gmail.com>
Cc: bp@alien8.de
Cc: niravd@google.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chao Fan <fanc.fnst@cn.fujitsu.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://bugs.llvm.org/show_bug.cgi?id=33587
Link: https://github.com/ClangBuiltLinux/linux/issues/347
Link: https://lkml.kernel.org/r/20190314221458.83047-1-ndesaulniers@google.com

a9c640ac

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功