1. 01 October 2015, 20 commits
  2. 28 September 2015, 1 commit
  3. 25 September 2015, 4 commits
    • KVM: disable halt_poll_ns as default for s390x · 920552b2
      Committed by David Hildenbrand
      We observed some performance degradation on s390x with dynamic
      halt polling. Until we can provide a proper fix, let's enable
      halt_poll_ns as default only for supported architectures.
      
      Architectures are now free to set their own halt_poll_ns
      default value.
      Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      920552b2
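      The per-architecture default is expressed through a constant in each
      architecture's KVM headers; the following is a minimal sketch of the
      idea (the KVM_HALT_POLL_NS_DEFAULT name and the x86 value follow KVM's
      usual conventions and are assumptions here, not a verbatim diff):

          /* arch/s390/include/asm/kvm_host.h (sketch): polling off by default */
          #define KVM_HALT_POLL_NS_DEFAULT 0

          /* arch/x86/include/asm/kvm_host.h (sketch): keep polling enabled */
          #define KVM_HALT_POLL_NS_DEFAULT 500000

          /* virt/kvm/kvm_main.c (sketch): module parameter picks up the arch value */
          unsigned int halt_poll_ns = KVM_HALT_POLL_NS_DEFAULT;
          module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);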
    • KVM: x86: fix off-by-one in reserved bits check · 58c95070
      Committed by Paolo Bonzini
      29ecd660 ("KVM: x86: avoid uninitialized variable warning",
      2015-09-06) introduced a not-so-subtle problem, which probably
      escaped review because it was not part of the patch context.
      
      Before the patch, leaf was always equal to iterator.level.  After,
      it is equal to iterator.level - 1 in the call to is_shadow_zero_bits_set,
      and when is_shadow_zero_bits_set does another "-1" the check on
      reserved bits becomes incorrect.  Using "iterator.level" in the call
      fixes this call trace:
      
      WARNING: CPU: 2 PID: 17000 at arch/x86/kvm/mmu.c:3385 handle_mmio_page_fault.part.93+0x1a/0x20 [kvm]()
      Modules linked in: tun sha256_ssse3 sha256_generic drbg binfmt_misc ipv6 vfat fat fuse dm_crypt dm_mod kvm_amd kvm crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd fam15h_power amd64_edac_mod k10temp edac_core amdkfd amd_iommu_v2 radeon acpi_cpufreq
      [...]
      Call Trace:
        dump_stack+0x4e/0x84
        warn_slowpath_common+0x95/0xe0
        warn_slowpath_null+0x1a/0x20
        handle_mmio_page_fault.part.93+0x1a/0x20 [kvm]
        tdp_page_fault+0x231/0x290 [kvm]
        ? emulator_pio_in_out+0x6e/0xf0 [kvm]
        kvm_mmu_page_fault+0x36/0x240 [kvm]
        ? svm_set_cr0+0x95/0xc0 [kvm_amd]
        pf_interception+0xde/0x1d0 [kvm_amd]
        handle_exit+0x181/0xa70 [kvm_amd]
        ? kvm_arch_vcpu_ioctl_run+0x68b/0x1730 [kvm]
        kvm_arch_vcpu_ioctl_run+0x6f6/0x1730 [kvm]
        ? kvm_arch_vcpu_ioctl_run+0x68b/0x1730 [kvm]
        ? preempt_count_sub+0x9b/0xf0
        ? mutex_lock_killable_nested+0x26f/0x490
        ? preempt_count_sub+0x9b/0xf0
        kvm_vcpu_ioctl+0x358/0x710 [kvm]
        ? __fget+0x5/0x210
        ? __fget+0x101/0x210
        do_vfs_ioctl+0x2f4/0x560
        ? __fget_light+0x29/0x90
        SyS_ioctl+0x4c/0x90
        entry_SYSCALL_64_fastpath+0x16/0x73
      ---[ end trace 37901c8686d84de6 ]---
      Reported-by: Borislav Petkov <bp@alien8.de>
      Tested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      58c95070
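      To make the off-by-one concrete, here is a sketch of the shadow walk,
      reconstructed from the changelog rather than copied from mmu.c, so
      treat the exact statements as approximate:

          u64 spte, sptes[PT64_ROOT_LEVEL];
          int root, leaf;
          bool reserved = false;

          for (shadow_walk_init(&iterator, vcpu, addr), leaf = root = iterator.level;
               shadow_walk_okay(&iterator);
               __shadow_walk_next(&iterator, spte)) {
                  spte = mmu_spte_get_lockless(iterator.sptep);

                  sptes[leaf - 1] = spte;
                  leaf--;                 /* from here on, leaf == iterator.level - 1 */

                  if (!is_shadow_present_pte(spte))
                          break;

                  /* the buggy call passed "leaf"; the helper subtracts one more level */
                  reserved |= is_shadow_zero_bits_set(&vcpu->arch.mmu, spte,
                                                      iterator.level);
          }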
    • KVM: x86: use correct page table format to check nested page table reserved bits · 6fec2144
      Committed by Paolo Bonzini
      Intel CPUID on AMD host or vice versa is a weird case, but it can
      happen.  Handle it by checking the host CPU vendor instead of the
      guest's in reset_tdp_shadow_zero_bits_mask.  For speed, the
      check uses the fact that Intel EPT has an X (executable) bit while
      AMD NPT has NX.
      Reported-by: Borislav Petkov <bp@alien8.de>
      Tested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6fec2144
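      The vendor test can be expressed entirely in terms of the shadow masks;
      a minimal sketch of the idea (reconstructed from the changelog, so the
      helper name is an assumption):

          /*
           * EPT defines an X (execute) mask; NPT defines NX instead, so an
           * all-zero shadow_x_mask identifies an AMD-style host regardless
           * of the CPUID the guest is shown.
           */
          static inline bool boot_cpu_is_amd(void)
          {
                  return shadow_x_mask == 0;
          }

      reset_tdp_shadow_zero_bits_mask() then builds the reserved-bit masks in
      the AMD64 page-table format when this returns true, and in the EPT
      format otherwise.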
    • KVM: svm: do not call kvm_set_cr0 from init_vmcb · 79a8059d
      Committed by Paolo Bonzini
      kvm_set_cr0 may want to call kvm_zap_gfn_range and thus access the
      memslots array (SRCU protected).  Using a mini SRCU critical section
      is ugly, and adding it to kvm_arch_vcpu_create doesn't work because
      the VMX vcpu_create callback calls synchronize_srcu.
      
      Fixes this lockdep splat:
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.3.0-rc1+ #1 Not tainted
      -------------------------------
      include/linux/kvm_host.h:488 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      rcu_scheduler_active = 1, debug_locks = 0
      1 lock held by qemu-system-i38/17000:
       #0:  (&(&kvm->mmu_lock)->rlock){+.+...}, at: kvm_zap_gfn_range+0x24/0x1a0 [kvm]
      
      [...]
      Call Trace:
       dump_stack+0x4e/0x84
       lockdep_rcu_suspicious+0xfd/0x130
       kvm_zap_gfn_range+0x188/0x1a0 [kvm]
       kvm_set_cr0+0xde/0x1e0 [kvm]
       init_vmcb+0x760/0xad0 [kvm_amd]
       svm_create_vcpu+0x197/0x250 [kvm_amd]
       kvm_arch_vcpu_create+0x47/0x70 [kvm]
       kvm_vm_ioctl+0x302/0x7e0 [kvm]
       ? __lock_is_held+0x51/0x70
       ? __fget+0x101/0x210
       do_vfs_ioctl+0x2f4/0x560
       ? __fget_light+0x29/0x90
       SyS_ioctl+0x4c/0x90
       entry_SYSCALL_64_fastpath+0x16/0x73
      Reported-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      79a8059d
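      For context, the "mini SRCU critical section" the changelog rejects
      would look roughly like the standard kvm->srcu pattern below (a sketch
      of the rejected alternative, not the code the patch adds; the CR0 value
      is the architectural reset value seen elsewhere in this log):

          int idx;

          /* Let kvm_zap_gfn_range() dereference the memslots safely. */
          idx = srcu_read_lock(&vcpu->kvm->srcu);
          kvm_set_cr0(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
          srcu_read_unlock(&vcpu->kvm->srcu, idx);

      The patch avoids the need for this by not calling kvm_set_cr0() from
      init_vmcb() in the first place.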
  4. 23 September 2015, 3 commits
    • x86, efi, kasan: #undef memset/memcpy/memmove per arch · 769a8089
      Committed by Andrey Ryabinin
      In code that must not be instrumented, KASAN replaces memset/memcpy/memmove
      with their non-instrumented analogues __memset/__memcpy/__memmove.
      
      However, on x86 the EFI stub is not linked with the kernel.  It uses
      the non-instrumented mem*() functions from arch/x86/boot/compressed/string.c,
      so we don't replace them with the __mem*() variants in the EFI stub.
      
      On ARM64 the EFI stub is linked with the kernel, so we should replace
      the mem*() functions with __mem*(), because the EFI stub runs before
      KASAN sets up the early shadow.
      
      So let's move these #undef mem* into the arch's asm/efi.h, which is also
      included by the EFI stub.
      
      Also, this will fix the warning in 32-bit build reported by kbuild test
      robot:
      
      	efi-stub-helper.c:599:2: warning: implicit declaration of function 'memcpy'
      
      [akpm@linux-foundation.org: use 80 cols in comment]
      Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Reported-by: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Matt Fleming <matt.fleming@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      769a8089
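      On x86 the change boils down to undoing KASAN's interception inside the
      arch header, so the stub keeps calling the boot-time string routines; a
      rough sketch of what arch/x86/include/asm/efi.h gains (reconstructed
      from the changelog, not a verbatim hunk):

          /*
           * The x86 EFI stub is built against arch/x86/boot/compressed/string.c
           * and is never instrumented, so undo KASAN's memcpy/memset/memmove
           * redirection here rather than in the generic stub sources.
           */
          #undef memcpy
          #undef memset
          #undef memmove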
    • x86/nmi/64: Fix a paravirt stack-clobbering bug in the NMI code · 83c133cf
      Committed by Andy Lutomirski
      The NMI entry code that switches to the normal kernel stack needs to
      be very careful not to clobber any extra stack slots on the NMI
      stack.  The code is fine under the assumption that SWAPGS is just a
      normal instruction, but that assumption isn't really true.  Use
      SWAPGS_UNSAFE_STACK instead.
      
      This is part of a fix for some random crashes that Sasha saw.
      
      Fixes: 9b6e6a83 ("x86/nmi/64: Switch stacks on userspace NMI entry")
      Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/974bc40edffdb5c2950a5c4977f821a446b76178.1442791737.git.luto@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      83c133cf
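      The distinction the fix relies on shows up in the macro definitions: in
      a non-paravirt build both macros are a plain swapgs (sketch of
      asm/irqflags.h below, with the paravirt behaviour summarised in the
      comment; treat the details as approximate):

          #define SWAPGS                  swapgs
          /*
           * Variant for entry paths that cannot trust the current stack, such
           * as NMI entry.  Under CONFIG_PARAVIRT, plain SWAPGS may expand to
           * an indirect call whose callee uses the stack, which is exactly
           * what would clobber the live NMI stack slots; SWAPGS_UNSAFE_STACK
           * must remain a bare swapgs in that situation.
           */
          #define SWAPGS_UNSAFE_STACK     swapgs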
    • x86/paravirt: Replace the paravirt nop with a bona fide empty function · fc57a7c6
      Committed by Andy Lutomirski
      PARAVIRT_ADJUST_EXCEPTION_FRAME generates this code (using nmi as an
      example, trimmed for readability):
      
          ff 15 00 00 00 00       callq  *0x0(%rip)        # 2796 <nmi+0x6>
                    2792: R_X86_64_PC32     pv_irq_ops+0x2c
      
      That's a call through a function pointer to a regular C function that
      does nothing on native boots, but that function isn't protected
      against kprobes, isn't marked notrace, and is certainly not
      guaranteed to preserve any registers if the compiler is feeling
      perverse.  This is bad news for a CLBR_NONE operation.
      
      Of course, if everything works correctly, once paravirt ops are
      patched, it gets nopped out, but what if we hit this code before
      paravirt ops are patched in?  This can potentially cause breakage
      that is very difficult to debug.
      
      A more subtle failure is possible here, too: if _paravirt_nop uses
      the stack at all (even just to push RBP), it will overwrite the "NMI
      executing" variable if it's called in the NMI prologue.
      
      The Xen case, perhaps surprisingly, is fine, because it's already
      written in asm.
      
      Fix all of the cases that default to paravirt_nop (including
      adjust_exception_frame) with a big hammer: replace paravirt_nop with
      an asm function that is just a ret instruction.
      
      The Xen case may have other problems, so document them.
      
      This is part of a fix for some random crashes that Sasha saw.
      Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/8f5d2ba295f9d73751c33d97fda03e0495d9ade0.1442791737.git.luto@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      fc57a7c6
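      The replacement described above is essentially a ret-only function
      emitted from top-level asm, so nothing the compiler might add can touch
      the stack; a sketch of the shape (reconstructed, not the exact hunk):

          /*
           * Define _paravirt_nop in asm so it is guaranteed to be a bare
           * "ret" with no prologue, no frame-pointer push, and nothing else
           * the compiler could insert behind our back.
           */
          extern void _paravirt_nop(void);
          asm (".pushsection .text, \"ax\"\n"
               ".global _paravirt_nop\n"
               "_paravirt_nop:\n\t"
               "ret\n\t"
               ".size _paravirt_nop, . - _paravirt_nop\n\t"
               ".type _paravirt_nop, @function\n\t"
               ".popsection");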
  5. 21 September 2015, 1 commit
  6. 18 September 2015, 1 commit
    • kvm: svm: reset mmu on VCPU reset · ebae871a
      Committed by Igor Mammedov
      When an INIT/SIPI sequence is sent to a VCPU that was previously in
      use by the OS, VMRUN might fail with:
      
       KVM: entry failed, hardware error 0xffffffff
       EAX=00000000 EBX=00000000 ECX=00000000 EDX=000006d3
       ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
       EIP=00000000 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
       ES =0000 00000000 0000ffff 00009300
       CS =9a00 0009a000 0000ffff 00009a00
       [...]
       CR0=60000010 CR2=b6f3e000 CR3=01942000 CR4=000007e0
       [...]
       EFER=0000000000000000
      
      with corresponding SVM error:
       KVM: FAILED VMRUN WITH VMCB:
       [...]
       cpl:            0                efer:         0000000000001000
       cr0:            0000000080010010 cr2:          00007fd7fe85bf90
       cr3:            0000000187d0c000 cr4:          0000000000000020
       [...]
      
      What happens is that the VCPU state right after offlining is:
      CR0: 0x80050033  EFER: 0xd01  CR4: 0x7e0
        -> long mode with CR3 pointing to longmode page tables
      
      and when the VCPU gets INIT/SIPI the following transition happens:
      CR0: 0 -> 0x60000010 EFER: 0x0  CR4: 0x7e0
        -> paging disabled with stale CR3
      
      However, under the hood SVM puts the VCPU in Paged Real Mode*, which
      effectively translates CR0 0x60000010 -> 0x80010010 after
      
         svm_vcpu_reset()
             -> init_vmcb()
                 -> kvm_set_cr0()
                     -> svm_set_cr0()
      
      but from kvm_set_cr0()'s perspective the CR0 transition 0 -> 0x60000010
      changes only caching bits, and
      commit d81135a5
       ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
      regressed svm_vcpu_reset(), which relied on the MMU being reset.
      
      As a result, VMRUN after svm_vcpu_reset() tries to run the VCPU in
      Paged Real Mode with a stale MMU context (long-mode page tables),
      which causes some AMD CPUs** to bail out with VMEXIT_INVALID.
      
      Fix the issue by unconditionally resetting the MMU context at
      init_vmcb() time.
      
      	* AMD64 Architecture Programmer’s Manual,
      	    Volume 2: System Programming, rev: 3.25
      	      15.19 Paged Real Mode
      	** Opteron 1216
      Signed-off-by: Igor Mammedov <imammedo@redhat.com>
      Fixes: d81135a5
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ebae871a
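      The fix boils down to an explicit MMU context reset on the SVM init
      path; a sketch of the shape (the exact placement inside init_vmcb() is
      inferred from the changelog):

          static void init_vmcb(struct vcpu_svm *svm)
          {
                  /* ... existing VMCB, segment and control-register setup ... */

                  /*
                   * Force a fresh MMU context so a stale long-mode paging
                   * setup cannot survive an INIT/SIPI reset of this VCPU.
                   */
                  kvm_mmu_reset_context(&svm->vcpu);
          }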
  7. 17 September 2015, 1 commit
  8. 16 September 2015, 6 commits
  9. 15 September 2015, 2 commits
    • x86/apic: Serialize LVTT and TSC_DEADLINE writes · 5d7c631d
      Committed by Shaohua Li
      The APIC LVTT register is MMIO mapped but the TSC_DEADLINE register is an
      MSR. The write to the TSC_DEADLINE MSR is not serializing, so it's not
      guaranteed that the write to LVTT has reached the APIC before the
      TSC_DEADLINE MSR is written. In such a case the write to the MSR is
      ignored and as a consequence the local timer interrupt never fires.
      
      The SDM describes this issue for xAPIC and x2APIC modes. The
      serialization methods recommended by the SDM differ.
      
      xAPIC:
       "1. Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b.
        2. WRMSR to the IA32_TSC_DEADLINE MSR a value much larger than current time-stamp counter.
        3. If RDMSR of the IA32_TSC_DEADLINE MSR returns zero, go to step 2.
        4. WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline."
      
      x2APIC:
       "To allow for efficient access to the APIC registers in x2APIC mode,
        the serializing semantics of WRMSR are relaxed when writing to the
        APIC registers. Thus, system software should not use 'WRMSR to APIC
        registers in x2APIC mode' as a serializing instruction. Read and write
        accesses to the APIC registers will occur in program order. A WRMSR to
        an APIC register may complete before all preceding stores are globally
        visible; software can prevent this by inserting a serializing
        instruction, an SFENCE, or an MFENCE before the WRMSR."
      
      The xAPIC method is to just wait for the memory mapped write to hit
      the LVTT by checking whether the MSR write has reached the hardware.
      There is no reason why a proper MFENCE after the memory mapped write would
      not do the same. Andi Kleen confirmed that MFENCE is sufficient for the
      xAPIC case as well.
      
      Issue MFENCE before writing to the TSC_DEADLINE MSR. This can be done
      unconditionally as all CPUs which have TSC_DEADLINE also have MFENCE
      support.
      
      [ tglx: Massaged the changelog ]
      Signed-off-by: Shaohua Li <shli@fb.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: <Kernel-team@fb.com>
      Cc: <lenb@kernel.org>
      Cc: <fenghua.yu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: stable@vger.kernel.org #v3.7+
      Link: http://lkml.kernel.org/r/20150909041352.GA2059853@devbig257.prn2.facebook.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      5d7c631d
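      Concretely, the fix amounts to issuing MFENCE immediately before the
      deadline MSR write; a minimal hypothetical sketch of the ordering (the
      set_tsc_deadline() helper is illustrative, not the actual apic.c
      function):

          /* Hypothetical helper illustrating the ordering fix. */
          static void set_tsc_deadline(u64 deadline)
          {
                  /*
                   * The earlier MMIO write that programmed LVTT into
                   * TSC-deadline mode must be globally visible before this
                   * non-serializing WRMSR, otherwise the deadline write may
                   * be ignored and the timer never fires.
                   */
                  asm volatile("mfence" : : : "memory");
                  wrmsrl(MSR_IA32_TSC_DEADLINE, deadline);
          }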
    • x86/ioapic: Force affinity setting in setup_ioapic_dest() · 4857c91f
      Committed by Thomas Gleixner
      The recent ioapic cleanups changed the affinity setting in
      setup_ioapic_dest() from a direct write to the hardware to the delayed
      affinity setup via irq_set_affinity().
      
      That results in a warning from chained_irq_exit():
      WARNING: CPU: 0 PID: 5 at kernel/irq/migration.c:32 irq_move_masked_irq
      [<ffffffff810a0a88>] irq_move_masked_irq+0xb8/0xc0
      [<ffffffff8103c161>] ioapic_ack_level+0x111/0x130
      [<ffffffff812bbfe8>] intel_gpio_irq_handler+0x148/0x1c0
      
      The reason is that irq_set_affinity() does not write directly to the
      hardware. It marks the affinity setting as pending and executes it
      from the next interrupt. The chained handler infrastructure does not
      take the irq descriptor lock for performance reasons because such a
      chained interrupt is not visible to any interfaces. So the delayed
      affinity setting triggers the warning in irq_move_masked_irq().
      
      Restore the old behaviour by calling the set_affinity function of the
      ioapic chip in setup_ioapic_dest(). This is safe as none of the
      interrupts can be on the fly at this point.
      
      Fixes: aa5cb97f 'x86/irq: Remove x86_io_apic_ops.set_affinity and related interfaces'
      Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: jarkko.nikula@linux.intel.com
      4857c91f
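      The restored behaviour is a direct call into the ioapic irq_chip rather
      than the deferred irq_set_affinity() path; a sketch of the relevant
      loop body in setup_ioapic_dest() (reconstructed from the changelog,
      details approximate; mask is the target affinity cpumask computed
      earlier in the function):

          struct irq_data *idata = irq_get_irq_data(irq);
          struct irq_chip *chip = irq_data_get_irq_chip(idata);

          /*
           * Write the affinity straight through the chip callback instead of
           * irq_set_affinity(): the latter only marks the move as pending,
           * which is what trips the irq_move_masked_irq() warning from the
           * chained handler.  Safe here because no interrupts are in flight
           * during setup.
           */
          if (chip->irq_set_affinity)
                  chip->irq_set_affinity(idata, mask, false);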
  10. 14 September 2015, 1 commit