  1. 25 May 2018 (1 commit)
  2. 24 May 2018 (2 commits)
    • KVM: x86: Update cpuid properly when CR4.OSXSAVE or CR4.PKE is changed · c4d21882
      Committed by Wei Huang
      The OSXSAVE (function=0x1) and OSPKE (func=0x7, leaf=0x0) CPUID bits
      allow user apps to detect whether the OS has set CR4.OSXSAVE or
      CR4.PKE. KVM is supposed to update these CPUID bits when CR4 is
      updated, but the current code misses some special cases when the
      update comes from the emulator. Here is one example:

        Step 1: guest boots
        Step 2: guest OS enables XSAVE ==> CR4.OSXSAVE=1 and CPUID.OSXSAVE=1
        Step 3: guest hot reboot ==> QEMU resets CR4 to 0, but CPUID.OSXSAVE==1
        Step 4: guest OS checks CPUID.OSXSAVE, sees 1, then executes xgetbv

      Step 4 causes a #UD and a guest crash because the guest OS hasn't
      turned on OSXSAVE yet. This patch solves the problem by comparing
      old_cr4 with the new cr4: if the related bits have changed,
      kvm_update_cpuid() needs to be called.
      Signed-off-by: Wei Huang <wei@redhat.com>
      Reviewed-by: Bandan Das <bsd@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
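
      A minimal sketch of such a check, assuming a kvm_set_cr4()-style entry
      point (simplified; only the comparison described above is shown, the
      validation and the actual CR4 write are elided):

        int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
        {
                unsigned long old_cr4 = kvm_read_cr4(vcpu);

                /* ... validate and install the new cr4 value ... */

                /* refresh CPUID.OSXSAVE / CPUID.OSPKE if the mirrored
                 * CR4 bits changed, regardless of who wrote CR4 */
                if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
                        kvm_update_cpuid(vcpu);

                return 0;
        }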
    • x86/kvm: fix LAPIC timer drift when guest uses periodic mode · d8f2f498
      Committed by David Vrabel
      Since 4.10, commit 8003c9ae (KVM: LAPIC: add APIC Timer
      periodic/oneshot mode VMX preemption timer support), guests using
      periodic LAPIC timers (such as FreeBSD 8.4) would see their timers
      drift significantly over time.
      
      Differences in the underlying clocks and numerical errors mean the
      periods of the two timers (hv and sw) are not the same. This
      difference accumulates with every expiry, resulting in a large
      error between the hv and sw timer.
      
      This means the sw timer may be running slow when compared to the hv
      timer. When the timer is switched from hv to sw, the now active sw
      timer will expire late. The guest VCPU is reentered and it switches to
      using the hv timer. This timer catches up, injecting multiple IRQs
      into the guest (of which the guest only sees one as it does not get to
      run until the hv timer has caught up) and thus the guest's timer rate
      is low (and becomes increasingly slow over time as the sw timer lags
      further and further behind).
      
      I believe a similar problem would occur if the hv timer is the slower
      one, but I have not observed this.
      
      Fix this by synchronizing the deadlines for both timers to the same
      time source on every tick. This prevents the errors from accumulating.
      
      Fixes: 8003c9ae
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: David Vrabel <david.vrabel@nutanix.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
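
      A sketch of how that synchronization might look when the periodic
      timer is re-armed (simplified; assumes KVM's lapic timer fields and
      helpers such as kvm_read_l1_tsc() and nsec_to_cycles()):

        static void advance_periodic_target_expiration(struct kvm_lapic *apic)
        {
                ktime_t now = ktime_get();
                ktime_t delta;

                /* advance the deadline from the previous target, not from
                 * "now", and derive both timers' deadlines from it so the
                 * per-tick errors cannot accumulate */
                apic->lapic_timer.target_expiration =
                        ktime_add_ns(apic->lapic_timer.target_expiration,
                                     apic->lapic_timer.period);
                delta = ktime_sub(apic->lapic_timer.target_expiration, now);
                apic->lapic_timer.tscdeadline =
                        kvm_read_l1_tsc(apic->vcpu, rdtsc()) +
                        nsec_to_cycles(apic->vcpu, delta);
        }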
  3. 18 May 2018 (1 commit)
    • kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME · 633711e8
      Committed by Michael S. Tsirkin
      KVM_HINTS_DEDICATED seems to be somewhat confusing:

      A guest doesn't really care whether it's the only task running on a
      host CPU as long as it's not preempted.

      And there are more reasons for a guest to be preempted than host CPU
      sharing; for example, with memory overcommit it can get preempted on
      a memory access, post-copy migration can cause preemption, etc.

      Let's call it KVM_HINTS_REALTIME, which seems to better match what
      guests expect.

      Also, the flag must be set on all vCPUs; current guests assume this.
      Note this in the documentation.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
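
      On the guest side, the hint gates paravirt slow paths; a sketch of
      such a check (kvm_para_has_hint() is the x86 guest-side accessor; the
      body is illustrative):

        if (kvm_para_has_hint(KVM_HINTS_REALTIME)) {
                /* vCPUs should never be preempted for long: spinning is
                 * cheap, so pv spinlock/TLB-flush slow paths can be
                 * skipped */
        }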
  4. 17 May 2018 (6 commits)
    • KVM: s390: vsie: fix < 8k check for the itdba · f4a551b7
      Committed by David Hildenbrand
      Because of a missing "L" suffix, we might detect some addresses as
      being < 8k although they are not.
      
      e.g. for itdba = 100001fff
      !(gpa & ~0x1fffU) -> 1
      !(gpa & ~0x1fffUL) -> 0
      
      So we would report a SIE validity intercept although everything is fine.
      
      Fixes: 166ecb3d ("KVM: s390: vsie: support transactional execution")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
      Cc: stable@vger.kernel.org # v4.8+
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
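
      The integer-promotion trap is easy to reproduce outside the kernel; a
      standalone demo (assumes an LP64 target, where unsigned long is
      64-bit):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
                uint64_t gpa = 0x100001fffULL;

                /* ~0x1fffU is only a 32-bit mask; zero-extending it to 64
                 * bits throws away gpa's high bits, faking a "< 8k" hit */
                printf("%d\n", !(gpa & ~0x1fffU));   /* prints 1 (wrong) */
                printf("%d\n", !(gpa & ~0x1fffUL));  /* prints 0 (right) */
                return 0;
        }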
    • KVM: PPC: Book3S HV: Do ptesync in radix guest exit path · df158189
      Committed by Paul Mackerras
      A radix guest can execute tlbie instructions to invalidate TLB entries.
      After a tlbie or a group of tlbies, it must then do the architected
      sequence eieio; tlbsync; ptesync to ensure that the TLB invalidation
      has been processed by all CPUs in the system before it can rely on
      no CPU using any translation that it just invalidated.
      
      In fact it is the ptesync which does the actual synchronization in
      this sequence, and hardware has a requirement that the ptesync must
      be executed on the same CPU thread as the tlbies which it is expected
      to order.  Thus, if a vCPU gets moved from one physical CPU to
      another after it has done some tlbies but before it can get to do the
      ptesync, the ptesync will not have the desired effect when it is
      executed on the second physical CPU.
      
      To fix this, we do a ptesync in the exit path for radix guests.  If
      there are any pending tlbies, this will wait for them to complete.
      If there aren't, then ptesync will just do the same as sync.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
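
      The exit-path barrier itself can be as simple as the following (the
      actual change lives in the assembly guest-exit code; this is the
      C-level equivalent):

        /* wait for any tlbies the guest issued on this CPU thread to
         * complete; ptesync degenerates to sync if none are pending */
        asm volatile("ptesync" : : : "memory");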
    • KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority change · 9dc81d6b
      Committed by Benjamin Herrenschmidt
      When a vcpu priority (CPPR) is set to a lower value (masking more
      interrupts), we stop processing interrupts already in the queue
      for the priorities that have now been masked.
      
      If those interrupts were previously re-routed to a different
      CPU, they might still be stuck until the old target CPU, which
      has them in its queue, processes them. In the case of guest CPU
      unplug, that may be never.
      
      To address that without creating additional overhead for
      the normal interrupt processing path, this changes H_CPPR
      handling so that when such a priority change occurs, we
      scan the interrupt queue for that vCPU, and for any
      interrupt in there that has been re-routed, we replace it
      with a dummy and force a re-trigger.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
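
      In pseudo-C, the described H_CPPR handling looks roughly like this
      (all helper names here are invented for illustration, not the patch's
      actual functions):

        if (new_cppr < old_cppr) {      /* lowering CPPR masks more irqs */
                /* scan this vCPU's queue for the newly masked levels */
                queue_for_each_irq(xc, prio, irq) {
                        if (irq_was_rerouted(irq, xc)) {
                                /* neutralize the stale entry and force the
                                 * interrupt to fire on its new target */
                                replace_with_dummy(xc, prio, irq);
                                force_retrigger(irq);
                        }
                }
        }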
    • KVM: PPC: Book3S HV: Make radix clear pte when unmapping · 7e3d9a1d
      Committed by Nicholas Piggin
      The current partition table unmap code clears the _PAGE_PRESENT bit
      out of the pte, which leaves pud_huge/pmd_huge true and does not
      clear pud_present/pmd_present.  This can confuse subsequent page
      faults and possibly lead to the guest looping doing continual
      hypervisor page faults.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
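
      Presumably the unmap path changes from clearing one bit to clearing
      the whole pte, along these lines (a sketch; the clear-mask argument
      of kvmppc_radix_update_pte() widened from _PAGE_PRESENT to all bits):

        /* before: leaves pmd_present()/pud_present() and huge bits set */
        old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_PRESENT, 0, gpa, shift);
        /* after: clear every bit so later faults see an empty entry */
        old = kvmppc_radix_update_pte(kvm, ptep, ~0UL, 0, gpa, shift);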
    • KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in kvmppc_radix_tlbie_page · e2560b10
      Committed by Nicholas Piggin
      The standard eieio ; tlbsync ; ptesync must follow tlbie to ensure it
      is ordered with respect to subsequent operations.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
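
      A sketch of the full sequence in kvmppc_radix_tlbie_page()
      (simplified; the PPC_TLBIE_5 operands are indicative only):

        asm volatile("ptesync" : : : "memory");
        asm volatile(PPC_TLBIE_5(%0, %1, 0, 0, 1)
                     : : "r" (rb), "r" (kvm->arch.lpid) : "memory");
        /* order the invalidation against everything that follows */
        asm volatile("eieio; tlbsync; ptesync" : : : "memory");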
    • KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry · 57b8daa7
      Committed by Paul Mackerras
      Currently, the HV KVM guest entry/exit code adds the timebase offset
      from the vcore struct to the timebase on guest entry, and subtracts
      it on guest exit.  That is fine, except that it is possible for
      userspace to change the offset using the SET_ONE_REG interface while
      the vcore is running, as there is only one timebase offset per vcore
      but potentially multiple VCPUs in the vcore.  If that were to happen,
      KVM would subtract a different offset on guest exit from the one it
      had added on guest entry, leading to the timebase being out of sync
      between cores in the host, which then leads to bad things happening
      such as hangs and spurious watchdog timeouts.
      
      To fix this, we add a new field 'tb_offset_applied' to the vcore struct
      which stores the offset that is currently applied to the timebase.
      This value is set from the vcore tb_offset field on guest entry, and
      is what is subtracted from the timebase on guest exit.  Since it is
      zero when the timebase offset is not applied, we can simplify the
      logic in kvmhv_start_timing and kvmhv_accumulate_time.
      
      In addition, we had secondary threads reading the timebase while
      running concurrently with code on the primary thread which would
      eventually add or subtract the timebase offset from the timebase.
      This occurred while saving or restoring the DEC register value on
      the secondary threads.  Although no specific incorrect behaviour has
      been observed, this is a race which should be fixed.  To fix it, we
      move the DEC saving code to just before we call kvmhv_commence_exit,
      and the DEC restoring code to after the point where we have waited
      for the primary thread to switch the MMU context and add the timebase
      offset.  That way we are sure that the timebase contains the guest
      timebase value in both cases.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
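
      In pseudo-C, the entry/exit pairing becomes (the real code is
      assembly in book3s_hv_rmhandlers.S; write_tb()/read_tb() are
      hypothetical stand-ins):

        /* guest entry: snapshot the offset that is actually applied */
        vc->tb_offset_applied = vc->tb_offset;
        write_tb(read_tb() + vc->tb_offset);

        /* guest exit: subtract exactly what was applied, even if
         * userspace changed vc->tb_offset via SET_ONE_REG meanwhile */
        write_tb(read_tb() - vc->tb_offset_applied);
        vc->tb_offset_applied = 0;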
  5. 15 May 2018 (2 commits)
  6. 11 May 2018 (4 commits)
  7. 06 May 2018 (1 commit)
    • KVM: x86: remove APIC Timer periodic/oneshot spikes · ecf08dad
      Committed by Anthoine Bourgeois
      Since commit 8003c9ae ("KVM: LAPIC: add APIC Timer periodic/oneshot
      mode VMX preemption timer support"), a Windows 10 guest shows some
      erratic timer spikes.
      
      Here are the results of 150000 runs of a 1ms timer without any load:

                    Before 8003c9ae | After 8003c9ae
        Max              1834us     |     86000us
        Mean             1100us     |      1021us
        Deviation          59us     |       149us

      Here are the results of 150000 runs of a 1ms timer with a cpu-z
      stress test:

                    Before 8003c9ae | After 8003c9ae
        Max             32000us     |    140000us
        Mean             1006us     |      1997us
        Deviation        140us     |     11095us
      
      The root cause of the problem is that starting an hrtimer with an
      expiry time already in the past can take more than 20 milliseconds
      to trigger the timer function.  This is solved by firing such
      expired timers immediately, rather than submitting them to
      hrtimer_start().  In case the timer is periodic, the target
      expiration is advanced first and hrtimer_start() is called with the
      new deadline.
      
      v2: Check if the tsc deadline is already expired. Thank you Mika.
      v3: Execute the past timers immediately rather than submitting them to
      hrtimer_start().
      v4: Rearm the periodic timer with advance_periodic_target_expiration() a
      simpler version of set_target_expiration(). Thank you Paolo.
      
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com>
      Fixes: 8003c9ae ("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support")
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
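
      A sketch of the resulting re-arm logic, assuming the lapic timer
      helpers named in the changelog (apic_timer_expired(),
      advance_periodic_target_expiration()):

        static void start_sw_period(struct kvm_lapic *apic)
        {
                if (!apic->lapic_timer.period)
                        return;

                /* deadline already passed: inject now, don't arm an
                 * already-expired hrtimer */
                if (ktime_after(ktime_get(),
                                apic->lapic_timer.target_expiration)) {
                        apic_timer_expired(apic);
                        if (apic_lvtt_oneshot(apic))
                                return;
                        advance_periodic_target_expiration(apic);
                }

                hrtimer_start(&apic->lapic_timer.timer,
                              apic->lapic_timer.target_expiration,
                              HRTIMER_MODE_ABS_PINNED);
        }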
  8. 04 May 2018 (2 commits)
    • arm64: vgic-v2: Fix proxying of cpuif access · b220244d
      Committed by James Morse
      Proxying the cpuif accesses at EL2 makes use of vcpu_data_guest_to_host
      and co, which check the endianness, which call into vcpu_read_sys_reg...
      which isn't mapped at EL2 (it was inlined before, and got moved OoL
      with the VHE optimizations).
      
      The result is of course a nice panic. Let's add some specialized
      cruft to keep the broken platforms that require this hack alive.
      
      But this code used vcpu_data_guest_to_host(), which expected us to
      write the value to host memory. Instead, we have trapped the guest's
      read or write to an mmio-device, and are about to replay it using the
      host's readl()/writel(), which also perform swabbing based on the
      host endianness. This goes wrong when both host and guest are
      big-endian, as readl()/writel() will undo the guest's swabbing,
      causing the big-endian value to be written to device-memory.
      
      What needs doing?
      A big-endian guest will have pre-swabbed data before storing; undo
      this. If it's necessary for the host, writel() will re-swab it.

      For a read, a big-endian guest expects to swab the data after the
      load. The host's readl() will correct for host endianness, giving us
      the device-memory's value in the register. For a big-endian guest,
      swab it as if we'd only done the load.
      
      For a little-endian guest, nothing needs doing as readl()/writel() leave
      the correct device-memory value in registers.
      
      Tested on Juno with that rarest of things: a big-endian 64K host.
      Based on a patch from Marc Zyngier.
      Reported-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Fixes: bf8feb39 ("arm64: KVM: vgic-v2: Add GICV access from HYP")
      Signed-off-by: James Morse <james.morse@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
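
      A sketch of the proxy path after the fix (simplified; the endianness
      test is shown as a plain helper, while the real patch open-codes it
      so that it is usable at EL2):

        if (is_write) {
                u32 data = vcpu_get_reg(vcpu, rd);

                if (vcpu_is_big_endian(vcpu))
                        data = swab32(data);    /* undo the guest's swab */
                writel_relaxed(data, addr);     /* host swab, if any */
        } else {
                u32 data = readl_relaxed(addr); /* host-order value */

                if (vcpu_is_big_endian(vcpu))
                        data = swab32(data);    /* what a BE load yields */
                vcpu_set_reg(vcpu, rd, data);
        }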
    • KVM: arm64: Fix order of vcpu_write_sys_reg() arguments · 1975fa56
      Committed by James Morse
      A typo in kvm_vcpu_set_be()'s call:
      | vcpu_write_sys_reg(vcpu, SCTLR_EL1, sctlr)
      causes us to use the 32-bit register value as an index into the
      sys_reg[] array, and sail off the end of the linear map when we try
      to bring up big-endian secondaries.
      
      | Unable to handle kernel paging request at virtual address ffff80098b982c00
      | Mem abort info:
      |  ESR = 0x96000045
      |  Exception class = DABT (current EL), IL = 32 bits
      |   SET = 0, FnV = 0
      |   EA = 0, S1PTW = 0
      | Data abort info:
      |   ISV = 0, ISS = 0x00000045
      |   CM = 0, WnR = 1
      | swapper pgtable: 4k pages, 48-bit VAs, pgdp = 000000002ea0571a
      | [ffff80098b982c00] pgd=00000009ffff8803, pud=0000000000000000
      | Internal error: Oops: 96000045 [#1] PREEMPT SMP
      | Modules linked in:
      | CPU: 2 PID: 1561 Comm: kvm-vcpu-0 Not tainted 4.17.0-rc3-00001-ga912e2261ca6-dirty #1323
      | Hardware name: ARM Juno development board (r1) (DT)
      | pstate: 60000005 (nZCv daif -PAN -UAO)
      | pc : vcpu_write_sys_reg+0x50/0x134
      | lr : vcpu_write_sys_reg+0x50/0x134
      
      | Process kvm-vcpu-0 (pid: 1561, stack limit = 0x000000006df4728b)
      | Call trace:
      |  vcpu_write_sys_reg+0x50/0x134
      |  kvm_psci_vcpu_on+0x14c/0x150
      |  kvm_psci_0_2_call+0x244/0x2a4
      |  kvm_hvc_call_handler+0x1cc/0x258
      |  handle_hvc+0x20/0x3c
      |  handle_exit+0x130/0x1ec
      |  kvm_arch_vcpu_ioctl_run+0x340/0x614
      |  kvm_vcpu_ioctl+0x4d0/0x840
      |  do_vfs_ioctl+0xc8/0x8d0
      |  ksys_ioctl+0x78/0xa8
      |  sys_ioctl+0xc/0x18
      |  el0_svc_naked+0x30/0x34
      | Code: 73620291 604d00b0 00201891 1ab10194 (957a33f8)
      |---[ end trace 4b4a4f9628596602 ]---
      
      Fix the order of the arguments.
      
      Fixes: 8d404c4c ("KVM: arm64: Rewrite system register accessors to read/write functions")
      Cc: Christoffer Dall <cdall@cs.columbia.edu>
      Signed-off-by: James Morse <james.morse@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
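
      Given the prototype vcpu_write_sys_reg(vcpu, val, reg), the corrected
      call is simply:

        vcpu_write_sys_reg(vcpu, sctlr, SCTLR_EL1);  /* value, then index */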
  9. 03 May 2018 (4 commits)
    • parisc: Fix section mismatches · 8d73b180
      Committed by Helge Deller
      Fix three section mismatches:
      1) Section mismatch in reference from the function ioread8() to the
         function .init.text:pcibios_init_bridge()
      2) Section mismatch in reference from the function free_initmem() to the
         function .init.text:map_pages()
      3) Section mismatch in reference from the function ccio_ioc_init() to
         the function .init.text:count_parisc_driver()
      Signed-off-by: Helge Deller <deller@gmx.de>
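
      For context, this is the general shape of such a mismatch
      (illustrative code, not the actual parisc diff; the usual fix is to
      drop __init from the callee or mark the caller __init):

        static void __init map_pages(unsigned long addr) { /* ... */ }

        void free_initmem(void)   /* resident .text, may run after boot */
        {
                map_pages(0);     /* modpost: reference to .init.text */
        }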
    • parisc: drivers.c: Fix section mismatches · b819439f
      Committed by Helge Deller
      Fix two section mismatches in drivers.c:
      1) Section mismatch in reference from the function alloc_tree_node() to
         the function .init.text:create_tree_node().
      2) Section mismatch in reference from the function walk_native_bus() to
         the function .init.text:alloc_pa_dev().
      Signed-off-by: Helge Deller <deller@gmx.de>
    • bpf, x64: fix memleak when not converging on calls · 39f56ca9
      Committed by Daniel Borkmann
      The JIT logic in jit_subprogs() is as follows: for all subprogs we
      allocate a bpf_prog_alloc(), populate it (prog->is_func = 1 here),
      and pass it to bpf_int_jit_compile(). If a failure occurred during
      JIT and prog->jited is not set, then we bail out from attempting to
      JIT the whole program, and punt to the interpreter instead. In case
      JITing went successfully, we fix up BPF call offsets and do another
      pass through bpf_int_jit_compile() (with extra_pass true at that
      point) to complete JITing of the calls. Since that requires passing
      the JIT context around, addrs and jit_data from the x86 JIT are
      freed in the extra_pass of bpf_int_jit_compile() when calls are
      involved (if not, they can be freed immediately). However, if the
      JIT image didn't converge in the original pass, then image is NULL
      while prog->is_func is set and extra_pass is false, meaning both
      allocations become unreachable and are never cleaned up; therefore
      we need to free them on !image as well. Only the x64 JIT is
      affected.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
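
      The cleanup condition presumably gains a !image case, roughly:

        /* sketch of the exit path in bpf_int_jit_compile(): also release
         * the bookkeeping when the image never converged */
        if (!image || !prog->is_func || extra_pass) {
                kfree(addrs);
                kfree(jit_data);
                prog->aux->jit_data = NULL;
        }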
    • bpf, x64: fix memleak when not converging after image · 3aab8884
      Committed by Daniel Borkmann
      While reviewing x64 JIT code, I noticed that we leak the prior
      allocated JIT image in the case where proglen != oldproglen during
      the JIT passes. Prior to commit e0ee9c12 ("x86: bpf_jit: fix two
      bugs in eBPF JIT compiler") we would just break out of the loop and
      use the image as the JITed prog, since it could only shrink in size
      anyway. After e0ee9c12, we bail out to the out_addrs label where we
      free addrs and jit_data, but not the image coming from
      bpf_jit_binary_alloc().
      
      Fixes: e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
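
      A sketch of the fixed bail-out (bpf_jit_binary_free() releases what
      bpf_jit_binary_alloc() handed out):

        if (proglen != oldproglen) {
                pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
                       proglen, oldproglen);
                bpf_jit_binary_free(header);   /* was leaked before */
                prog = orig_prog;
                goto out_addrs;
        }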
  10. 02 May 2018 (5 commits)
  11. 01 May 2018 (2 commits)
  12. 28 April 2018 (1 commit)
  13. 27 April 2018 (8 commits)
    • kvm: apic: Flush TLB after APIC mode/address change if VPIDs are in use · a468f2db
      Committed by Junaid Shahid
      Currently, KVM flushes the TLB after a change to the APIC access page
      address or the APIC mode when EPT mode is enabled. However, even in
      shadow paging mode, a TLB flush is needed if VPIDs are being used, as
      specified in the Intel SDM Section 29.4.5.
      
      So replace vmx_flush_tlb_ept_only() with vmx_flush_tlb(), which will
      flush if either EPT or VPIDs are in use.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • x86/entry/64/compat: Preserve r8-r11 in int $0x80 · 8bb2610b
      Committed by Andy Lutomirski
      32-bit user code that uses int $0x80 doesn't care about r8-r11.
      There is, however, some 64-bit user code that intentionally uses
      int $0x80 to invoke 32-bit system calls.  From what I've seen,
      basically all such code assumes that r8-r15 are all preserved, but
      the kernel clobbers r8-r11.  Since I doubt that there's any code
      that depends on int $0x80 zeroing r8-r11, change the kernel to
      preserve them.
      
      I suspect that very little user code is broken by the old clobber,
      since r8-r11 are only rarely allocated by gcc, and they're clobbered
      by function calls, so the only way we'd see a problem is if the same
      function that invokes int $0x80 also spills something important to
      one of these registers.
      
      The current behavior seems to date back to the historical commit
      "[PATCH] x86-64 merge for 2.6.4".  Before that, all regs were
      preserved.  I can't find any explanation of why this change was made.
      
      Update the test_syscall_vdso_32 testcase as well to verify the new
      behavior, and strengthen the test to make sure that the kernel
      doesn't accidentally permute r8..r15.
      Suggested-by: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Link: https://lkml.kernel.org/r/d4c4d9985fbe64f8c9e19291886453914b48caee.1523975710.git.luto@kernel.org
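
      A minimal userspace probe of the new guarantee (assumes x86-64 Linux
      with IA32 emulation; 20 is getpid in the 32-bit syscall table):

        #include <stdio.h>

        int main(void)
        {
                long ret;
                register long r8 asm("r8") = 0x1122334455667788L;

                asm volatile("int $0x80"
                             : "=a" (ret), "+r" (r8)
                             : "a" (20)       /* 32-bit __NR_getpid */
                             : "r9", "r10", "r11", "memory");
                printf("pid=%ld, r8 %s\n", ret,
                       r8 == 0x1122334455667788L ? "preserved"
                                                 : "clobbered");
                return 0;
        }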
    • x86/ipc: Fix x32 version of shmid64_ds and msqid64_ds · 1a512c08
      Committed by Arnd Bergmann
      A bugfix broke the x32 shmid64_ds and msqid64_ds data structure
      layout (as seen from user space) a few years ago: originally,
      __BITS_PER_LONG was defined as 64 on x32, so we did not have padding
      after the 64-bit __kernel_time_t fields. After __BITS_PER_LONG got
      changed to 32, applications would observe extra padding.

      In other parts of the uapi headers we seem to have a mix of those
      expecting either 32 or 64 on x32 applications, so we can't easily
      revert the patch that broke these two structures.
      
      Instead, this patch decouples x32 from the other architectures and moves
      it back into arch specific headers, partially reverting the even older
      commit 73a2d096 ("x86: remove all now-duplicate header files").
      
      It's not clear whether this ever made any difference, since at least
      glibc carries its own (correct) copy of both of these header files,
      so possibly no application has ever observed the definitions here.
      
      Based on a suggestion from H.J. Lu, I tried out the tool from
      https://github.com/hjl-tools/linux-header to find other such
      bugs, which pointed out the same bug in statfs(), which also has
      a separate (correct) copy in glibc.
      
      Fixes: f4b4aae1 ("x86/headers/uapi: Fix __BITS_PER_LONG value for x32 builds")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H . J . Lu" <hjl.tools@gmail.com>
      Cc: Jeffrey Walton <noloader@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180424212013.3967461-1-arnd@arndb.de
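
      The mechanism, sketched (illustrative, not the verbatim uapi header):
      the generic definition pads 64-bit time fields whenever
      __BITS_PER_LONG != 64, which is wrong for x32:

        struct shmid64_ds {
                struct ipc64_perm  shm_perm;
                size_t             shm_segsz;
                __kernel_time_t    shm_atime;   /* 64-bit even on x32 */
        #if __BITS_PER_LONG != 64
                unsigned long      __unused1;   /* pad x32 must not see */
        #endif
                /* ... */
        };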
    • x86/setup: Do not reserve a crash kernel region if booted on Xen PV · 3db3eb28
      Committed by Petr Tesarik
      Xen PV domains cannot shut down and start a crash kernel. Instead,
      the crashing kernel makes a SCHEDOP_shutdown hypercall with the
      reason code SHUTDOWN_crash, cf. xen_crash_shutdown() machine op in
      arch/x86/xen/enlighten_pv.c.
      
      A crash kernel reservation is merely a waste of RAM in this case. It
      may also confuse users of kexec_load(2) and/or kexec_file_load(2).
      When flags include KEXEC_ON_CRASH or KEXEC_FILE_ON_CRASH,
      respectively, these syscalls return success, which is technically
      correct, but the crash kexec image will never be actually used.
      Signed-off-by: Petr Tesarik <ptesarik@suse.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: xen-devel@lists.xenproject.org
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jean Delvare <jdelvare@suse.de>
      Link: https://lkml.kernel.org/r/20180425120835.23cef60c@ezekiel.suse.cz
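
      A sketch of the early-out, assuming it sits in reserve_crashkernel()
      (the message string is illustrative):

        if (xen_pv_domain()) {
                pr_info("Xen PV: skipping crashkernel reservation, "
                        "SHUTDOWN_crash is used instead\n");
                return;
        }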
    • arm64: avoid instrumenting atomic_ll_sc.o · 3789c122
      Committed by Mark Rutland
      Our out-of-line atomics are built with a special calling convention,
      preventing pointless stack spilling, and allowing us to patch call sites
      with ARMv8.1 atomic instructions.
      
      Instrumentation inserted by the compiler may result in calls to
      functions that do not follow this special calling convention,
      unexpectedly clobbering registers and causing assorted problems.
      
      For example, if a kernel is built with KCOV and ARM64_LSE_ATOMICS, the
      compiler inserts calls to __sanitizer_cov_trace_pc in the prologues of
      the atomic functions. This has been observed to result in spurious
      cmpxchg failures, leading to a hang early on in the boot process.
      
      This patch avoids such issues by preventing instrumentation of our
      out-of-line atomics.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
    • powerpc/kvm/booke: Fix altivec related build break · b2d7ecbe
      Committed by Laurentiu Tudor
      Add the missing "altivec unavailable" interrupt injection helper,
      thus fixing the linker error below:
      
        arch/powerpc/kvm/emulate_loadstore.o: In function `kvmppc_check_altivec_disabled':
        arch/powerpc/kvm/emulate_loadstore.c: undefined reference to `.kvmppc_core_queue_vec_unavail'
      
      Fixes: 09f98496 ("KVM: PPC: Book3S: Add MMIO emulation for VMX instructions")
      Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
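
      Presumably the helper mirrors the other booke irqprio helpers; a
      sketch:

        void kvmppc_core_queue_vec_unavail(struct kvm_vcpu *vcpu)
        {
                kvmppc_booke_queue_irqprio(vcpu,
                                           BOOKE_IRQPRIO_ALTIVEC_UNAVAIL);
        }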
    • powerpc: Fix deadlock with multiple calls to smp_send_stop · 6029755e
      Committed by Nicholas Piggin
      smp_send_stop can lock up the IPI path for any subsequent calls,
      because the receiving CPUs spin in their handler function. This
      started becoming a problem with the addition of an smp_send_stop
      call in the reboot path, because panics can reboot after doing
      their own smp_send_stop.
      
      The NMI IPI variant was fixed with ac61c115 ("powerpc: Fix
      smp_send_stop NMI IPI handling"), which leaves the smp_call_function
      variant.
      
      This is fixed by having smp_send_stop only ever do the
      smp_call_function once. This is a bit less robust than the NMI IPI
      fix, because any other call to smp_call_function after smp_send_stop
      could deadlock, but that has always been the case, and it has not
      been a problem before.
      
      Fixes: f2748bdf ("powerpc/powernv: Always stop secondaries before reboot/shutdown")
      Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
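
      A sketch of the one-shot guard:

        void smp_send_stop(void)
        {
                static bool stopped;

                /* a second IPI would wait forever: the targets spin in
                 * stop_this_cpu() and never unlock the call data */
                if (stopped)
                        return;
                stopped = true;

                smp_call_function(stop_this_cpu, NULL, 0);
        }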
    • x86/cpu/intel: Add missing TLB cpuid values · b837913f
      Committed by jacek.tomaka@poczta.fm
      Make the kernel print the correct number of TLB entries on Intel
      Xeon Phi 7210 (and others).
      
      Before:
      [ 0.320005] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
      After:
      [ 0.320005] Last level dTLB entries: 4KB 256, 2MB 128, 4MB 128, 1GB 16
      
      The entries do exist in the official Intel SDM, but the type column
      there is incorrect (it states "Cache" where it should read "TLB");
      the entries for the values 0x6B, 0x6C and 0x6D are, however,
      correctly described as 'Data TLB'.
      Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20180423161425.24366-1-jacekt@dugeo.com
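
      The new intel_tlb_table[] entries presumably look like this (entry
      counts taken from the commit message; descriptor strings are
      illustrative):

        { 0x6b, TLB_DATA_4K,    256, " TLB_DATA 4 KByte pages, 8-way" },
        { 0x6c, TLB_DATA_2M_4M, 128, " TLB_DATA 2 MByte/4 MByte pages" },
        { 0x6d, TLB_DATA_1G,     16, " TLB_DATA 1 GByte pages" },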
  14. 26 April 2018 (1 commit)