Commits · 396d2e878f92ec108e4293f1c77ea3bc90b414ff · openeuler / Kernel

19 Dec, 2019 1 commit

kvm: x86: Host feature SSBD doesn't imply guest feature SPEC_CTRL_SSBD · 396d2e87

Authored by Jim Mattson on Dec 13, 2019

The host reports support for the synthetic feature X86_FEATURE_SSBD
when any of the three following hardware features are set:
  CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31]
  CPUID.80000008H:EBX.AMD_SSBD[bit 24]
  CPUID.80000008H:EBX.VIRT_SSBD[bit 25]

Either of the first two hardware features implies the existence of the
IA32_SPEC_CTRL MSR, but CPUID.80000008H:EBX.VIRT_SSBD[bit 25] does
not. Therefore, CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31] should only be
set in the guest if CPUID.(EAX=7,ECX=0):EDX.SSBD[bit 31] or
CPUID.80000008H:EBX.AMD_SSBD[bit 24] is set on the host.
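
The rule above can be sketched as a small Python model. This is only an illustration, not KVM's code: the function names are hypothetical, and only the three CPUID bit positions are taken from the commit message.

```python
# Bit positions quoted in the commit message above.
CPUID_7_0_EDX_SSBD = 1 << 31             # CPUID.(EAX=7,ECX=0):EDX.SSBD
CPUID_8000_0008_EBX_AMD_SSBD = 1 << 24   # CPUID.80000008H:EBX.AMD_SSBD
CPUID_8000_0008_EBX_VIRT_SSBD = 1 << 25  # CPUID.80000008H:EBX.VIRT_SSBD


def host_has_synthetic_ssbd(host_7_0_edx: int, host_8000_0008_ebx: int) -> bool:
    """Hypothetical model of the synthetic host feature: any of the three bits."""
    return bool(host_7_0_edx & CPUID_7_0_EDX_SSBD) or bool(
        host_8000_0008_ebx
        & (CPUID_8000_0008_EBX_AMD_SSBD | CPUID_8000_0008_EBX_VIRT_SSBD)
    )


def guest_may_see_spec_ctrl_ssbd(host_7_0_edx: int, host_8000_0008_ebx: int) -> bool:
    """Hypothetical model of the fixed rule: only the two bits that imply
    the existence of the IA32_SPEC_CTRL MSR count; VIRT_SSBD does not."""
    return bool(host_7_0_edx & CPUID_7_0_EDX_SSBD) or bool(
        host_8000_0008_ebx & CPUID_8000_0008_EBX_AMD_SSBD
    )
```

A host that only reports VIRT_SSBD has the synthetic feature, yet the guest bit stays clear, which is the case the commit addresses.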

Fixes: 0c54914d ("KVM: x86: use Intel speculation bugs and features as derived in generic x86 code")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Jacob Xu <jacobhxu@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Reported-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

21 Nov, 2019 5 commits

KVM: vmx: use MSR_IA32_TSX_CTRL to hard-disable TSX on guest that lack it · b07a5c53

Authored by Paolo Bonzini on Nov 18, 2019

If X86_FEATURE_RTM is disabled, the guest should not be able to access
MSR_IA32_TSX_CTRL.  We can therefore use it in KVM to force all
transactions from the guest to abort.

Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality · c11f83e0

Authored by Paolo Bonzini on Nov 18, 2019

The current guest mitigation of TAA is both too heavy and not really
sufficient.  It is too heavy because it will cause some affected CPUs
(those that have MDS_NO but lack TAA_NO) to fall back to VERW and
get the corresponding slowdown.  It is not really sufficient because
it will cause the MDS_NO bit to disappear upon microcode update, so
that VMs started before the microcode update will not be runnable
anymore afterwards, even with tsx=on.

Instead, if tsx=on on the host, we can emulate MSR_IA32_TSX_CTRL for
the guest and let it run without the VERW mitigation.  Even though
MSR_IA32_TSX_CTRL is quite heavyweight, and we do not want to write
it on every vmentry, we can use the shared MSR functionality because
the host kernel need not protect itself from TSX-based side-channels.

Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: implement MSR_IA32_TSX_CTRL effect on CPUID · edef5c36

Authored by Paolo Bonzini on Nov 18, 2019

Because KVM always emulates CPUID, the CPUID clear bit
(bit 1) of MSR_IA32_TSX_CTRL must be emulated "manually"
by the hypervisor when performing said emulation.
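
A minimal sketch of what emulating the CPUID clear bit "manually" could look like, written as a Python model rather than kernel code. The function name is hypothetical. The text only states that the CPUID clear bit is bit 1; the positions of the HLE and RTM feature bits in CPUID.(EAX=7,ECX=0):EBX (bits 4 and 11) and the RTM-disable bit 0 are assumptions taken from Intel's documentation, not from this commit.

```python
TSX_CTRL_RTM_DISABLE = 1 << 0  # assumption: bit 0 of MSR_IA32_TSX_CTRL
TSX_CTRL_CPUID_CLEAR = 1 << 1  # bit 1, as stated in the commit message

CPUID_7_0_EBX_HLE = 1 << 4     # assumption: HLE feature bit
CPUID_7_0_EBX_RTM = 1 << 11    # assumption: RTM feature bit


def emulate_cpuid_7_0_ebx(ebx: int, guest_tsx_ctrl: int) -> int:
    """Hypothetical helper: hide HLE and RTM from the guest's CPUID view
    when the guest has set the CPUID clear bit in its TSX_CTRL value."""
    if guest_tsx_ctrl & TSX_CTRL_CPUID_CLEAR:
        ebx &= ~(CPUID_7_0_EBX_HLE | CPUID_7_0_EBX_RTM)
    return ebx
```

The point of the sketch is that the hypervisor, not the hardware, applies the masking, because the guest never executes the real CPUID instruction unintercepted.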

Right now neither kvm-intel.ko nor kvm-amd.ko implements
MSR_IA32_TSX_CTRL, but this will change in the next patch.

Reviewed-by: Jim Mattson <jmattson@google.com>
Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: do not modify masked bits of shared MSRs · de1fca5d

Authored by Paolo Bonzini on Nov 18, 2019

"Shared MSRs" are guest MSRs that are written to the host MSRs but
keep their value until the next return to userspace.  They support
a mask, so that some bits keep the host value, but this mask is
only used to skip an unnecessary MSR write and the value written
to the MSR is always the guest MSR.

Fix this and, while at it, do not update smsr->values[slot].curr if
for whatever reason the wrmsr fails.  This should only happen due to
reserved bits, so the value written to smsr->values[slot].curr
will not match the host value when the user-return notifier runs, and
the host value will always be restored.  However, it is untidy and in
rare cases this can actually avoid spurious WRMSRs on return to userspace.
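
The intended behavior can be sketched with a toy Python model. It is an assumption-laden illustration, not the kernel implementation: `SharedMsrSlot`, `set_shared_msr` and the `wrmsr` callback are hypothetical names, and only the `host`/`curr` fields mirror identifiers from the text.

```python
class SharedMsrSlot:
    """Toy stand-in for one smsr->values[slot] entry."""

    def __init__(self, host_value: int):
        self.host = host_value  # value the host wants in the MSR
        self.curr = host_value  # value currently believed to be in the MSR


def set_shared_msr(slot: SharedMsrSlot, guest_value: int, mask: int, wrmsr) -> bool:
    """Write the guest value, keeping the host value for bits outside `mask`.

    `wrmsr` is a hypothetical callable that returns True on success.
    """
    value = (guest_value & mask) | (slot.host & ~mask)
    if value == slot.curr:
        return True          # nothing changes, skip the MSR write
    if not wrmsr(value):
        return False         # failed write: leave slot.curr untouched
    slot.curr = value
    return True
```

With a host value of 0xF0 and a mask of 0x03, a guest value of 0x0F results in 0xF3 being written: the two masked bits come from the guest and the rest from the host.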

Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES · cbbaa272

Authored by Paolo Bonzini on Nov 18, 2019

KVM does not implement MSR_IA32_TSX_CTRL, so it must not be presented
to the guests.  It is also confusing to have !ARCH_CAP_TSX_CTRL_MSR &&
!RTM && ARCH_CAP_TAA_NO: lack of MSR_IA32_TSX_CTRL suggests TSX was not
hidden (it actually was), yet the value says that TSX is not vulnerable
to microarchitectural data sampling.  Fix both.

Cc: stable@vger.kernel.org
Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

14 Nov, 2019 1 commit

KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast() · ed69a6cb

Authored by Sean Christopherson on Nov 13, 2019

Acquire the per-VM slots_lock when zapping all shadow pages as part of
toggling nx_huge_pages.  The fast zap algorithm relies on exclusivity
(via slots_lock) to identify obsolete vs. valid shadow pages, because it
uses a single bit for its generation number. Holding slots_lock also
obviates the need to acquire a read lock on the VM's srcu.

Failing to take slots_lock when toggling nx_huge_pages allows multiple
instances of kvm_mmu_zap_all_fast() to run concurrently, as the other
user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock.
(kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be
temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough
to enforce exclusivity).

Concurrent fast zap instances cause obsolete shadow pages to be
incorrectly identified as valid due to the single bit generation number
wrapping, which results in stale shadow pages being left in KVM's MMU
and leads to all sorts of undesirable behavior.
The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and
toggling nx_huge_pages via its module param.
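
Why a single-bit generation number needs exclusivity can be seen in a toy Python model. The class and method names are hypothetical and the model ignores everything about the real MMU except the one-bit generation.

```python
class ToyMmu:
    """Toy model: each shadow page remembers the generation it was created in."""

    def __init__(self):
        self.valid_gen = 0   # one-bit generation number
        self.page_gens = []  # generation of every existing shadow page

    def add_page(self):
        self.page_gens.append(self.valid_gen)

    def begin_fast_zap(self):
        # Flip the single bit: all existing pages become obsolete.
        self.valid_gen ^= 1

    def obsolete_pages(self):
        return [g for g in self.page_gens if g != self.valid_gen]


mmu = ToyMmu()
for _ in range(3):
    mmu.add_page()

mmu.begin_fast_zap()
obsolete_after_one_zap = len(mmu.obsolete_pages())

# A second zap starts before the first one has freed the obsolete pages:
# the bit wraps and the stale pages look valid again.
mmu.begin_fast_zap()
obsolete_after_two_zaps = len(mmu.obsolete_pages())
```

In the model, three pages are obsolete after the first zap and none after the overlapping second one, which corresponds to the stale shadow pages described above.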

Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when
using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used
an ulong-sized generation instead of relying on exclusivity for
correctness, but all callers except the recently added set_nx_huge_pages()
needed to hold slots_lock anyways.  Therefore, this patch does not have
to be backported to stable kernels.

Given that toggling nx_huge_pages is by no means a fast path, force it
to conform to the current approach instead of reintroducing the previous
generation count.

Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE)
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

13 Nov, 2019 2 commits

KVM: X86: Reset the three MSR list number variables to 0 in kvm_init_msr_list() · 6cbee2b9

Authored by Xiaoyao Li on Nov 13, 2019

Commit 7a5ee6ed ("KVM: X86: Fix initialization of MSR lists"), as
applied, forgot to reset the three MSR list number variables to 0
while removing the useless conditionals.

Fixes: 7a5ee6ed (KVM: X86: Fix initialization of MSR lists)
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

kvm: x86: disable shattered huge page recovery for PREEMPT_RT. · 13fb5927

Authored by Paolo Bonzini on Nov 13, 2019

If a huge page is recovered (and becomes non-executable) while another
thread is executing it, the resulting contention on mmu_lock can cause
latency spikes.  Disabling recovery for PREEMPT_RT kernels fixes this
issue.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

12 Nov, 2019 6 commits

KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved · a78986aa

Authored by Sean Christopherson on Nov 11, 2019

Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
instead manually handle ZONE_DEVICE on a case-by-case basis. For things
like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
pages, e.g. put pages grabbed via gup(). But for flows such as setting
A/D bits or shifting refcounts for transparent huge pages, KVM needs to
avoid processing ZONE_DEVICE pages as the flows in question lack the
underlying machinery for proper handling of ZONE_DEVICE pages.

This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
when running a KVM guest backed with /dev/dax memory, as KVM straight up
doesn't put any references to ZONE_DEVICE pages acquired by gup().

Note, Dan Williams proposed an alternative solution of doing put_page()
on ZONE_DEVICE pages immediately after gup() in order to simplify the
auditing needed to ensure is_zone_device_page() is called if and only if
the backing device is pinned (via gup()). But that approach would break
kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
unmap() when accessing guest memory, unlike KVM's secondary MMU, which
coordinates with mmu_notifier invalidations to avoid creating stale
page references, i.e. doesn't rely on pages being pinned.

[*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

Reported-by: Adam Borowski <kilobyte@angband.pl>
Analyzed-by: David Hildenbrand <david@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: stable@vger.kernel.org
Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: VMX: Introduce pi_is_pir_empty() helper · 29881b6e

Authored by Joao Martins on Nov 11, 2019

Streamline the PID.PIR check and change its call sites to use
the newly added helper.

Suggested-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: VMX: Do not change PID.NDST when loading a blocked vCPU · 132194ff

Authored by Joao Martins on Nov 11, 2019

When a vCPU enters the block phase, pi_pre_block() inserts it into a
per-pCPU linked list of all vCPUs that are blocked on this pCPU. It then
changes PID.NV to POSTED_INTR_WAKEUP_VECTOR, whose handler
(wakeup_handler()) is responsible for kicking (unblocking) any vCPU on
that linked list that now has pending posted interrupts.

While a vCPU is blocked (in kvm_vcpu_block()), it may be preempted, which
causes vmx_vcpu_pi_put() to set PID.SN.  If the vCPU is later scheduled
to run on a different pCPU, vmx_vcpu_pi_load() clears PID.SN but also
*overwrites PID.NDST with this different pCPU* instead of keeping the
original pCPU on which the vCPU entered the block phase.

This is a problem because, when a posted interrupt is delivered,
wakeup_handler() runs on the new pCPU and fails to find the blocked vCPU
on that pCPU's linked list: the vCPU was placed on a *different* per-pCPU
linked list, i.e. that of the original pCPU on which it entered the
block phase.

The regression is introduced by commit c112b5f5 ("KVM: x86:
Recompute PID.ON when clearing PID.SN"). Therefore, partially revert
it and reintroduce the condition in vmx_vcpu_pi_load() responsible for
avoiding changing PID.NDST when loading a blocked vCPU.

Fixes: c112b5f5 ("KVM: x86: Recompute PID.ON when clearing PID.SN")
Tested-by: Nathan Ni <nathan.ni@oracle.com>
Co-developed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts · 9482ae45

Authored by Joao Martins on Nov 11, 2019

Commit 17e433b5 ("KVM: Fix leak vCPU's VMCS value into other pCPU")
introduced vmx_dy_apicv_has_pending_interrupt() in order to determine
if a vCPU has a pending posted interrupt. This routine is used by
kvm_vcpu_on_spin() when searching for a new runnable vCPU to schedule
on a pCPU instead of a vCPU that is busy-looping.

vmx_dy_apicv_has_pending_interrupt() determines if a
vCPU has a pending posted interrupt solely based on PID.ON. However,
when a vCPU is preempted, vmx_vcpu_pi_put() sets PID.SN, which causes
raised posted interrupts to only set a bit in PID.PIR without setting
PID.ON (and without sending a notification vector), as described in VT-d
manual section 5.2.3 "Interrupt-Posting Hardware Operation".

Therefore, checking PID.ON is insufficient to determine if a vCPU has
pending posted interrupts and instead we should also check if there is
some bit set on PID.PIR if PID.SN=1.
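
The combined check can be modeled in a few lines of Python. This is a sketch under stated assumptions: the dataclass and `has_pending_posted_interrupt()` are hypothetical, `pi_is_pir_empty()` only borrows the name of the helper introduced in 29881b6e, and the PIR is represented as a plain integer bitmap.

```python
from dataclasses import dataclass


@dataclass
class PostedInterruptDescriptor:
    pir: int = 0      # bitmap of pending vectors (PID.PIR)
    on: bool = False  # Outstanding Notification (PID.ON)
    sn: bool = False  # Suppress Notification (PID.SN)


def pi_is_pir_empty(pid: PostedInterruptDescriptor) -> bool:
    """True when no vector is pending in the PIR bitmap."""
    return pid.pir == 0


def has_pending_posted_interrupt(pid: PostedInterruptDescriptor) -> bool:
    """PID.ON alone is not enough: with PID.SN set, interrupts only show
    up as bits in PID.PIR, so the bitmap must be consulted as well."""
    return pid.on or (pid.sn and not pi_is_pir_empty(pid))
```

A preempted vCPU with `sn=True`, `on=False` and one bit set in `pir` is reported as having a pending interrupt, which a check on `on` alone would miss.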

Fixes: 17e433b5 ("KVM: Fix leak vCPU's VMCS value into other pCPU")
Reviewed-by: Jagannathan Raman <jag.raman@oracle.com>
Co-developed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON · d9ff2744

Authored by Liran Alon on Nov 11, 2019

The Outstanding Notification (ON) bit is part of the Posted Interrupt
Descriptor (PID) as opposed to the Posted Interrupts Register (PIR).
The latter is a bitmap for pending vectors.

Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: X86: Fix initialization of MSR lists · 7a5ee6ed

Authored by Chenyi Qiang on Nov 06, 2019

The three MSR lists (msrs_to_save[], emulated_msrs[] and
msr_based_features[]) are global arrays of kvm.ko. They are adjusted
(supported MSRs are copied forward to overwrite the unsupported ones)
when kvm-{intel,amd}.ko is loaded with insmod, but they are not reset
to their initial values when kvm-{intel,amd}.ko is removed with rmmod.
Thus, on the next load, kvm-{intel,amd}.ko operates on the modified
arrays, with some MSRs lost and some MSRs duplicated.

So define three constant arrays to hold the initial MSR lists and
initialize msrs_to_save[], emulated_msrs[] and msr_based_features[]
based on the constant arrays.
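
The approach can be illustrated with a toy Python model for one of the three lists. Everything here is a hypothetical sketch: the class, the example MSR indices and the `is_supported` callback are not taken from the kernel. The explicit counter reset corresponds to the follow-up fix 6cbee2b9 listed earlier on this page.

```python
# Constant "initial" list; the indices are arbitrary example values.
MSRS_TO_SAVE_ALL = (0x174, 0x175, 0x176, 0x277)


class MsrLists:
    """Toy stand-in for the kvm.ko globals that survive a vendor module reload."""

    def __init__(self):
        self.msrs_to_save = [0] * len(MSRS_TO_SAVE_ALL)
        self.num_msrs_to_save = 0

    def init_msr_list(self, is_supported):
        # Start from scratch on every load instead of compacting in place.
        self.num_msrs_to_save = 0
        for msr in MSRS_TO_SAVE_ALL:
            if is_supported(msr):
                self.msrs_to_save[self.num_msrs_to_save] = msr
                self.num_msrs_to_save += 1

    def saved(self):
        return self.msrs_to_save[: self.num_msrs_to_save]
```

Because the filtered list is rebuilt from the constant tuple each time, a second load with a different set of supported MSRs neither loses nor duplicates entries.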

Cc: stable@vger.kernel.org
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
[Remove now useless conditionals. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

07 Nov, 2019 2 commits

x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs · 012206a8

Authored by Josh Poimboeuf on Nov 06, 2019

For new IBRS_ALL CPUs, the Enhanced IBRS check at the beginning of
cpu_bugs_smt_update() causes the function to return early, unintentionally
skipping the MDS and TAA logic.

This is not a problem for MDS, because there appears to be no overlap
between IBRS_ALL and MDS-affected CPUs.  So the MDS mitigation would be
disabled and nothing would need to be done in this function anyway.

But for TAA, the TAA_MSG_SMT string will never get printed on Cascade
Lake and newer.

The check is superfluous anyway: when 'spectre_v2_enabled' is
SPECTRE_V2_IBRS_ENHANCED, 'spectre_v2_user' is always
SPECTRE_V2_USER_NONE, and so the 'spectre_v2_user' switch statement
handles it appropriately by doing nothing.  So just remove the check.

Fixes: 1b42f017 ("x86/speculation/taa: Add mitigation for TSX Async Abort")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Borislav Petkov <bp@suse.de>

arm64: Do not mask out PTE_RDONLY in pte_same() · 6767df24

Authored by Catalin Marinas on Nov 06, 2019

Following commit 73e86cb0 ("arm64: Move PTE_RDONLY bit handling out
of set_pte_at()"), the PTE_RDONLY bit is no longer managed by
set_pte_at() but built into the PAGE_* attribute definitions.
Consequently, pte_same() must include this bit when checking two PTEs
for equality.

Remove the arm64-specific pte_same() function, practically reverting
commit 747a70e6 ("arm64: Fix copy-on-write referencing in HugeTLB").

Fixes: 73e86cb0 ("arm64: Move PTE_RDONLY bit handling out of set_pte_at()")
Cc: <stable@vger.kernel.org> # 4.14.x-
Cc: Will Deacon <will@kernel.org>
Cc: Steve Capper <steve.capper@arm.com>
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>

06 Nov, 2019 4 commits

ARM: dts: stm32: change joystick pinctrl definition on stm32mp157c-ev1 · f4d6e0f7

Authored by Amelie Delaunay on Nov 04, 2019

Pins used for joystick are all configured as input. "push-pull" is not a
valid setting for an input pin.

Fixes: a502b343 ("pinctrl: stmfx: update pinconf settings")
Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: Amelie Delaunay <amelie.delaunay@st.com>
Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com>

ARM: dts: stm32: remove OV5640 pinctrl definition on stm32mp157c-ev1 · afe3af89

Authored by Amelie Delaunay on Nov 04, 2019

"push-pull" configuration is now fully handled by the gpiolib and the
STMFX pinctrl driver. There is no longer any need to declare a pinctrl
group only to configure the "push-pull" setting for the line. It is done
directly by the gpiolib.

Fixes: a502b343 ("pinctrl: stmfx: update pinconf settings")
Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: Amelie Delaunay <amelie.delaunay@st.com>
Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com>

ARM: dts: stm32: Fix CAN RAM mapping on stm32mp157c · 9df50c2e

Authored by Christophe Roullier on Nov 04, 2019

Split the 10 Kbyte CAN message RAM so that the FDCAN1 and FDCAN2
instances can be used simultaneously.
The first 5 Kbytes are allocated to FDCAN1 and the last 5 Kbytes are used
for FDCAN2. To do so, set the offset to 0x1400 in mram-cfg for FDCAN2.
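
The 0x1400 offset follows directly from splitting the message RAM in half. A short Python check of the arithmetic (the function name is hypothetical and only the sizes come from the text):

```python
MESSAGE_RAM_BYTES = 10 * 1024  # 10 Kbytes shared by both instances


def split_message_ram(total_bytes: int, instances: int) -> list:
    """Return the byte offset of each instance's equal share of the RAM."""
    share = total_bytes // instances
    return [i * share for i in range(instances)]


fdcan1_offset, fdcan2_offset = split_message_ram(MESSAGE_RAM_BYTES, 2)
```

Here `fdcan1_offset` is 0 and `fdcan2_offset` is 5120 bytes, i.e. 0x1400.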

Fixes: d44d6e02 ("ARM: dts: stm32: change CAN RAM mapping on stm32mp157c")
Signed-off-by: Christophe Roullier <christophe.roullier@st.com>
Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com>

ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157 · 832c4365

Authored by Patrice Chotard on Oct 04, 2019

Relax qspi pins slew-rate to minimize peak currents.

Fixes: 84403005 ("ARM: dts: stm32: add flash nor support on stm32mp157c eval board")
Signed-off-by: Patrice Chotard <patrice.chotard@st.com>
Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com>

05 Nov, 2019 5 commits

x86/tsc: Respect tsc command line paraemeter for clocksource_tsc_early · 63ec58b4

Authored by Michael Zhivich on Oct 24, 2019

The introduction of clocksource_tsc_early broke the functionality of
"tsc=reliable" and "tsc=nowatchdog" command line parameters, since
clocksource_tsc_early is unconditionally registered with
CLOCK_SOURCE_MUST_VERIFY and thus put on the watchdog list.

This can cause the TSC to be declared unstable during boot:

  clocksource: timekeeping watchdog on CPU0: Marking clocksource
               'tsc-early' as unstable because the skew is too large:
  clocksource: 'refined-jiffies' wd_now: fffb7018 wd_last: fffb6e9d
               mask: ffffffff
  clocksource: 'tsc-early' cs_now: 68a6a7070f6a0 cs_last: 68a69ab6f74d6
               mask: ffffffffffffffff
  tsc: Marking TSC unstable due to clocksource watchdog

The corresponding elapsed times are cs_nsec=1224152026 and wd_nsec=378942392, so
the watchdog differs from TSC by 0.84 seconds.
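
The figures in the log can be checked with a few lines of Python. The variable names are illustrative; the numbers are copied from the message above.

```python
# Elapsed times quoted in the commit message, in nanoseconds.
cs_nsec = 1224152026  # elapsed according to 'tsc-early'
wd_nsec = 378942392   # elapsed according to the 'refined-jiffies' watchdog

skew_ns = cs_nsec - wd_nsec
skew_seconds = skew_ns / 1e9

# Raw watchdog readings from the log, with the 32-bit mask applied.
wd_now, wd_last, wd_mask = 0xFFFB7018, 0xFFFB6E9D, 0xFFFFFFFF
wd_delta_ticks = (wd_now - wd_last) & wd_mask
```

The skew comes out at 845209634 ns, roughly the 0.84 seconds stated above, and the watchdog advanced by only 379 ticks over the interval.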

This happens when HPET is not available, so jiffies are used as the TSC
watchdog instead, and the jiffies update is not happening due to lost
timer interrupts in periodic mode. That can happen e.g. with expensive
debug mechanisms enabled or under massive overload conditions in
virtualized environments.

Before the introduction of the early TSC clocksource the command line
parameters "tsc=reliable" and "tsc=nowatchdog" could be used to work around
this issue.

Restore the behaviour by disabling the watchdog if requested on the kernel
command line.

[ tglx: Clarify changelog ]

Fixes: aa83c457 ("x86/tsc: Introduce early tsc clocksource")
Signed-off-by: Michael Zhivich <mzhivich@akamai.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191024175945.14338-1-mzhivich@akamai.com

x86/dumpstack/64: Don't evaluate exception stacks before setup · e361362b

Authored by Thomas Gleixner on Oct 23, 2019

Cyrill reported the following crash:

  BUG: unable to handle page fault for address: 0000000000001ff0
  #PF: supervisor read access in kernel mode
  RIP: 0010:get_stack_info+0xb3/0x148

It turns out that if the stack tracer is invoked before the exception stack
mappings are initialized, in_exception_stack() can erroneously classify an
invalid address as an address inside of an exception stack:

    begin = this_cpu_read(cea_exception_stacks);  <- 0
    end = begin + sizeof(exception stacks);

i.e. any address between 0 and end will be considered an exception stack
address and the subsequent code will then try to dereference the resulting
stack frame at an unmapped address.

 end = begin + (unsigned long)ep->size;
     ==> end = 0x2000

 regs = (struct pt_regs *)end - 1;
     ==> regs = 0x2000 - sizeof(struct pt_regs) = 0x1f58

 info->next_sp   = (unsigned long *)regs->sp;
     ==> Crashes due to reading regs->sp at 0x1ff0

Prevent this by checking the validity of the cea_exception_stack base
address and bailing out if it is zero.
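
The faulting address can be reproduced with a small Python calculation. This is a sketch under assumptions that are not stated in the commit message: it takes the x86-64 `struct pt_regs` to consist of 21 eight-byte fields with `sp` as the 20th, and it reduces the range check to a single hypothetical function.

```python
PT_REGS_SIZE = 21 * 8       # assumption: sizeof(struct pt_regs) on x86-64
PT_REGS_SP_OFFSET = 19 * 8  # assumption: offsetof(struct pt_regs, sp)


def regs_sp_address(stack_end: int) -> int:
    """Address read by regs->sp when regs = (struct pt_regs *)end - 1."""
    regs = stack_end - PT_REGS_SIZE
    return regs + PT_REGS_SP_OFFSET


def in_exception_stack_range(addr: int, base: int, total_size: int) -> bool:
    """Hypothetical simplification of the range check, with the fix applied:
    a zero base means the mappings are not set up yet, so bail out."""
    if base == 0:
        return False
    return base <= addr < base + total_size
```

With `end = 0x2000` as in the message, `regs_sp_address()` yields 0x1ff0, matching the address in the page fault report.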

Fixes: afcd21da ("x86/dumpstack/64: Use cpu_entry_area instead of orig_ist")
Reported-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1910231950590.1852@nanos.tec.linutronix.de

x86/apic/32: Avoid bogus LDR warnings · fe6f85ca

Authored by Jan Beulich on Oct 29, 2019

The removal of the LDR initialization in the bigsmp_32 APIC code unearthed
a problem in setup_local_APIC().

The code checks unconditionally for a mismatch of the logical APIC id by
comparing the early APIC id which was initialized in get_smp_config() with
the actual LDR value in the APIC.

Due to the removal of the bogus LDR initialization the check now can
trigger on bigsmp_32 APIC systems emitting a warning for every booting
CPU. This is of course a false positive because the APIC is not using
logical destination mode.

Restrict the check and the possibly resulting fixup to systems which are
actually using the APIC in logical destination mode.

[ tglx: Massaged changelog and added Cc stable ]

Fixes: bae3a8d3 ("x86/apic: Do not initialize LDR and DFR for bigsmp")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/666d8f91-b5a8-1afd-7add-821e72a35f03@suse.com

timekeeping/vsyscall: Update VDSO data unconditionally · 52338415

Authored by Huacai Chen on Oct 24, 2019

The update of the VDSO data depends on __arch_use_vsyscall() returning
true. This is a leftover from the attempt to map the features of various
architectures 1:1 into generic code.

The usage of __arch_use_vsyscall() in the actual vsyscall implementations
got dropped and replaced by the requirement for the architecture code to
return U64_MAX if the global clocksource is not usable in the VDSO.

But the __arch_use_vsyscall() check in the update code stayed, which causes
the VDSO data to be stale or invalid when an architecture actually
implements that function and returns false when the current clocksource is
not usable in the VDSO.

As a consequence the VDSO implementations of clock_getres(), time(),
clock_gettime(CLOCK_.*_COARSE) operate on invalid data and return bogus
information.

Remove the __arch_use_vsyscall() check from the VDSO update function and
update the VDSO data unconditionally.

[ tglx: Massaged changelog and removed the now useless implementations in
  asm-generic/ARM64/MIPS ]

Fixes: 44f57d78 ("timekeeping: Provide a generic update_vsyscall() implementation")
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Burton <paul.burton@mips.com>
Cc: linux-mips@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1571887709-11447-1-git-send-email-chenhc@lemote.com

kvm: x86: mmu: Recovery of shattered NX large pages · 1aa9b957

Authored by Junaid Shahid on Nov 04, 2019

The page table pages corresponding to broken down large pages are zapped in
FIFO order, so that the large page can potentially be recovered, if it is
no longer being used for execution.  This removes the performance penalty
for walking deeper EPT page tables.

By default, one large page will last about one hour once the guest
reaches a steady state.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

04 Nov, 2019 5 commits

kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830

Authored by Paolo Bonzini on Nov 04, 2019

With some Intel processors, putting the same virtual address in the TLB
as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
and cause the processor to issue a machine check resulting in a CPU lockup.

Unfortunately when EPT page tables use huge pages, it is possible for a
malicious guest to cause this situation.

Add a knob to mark huge pages as non-executable. When the nx_huge_pages
parameter is enabled (and we are using EPT), all huge pages are marked as
NX. If the guest attempts to execute in one of those pages, the page is
broken down into 4K pages, which are then marked executable.
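
The knob's behavior can be pictured with a toy Python model. The function names and the dictionary layout are hypothetical; only the 2 MiB and 4 KiB page sizes and the described behavior come from the text.

```python
SMALL_PAGES_PER_HUGE_PAGE = (2 * 1024 * 1024) // (4 * 1024)  # 512


def map_huge_page(nx_huge_pages: bool) -> dict:
    """Return a toy 2 MiB mapping; it is executable only if the knob is off."""
    return {"size": "2M", "executable": not nx_huge_pages}


def handle_exec_fault(mapping: dict) -> list:
    """On an instruction fetch from an NX huge page, break it down into
    4K pages that are marked executable."""
    if mapping["executable"]:
        return [mapping]  # nothing to do, the fetch is allowed
    return [
        {"size": "4K", "executable": True}
        for _ in range(SMALL_PAGES_PER_HUGE_PAGE)
    ]
```

In this model the instruction TLB never sees an executable 2 MiB translation while the knob is on, which is the condition the mitigation relies on.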

This is not an issue for shadow paging (except nested EPT), because then
the host is in control of TLB flushes and the problematic situation cannot
happen.  With nested EPT, the nested guest can again cause problems, so
shadow and direct EPT are treated in the same way.

[ tglx: Fixup default to auto and massage wording a bit ]

Originally-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86/cpu: Add Tremont to the cpu vulnerability whitelist · cad14885

Authored by Pawan Gupta on Nov 04, 2019

Add the new cpu family ATOM_TREMONT_D to the cpu vulnerability
whitelist. ATOM_TREMONT_D is not affected by X86_BUG_ITLB_MULTIHIT.

ATOM_TREMONT_D might have mitigations against other issues as well, but
only the ITLB multihit mitigation is confirmed at this point.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86/bugs: Add ITLB_MULTIHIT bug infrastructure · db4d30fb

Authored by Vineela Tummalapalli on Nov 04, 2019

Some processors may incur a machine check error possibly resulting in an
unrecoverable CPU lockup when an instruction fetch encounters a TLB
multi-hit in the instruction TLB. This can occur when the page size is
changed along with either the physical address or cache type. The relevant
erratum can be found here:

   https://bugzilla.kernel.org/show_bug.cgi?id=205195

There are other processors affected for which the erratum does not fully
disclose the impact.

This issue affects both bare-metal x86 page tables and EPT.

It can be mitigated by either eliminating the use of large pages or by
using careful TLB invalidations when changing the page size in the page
tables.

Just like Spectre, Meltdown, L1TF and MDS, a new bit has been allocated in
MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will be set on CPUs which
are mitigated against this issue.

Signed-off-by: Vineela Tummalapalli <vineela.tummalapalli@intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

arm64: dts: zii-ultra: fix ARM regulator GPIO handle · f852497c

Authored by Lucas Stach on Oct 30, 2019

The GPIO handle is referencing the wrong GPIO, so the voltage did not
actually change as intended. The pinmux is already correct, so just
correct the GPIO number.

Fixes: 4a13b3be (arm64: dts: imx: add Zii Ultra board support)
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>

x86/resctrl: Prevent NULL pointer dereference when reading mondata · 26467b0f

Authored by Xiaochen Shen on Oct 29, 2019

When a mon group is being deleted, rdtgrp->flags is first set to
RDT_DELETED in rdtgroup_rmdir_mon(). The rdtgrp structure is not freed
until rdtgrp->waitcount drops to 0 in rdtgroup_kn_unlock() later.

During the window of deleting a mon group, if an application calls
rdtgroup_mondata_show() to read mondata under this mon group,
'rdtgrp' returned from rdtgroup_kn_lock_live() is a NULL pointer when
rdtgrp->flags is RDT_DELETED. And then 'rdtgrp' is passed in this path:
rdtgroup_mondata_show() --> mon_event_read() --> mon_event_count().
Thus it results in NULL pointer dereference in mon_event_count().

Check 'rdtgrp' in rdtgroup_mondata_show(), and return -ENOENT
immediately when reading mondata during the window of deleting a mon
group.
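
The shape of the fix can be sketched in Python. These are stand-ins that only borrow the kernel function names for readability; the class, the flag value and the returned count are hypothetical.

```python
import errno

RDT_DELETED = 1  # flag value chosen for the sketch


class RdtGroup:
    def __init__(self, flags: int = 0):
        self.flags = flags


def rdtgroup_kn_lock_live(rdtgrp: RdtGroup):
    """Stand-in: return None for a group that is already being deleted."""
    return None if rdtgrp.flags & RDT_DELETED else rdtgrp


def mon_event_count(rdtgrp: RdtGroup) -> int:
    """Stand-in for the reader that dereferences the group."""
    return rdtgrp.flags  # would fail on None, like the NULL dereference


def rdtgroup_mondata_show(rdtgrp: RdtGroup) -> int:
    live = rdtgroup_kn_lock_live(rdtgrp)
    if live is None:
        return -errno.ENOENT  # bail out during the deletion window
    return mon_event_count(live)
```

Reading mondata from a group flagged RDT_DELETED then returns -ENOENT instead of passing a null group further down the call chain.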

Fixes: d89b7379 ("x86/intel_rdt/cqm: Add mon_data")
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: pei.p.jia@intel.com
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/1572326702-27577-1-git-send-email-xiaochen.shen@intel.com

02 Nov, 2019 1 commit

powerpc/bpf: Fix tail call implementation · 7de08690

Authored by Eric Dumazet on Oct 31, 2019

We have seen many crashes on powerpc hosts while loading bpf programs.

The problem here is that bpf_int_jit_compile() does a first pass
to compute the program length.

Then it allocates memory to store the generated program and
calls bpf_jit_build_body() a second time (and a third time
later).

What I have observed is that the second bpf_jit_build_body()
could end up using a few more words than expected.

If bpf_jit_binary_alloc() puts the space for the program
at the end of the allocated page, we then write to
unmapped memory.

It appears that bpf_jit_emit_tail_call() calls
bpf_jit_emit_common_epilogue() while ctx->seen might not
be stable.

Only after the second pass can we be sure ctx->seen won't be changed.

Trying to avoid a second pass seems quite complex and probably
not worth it.

Fixes: ce076141 ("powerpc/bpf: Implement support for tail calls")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191101033444.143741-1-edumazet@google.com

01 Nov, 2019 6 commits

arm64: apply ARM64_ERRATUM_843419 workaround for Brahma-B53 core · 1cf45b8f

Authored by Florian Fainelli on Oct 31, 2019

The Broadcom Brahma-B53 core is susceptible to the issue described by
ARM64_ERRATUM_843419 so this commit enables the workaround to be applied
when executing on that core.

Since there are now multiple entries to match, we must convert the
existing ARM64_ERRATUM_843419 into an erratum list and use
cpucap_multi_entry_cap_matches to match our entries.
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NWill Deacon <will@kernel.org>

1cf45b8f

arm64: Brahma-B53 is SSB and spectre v2 safe · e059770c

Committed by Florian Fainelli on Oct 31, 2019

Add the Brahma-B53 CPU (all versions) to the whitelists of CPUs for the
SSB and spectre v2 mitigations.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Will Deacon <will@kernel.org>

e059770c

arm64: apply ARM64_ERRATUM_845719 workaround for Brahma-B53 core · bfc97f9f

Committed by Doug Berger on Oct 31, 2019

The Broadcom Brahma-B53 core is susceptible to the issue described by
ARM64_ERRATUM_845719 so this commit enables the workaround to be applied
when executing on that core.

Since there are now multiple entries to match, we must convert the
existing ARM64_ERRATUM_845719 into an erratum list.
Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Will Deacon <will@kernel.org>

bfc97f9f

s390/idle: fix cpu idle time calculation · 3d7efa4e

Committed by Heiko Carstens on Oct 28, 2019

The idle time reported in /proc/stat sometimes incorrectly contains
huge values on s390. This is caused by a bug in arch_cpu_idle_time().

The kernel tries to figure out when a different cpu entered idle by
accessing its per-cpu data structure. There is an ordering problem: if
the remote cpu has an idle_enter value which is not zero, and an
idle_exit value which is zero, it is assumed it is idle since
"now". The "now" timestamp however is taken before the idle_enter
value is read.

Which in turn means that "now" can be smaller than idle_enter of the
remote cpu. Unconditionally subtracting idle_enter from "now" can thus
lead to a negative value (aka large unsigned value).

Fix this by moving the get_tod_clock() invocation out of the
loop. While at it also make the code a bit more readable.

A similar bug also exists for show_idle_time(). Fix this as well.

Cc: <stable@vger.kernel.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>

3d7efa4e
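
The "negative value (aka large unsigned value)" mentioned in this commit follows directly from unsigned arithmetic. The sketch below uses hypothetical helper names and plain integers instead of TOD clock values. The guarded variant is only an assumption made for illustration: the commit itself addresses the ordering by changing when get_tod_clock() is called, not necessarily by adding such a comparison.

```c
#include <stdint.h>

/*
 * Time spent in the current or last idle period, computed naively.
 * idle_enter == 0 means the CPU is not idle; idle_exit == 0 means it
 * has not left idle yet, so "now" is used as the end of the period.
 */
static uint64_t toy_idle_naive(uint64_t now, uint64_t idle_enter,
			       uint64_t idle_exit)
{
	if (idle_enter == 0)
		return 0;
	if (idle_exit != 0)
		return idle_exit - idle_enter;
	/* Wraps around to a huge value when now < idle_enter. */
	return now - idle_enter;
}

/* Same computation, but a stale "now" yields 0 instead of wrapping. */
static uint64_t toy_idle_guarded(uint64_t now, uint64_t idle_enter,
				 uint64_t idle_exit)
{
	if (idle_enter == 0)
		return 0;
	if (idle_exit != 0)
		return idle_exit - idle_enter;
	return now > idle_enter ? now - idle_enter : 0;
}
```

If "now" was sampled at 100 and the remote CPU entered idle at 105, the naive version reports a value just below 2^64, which is the kind of huge idle time the commit describes in /proc/stat.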

s390/unwind: fix mixing regs and sp · a1d863ac

Committed by Ilya Leoshkevich on Oct 02, 2019

unwind_for_each_frame stops after the first frame if regs->gprs[15] <=
sp.

The reason is that in case regs are specified, the first frame should be
regs->psw.addr and the second frame should be sp->gprs[8]. However,
currently the second frame is regs->gprs[15], which confuses
outside_of_stack().

Fix by introducing a flag to distinguish this special case from
unwinding the interrupt handler, for which the current behavior is
appropriate.

Fixes: 78c98f90 ("s390/unwind: introduce stack unwind API")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: stable@vger.kernel.org # v5.2+
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>

a1d863ac

s390/cmm: fix information leak in cmm_timeout_handler() · b8e51a6a

Committed by Yihui ZENG on Oct 25, 2019

The problem is that we were putting the NUL terminator too far:

	buf[sizeof(buf) - 1] = '\0';

If the user input isn't NUL terminated and they haven't initialized the
whole buffer then it leads to an info leak.  The NUL terminator should
be:

	buf[len - 1] = '\0';
Signed-off-by: Yihui Zeng <yzeng56@asu.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[heiko.carstens@de.ibm.com: keep semantics of how *lenp and *ppos are handled]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>

b8e51a6a
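
The effect of where the NUL terminator is placed can be shown in user space. The sketch below is not the s390 cmm handler: toy_read_input() is a hypothetical helper and memcpy() stands in for the copy from user space. It assumes, as in the commit message, that the input is not NUL terminated and that the destination buffer holds stale data beyond the copied bytes.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy at most bufsz bytes of unterminated input into buf and place
 * the terminator at the end of the copied data, not at the end of the
 * buffer. As with buf[len - 1] = '\0' in the commit, the last copied
 * byte (typically a trailing newline) is overwritten.
 * Returns the number of bytes consumed from src.
 */
static size_t toy_read_input(char *buf, size_t bufsz,
			     const char *src, size_t srclen)
{
	size_t len = srclen < bufsz ? srclen : bufsz;

	if (len == 0) {
		if (bufsz > 0)
			buf[0] = '\0';
		return 0;
	}
	memcpy(buf, src, len);
	buf[len - 1] = '\0';
	return len;
}
```

Terminating at buf[bufsz - 1] instead would leave every stale byte between len and the end of the buffer inside the string, and any later parsing or printing of buf could then expose that data.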

31 Oct, 2019 2 commits

arm64: cpufeature: Enable Qualcomm Falkor errata 1009 for Kryo · 36c602dc

Committed by Bjorn Andersson on Oct 29, 2019

The Kryo cores share errata 1009 with Falkor, so add their model
definitions and enable it for them as well.
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
[will: Update entry in silicon-errata.rst]
Signed-off-by: Will Deacon <will@kernel.org>

36c602dc

KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active · 9167ab79

Committed by Paolo Bonzini on Oct 27, 2019

VMX already does so if the host has SMEP, in order to support the combination of
CR0.WP=1 and CR4.SMEP=1. However, it is perfectly safe to always do so, and in
fact VMX already ends up running with EFER.NXE=1 on old processors that lack the
"load EFER" controls, because it may help avoiding a slow MSR write. Removing
all the conditionals simplifies the code.

SVM does not have similar code, but it should since recent AMD processors do
support SMEP. So this patch also makes the code for the two vendors more similar
while fixing NPT=0, CR0.WP=1 and CR4.SMEP=1 on AMD processors.

Cc: stable@vger.kernel.org
Cc: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

9167ab79

openeuler / Kernel: synced successfully more than 1 year ago
