1. 29 Sep 2020, 15 commits
  2. 22 Sep 2020, 2 commits
  3. 28 Aug 2020, 7 commits
  4. 27 Aug 2020, 9 commits
    • Revert "powerpc/powernv/idle: Replace CPU feature check with PVR check" · 16d83a54
      Pratik Rajesh Sampat authored
      The cpuidle stop state implementation has minor optimizations for P10,
      where the hardware preserves more SPR registers than on P9. The
      current P9 driver works on P10, although it does a few extra
      save/restores. The P9 driver can provide the required power management
      features, such as SMT thread folding and core-level power savings, on
      a P10 platform.
      
      Until the P10 stop driver is available, revert the commit that
      restricted cpuidle to P9 systems and blocked all idle stop states on
      P10. With this fix, CPU idle states are enabled and tested on the P10
      platform.
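      
      In effect, the revert swaps the narrow PVR test back to the broader
      ISA feature test; a minimal sketch (function and macro names are
      assumptions based on the powernv idle code, not quoted from the
      patch):
      
      	/* Before the revert: P9 only, which blocks P10. */
      	if (pvr_version_is(PVR_POWER9))
      		pnv_power9_idle_init();
      
      	/* After the revert: any ISA v3.0 CPU, i.e. P9 and P10. */
      	if (cpu_has_feature(CPU_FTR_ARCH_300))
      		pnv_power9_idle_init();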
      
      This reverts commit 8747bf36.
      
      Fixes: 8747bf36 ("powerpc/powernv/idle: Replace CPU feature check with PVR check")
      Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
      Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200826082918.89306-1-psampat@linux.ibm.com
      16d83a54
    • powerpc/perf: Fix reading of MSR[HV/PR] bits in trace-imc · 82715a0f
      Athira Rajeev authored
      IMC trace mode uses the MSR[HV/PR] bits to set the cpumode for the
      instruction pointer captured in each sample. The bits are fetched from
      the third double word of the trace record. Reading the third double
      word of an IMC trace record should use be64_to_cpu() along with
      READ_ONCE() in order to fetch the correct MSR[HV/PR] bits. This patch
      makes that change.
      
      Currently we use PERF_RECORD_MISC_HYPERVISOR as the cpumode when
      MSR[HV] is 1 and MSR[PR] is 0, which means the address comes from a
      host counter. But using PERF_RECORD_MISC_HYPERVISOR for host counter
      data fails to resolve addresses to symbols during "perf report",
      because the perf tools side uses PERF_RECORD_MISC_KERNEL to represent
      host counter data. Therefore, fix the trace-imc sample data to use
      PERF_RECORD_MISC_KERNEL as the cpumode for host kernel information.
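      
      A hedged sketch of the corrected read and cpumode selection (the
      buffer layout and local names are illustrative assumptions):
      
      	u64 *mem = (u64 *)trace_record;           /* assumed record base */
      	u64 val = be64_to_cpu(READ_ONCE(mem[2])); /* third double word */
      
      	if ((val & MSR_HV) && !(val & MSR_PR))
      		/* Host counter data: KERNEL, not HYPERVISOR. */
      		header->misc |= PERF_RECORD_MISC_KERNEL;
      	else if (val & MSR_PR)
      		header->misc |= PERF_RECORD_MISC_USER;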
      
      Fixes: 77ca3951 ("powerpc/perf: Add kernel support for new MSR[HV PR] bits in trace-imc")
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1598424029-1662-1-git-send-email-atrajeev@linux.vnet.ibm.com
      82715a0f
    • powerpc/perf: Fix crashes with generic_compat_pmu & BHRB · b460b512
      Alexey Kardashevskiy authored
      The bhrb_filter_map (Branch History Rolling Buffer) callback is only
      defined in the power_pmu structs of raw CPUs. The "architected" CPUs
      use generic_compat_pmu, which does not have this callback, and crashes
      occur if a user tries to enable branch stack for an event.
      
      This adds a NULL pointer check for bhrb_filter_map(), which behaves as
      if the callback returned an error.
      
      This does not add the same check for config_bhrb(), as its only caller
      checks cpuhw->bhrb_users, which remains zero if bhrb_filter_map is
      NULL.
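      
      A minimal sketch of the guard, assuming it sits on the branch-stack
      setup path (the placement and error value are assumptions):
      
      	if (has_branch_stack(event)) {
      		/* generic_compat_pmu has no bhrb_filter_map callback. */
      		if (!ppmu->bhrb_filter_map)
      			return -EOPNOTSUPP;
      
      		cpuhw->bhrb_filter = ppmu->bhrb_filter_map(
      				event->attr.branch_sample_type);
      		if (cpuhw->bhrb_filter == -1)
      			return -EOPNOTSUPP;
      	}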
      
      Fixes: be80e758 ("powerpc/perf: Add generic compat mode pmu driver")
      Cc: stable@vger.kernel.org # v5.2+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200602025612.62707-1-aik@ozlabs.ru
      b460b512
    • powerpc/64s: Fix crash in load_fp_state() due to fpexc_mode · b91eb518
      Michael Ellerman authored
      The recent commit 01eb0187 ("powerpc/64s: Fix restore_math
      unnecessarily changing MSR") changed some of the handling of floating
      point/vector restore.
      
      In particular, it caused current->thread.fpexc_mode to be copied into
      the current MSR (via msr_check_and_set()), rather than just into
      regs->msr (which is moved into the MSR on return to userspace).
      
      This can lead to a crash in the kernel if we take a floating point
      exception when restoring FPSCR:
      
        Oops: Exception in kernel mode, sig: 8 [#1]
        LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
        Modules linked in:
        CPU: 3 PID: 101213 Comm: ld64.so.2 Not tainted 5.9.0-rc1-00098-g18445bf4-dirty #9
        NIP:  c00000000000fbb4 LR: c00000000001a7ac CTR: c000000000183570
        REGS: c0000016b7cfb3b0 TRAP: 0700   Not tainted  (5.9.0-rc1-00098-g18445bf4-dirty)
        MSR:  900000000290b933 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44002444  XER: 00000000
        CFAR: c00000000001a7a8 IRQMASK: 1
        GPR00: c00000000001ae40 c0000016b7cfb640 c0000000011b7f00 c000001542a0f740
        GPR04: c000001542a0f720 c000001542a0eb00 0000000000000900 c000001542a0eb00
        GPR08: 000000000000000a 0000000000002000 9000000000009033 0000000000000000
        GPR12: 0000000000004000 c0000017ffffd900 0000000000000001 c000000000df5a58
        GPR16: c000000000e19c18 c0000000010e1123 0000000000000001 c000000000e1a638
        GPR20: 0000000000000000 c0000000044b1d00 0000000000000000 c000001542a0f2a0
        GPR24: 00000016c7fe0000 c000001542a0f720 c000000001c93da0 c000000000fe5f28
        GPR28: c000001542a0f720 0000000000800000 c0000016b7cfbe90 0000000002802900
        NIP load_fp_state+0x4/0x214
        LR  restore_math+0x17c/0x1f0
        Call Trace:
          0xc0000016b7cfb680 (unreliable)
          __switch_to+0x330/0x460
          __schedule+0x318/0x920
          schedule+0x74/0x140
          schedule_timeout+0x318/0x3f0
          wait_for_completion+0xc8/0x210
          call_usermodehelper_exec+0x234/0x280
          do_coredump+0xedc/0x13c0
          get_signal+0x1d4/0xbe0
          do_notify_resume+0x1a0/0x490
          interrupt_exit_user_prepare+0x1c4/0x230
          interrupt_return+0x14/0x1c0
        Instruction dump:
        ebe10168 e88101a0 7c8ff120 382101e0 e8010010 7c0803a6 4e800020 790605c4
        782905c4 7c0008a8 7c0008a8 c8030200 <fffe058e> 48000088 c8030000 c8230010
      
      Fix it by loading the fpexc_mode value only into regs->msr.
      
      Also add a comment to explain that although VSX is subject to the
      value of fpexc_mode, we don't have to handle that separately because
      we only allow VSX to be enabled if FP is also enabled.
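      
      A hedged sketch of the idea in restore_math() (variable names
      assumed): keep the FE0/FE1 bits out of the live MSR so loading the
      FPSCR cannot raise a floating point exception in the kernel, and fold
      them only into the user-visible MSR image:
      
      	unsigned long fpexc = 0;
      
      	if (new_msr & MSR_FP)
      		fpexc = current->thread.fpexc_mode; /* user FE0/FE1 bits */
      
      	msr_check_and_set(new_msr);    /* live MSR: FE bits stay clear */
      	load_fp_state(&current->thread.fp_state);
      	regs->msr |= new_msr | fpexc;  /* applied on return to userspace */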
      
      Fixes: 01eb0187 ("powerpc/64s: Fix restore_math unnecessarily changing MSR")
      Reported-by: Milton Miller <miltonm@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Link: https://lore.kernel.org/r/20200825093424.3967813-1-mpe@ellerman.id.au
      b91eb518
    • powerpc/64s: scv entry should set PPR · e5fe5609
      Nicholas Piggin authored
      Kernel entry sets PPR to HMT_MEDIUM by convention. The scv entry
      path missed this.
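      
      For reference, HMT_MEDIUM is the SMT priority nop "or 2,2,2" (per
      arch/powerpc/include/asm/ppc_asm.h); a hedged sketch of what the scv
      path was missing, written as inline asm rather than the real assembly
      entry code:
      
      	/* Set SMT thread priority to medium, as other entry paths do. */
      	asm volatile("or 2,2,2");   /* HMT_MEDIUM */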
      
      Fixes: 7fa95f9a ("powerpc/64s: system call support for scv/rfscv instructions")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200825075309.224184-1-npiggin@gmail.com
      e5fe5609
    • x86/irq: Unbreak interrupt affinity setting · e027ffff
      Thomas Gleixner authored
      Several people reported that 5.8 broke the interrupt affinity setting
      mechanism.
      
      The consolidation of the entry code reused the regular exception entry
      code for device interrupts and changed the way the vector number is
      conveyed: from ptregs->orig_ax to a function argument.
      
      The low level entry code uses the hardware error code slot to push the
      vector number onto the stack, from where it is retrieved into a
      function argument; the slot on the stack is then set to -1.
      
      The reason for setting it to -1 is that the error code slot is at the
      position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax
      indicates that the entry came via a syscall. If it's not set to a negative
      value then a signal delivery on return to userspace would try to restart a
      syscall. But there are other places which rely on pt_regs::orig_ax being a
      valid indicator for syscall entry.
      
      But setting pt_regs::orig_ax to -1 has a nasty side effect vs. the
      interrupt affinity setting mechanism, which was overlooked when this change
      was made.
      
      Moving interrupts on x86 happens in several steps. A new vector on a
      different CPU is allocated and the relevant interrupt source is
      reprogrammed to that. But that's racy and there might be an interrupt
      already in flight to the old vector. So the old vector is preserved until
      the first interrupt arrives on the new vector and the new target CPU. Once
      that happens the old vector is cleaned up, but this cleanup still depends
      on the vector number being stored in pt_regs::orig_ax, which is now -1.
      
      That -1 makes the cleanup check pt_regs::orig_ax == new_vector always
      false. As a consequence the interrupt is moved once, but then it can
      never be moved again, because the cleanup of the old vector never
      happens.
      
      There would be several ways to convey the vector information to that
      place in the guts of the interrupt handling, but on deeper inspection
      it turned out that this check is pointless: it is a leftover from the
      old x86 affinity model, which supported multi-CPU affinities. Under
      that model it was possible for an interrupt to have an old and a new
      vector on the same CPU, so the vector match was required.
      
      Under the new model the effective affinity of an interrupt is always a
      single CPU from the requested affinity mask. If the affinity mask changes
      then either the interrupt stays on the CPU and on the same vector when that
      CPU is still in the new affinity mask or it is moved to a different CPU, but
      it is never moved to a different vector on the same CPU.
      
      Ergo the cleanup check for the matching vector number is not required
      and can be removed, which makes the dependency on pt_regs::orig_ax go
      away.
      
      The remaining check for new_cpu == smp_processor_id() is completely
      sufficient. If it matches, the interrupt was successfully migrated and
      the cleanup can proceed.
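      
      A hedged sketch of the simplified cleanup condition (the struct and
      helper names are assumptions modelled on the vector management code):
      
      	/* The old condition also required a vector match, which the -1
      	 * in pt_regs::orig_ax made permanently false. Single-CPU
      	 * effective affinity guarantees the vector never changes while
      	 * the interrupt stays on the same CPU, so this alone suffices:
      	 */
      	if (new_cpu == smp_processor_id())
      		free_moved_vector(apicd);   /* release the old vector */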
      
      For paranoia's sake, add a warning to the vector assignment code to
      validate the assumption that an interrupt is never moved to a
      different vector on the same CPU.
      
      Fixes: 633260fa ("x86/irq: Convey vector as argument and not in ptregs")
      Reported-by: Alex bykov <alex.bykov@scylladb.com>
      Reported-by: Avi Kivity <avi@scylladb.com>
      Reported-by: Alexander Graf <graf@amazon.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Alexander Graf <graf@amazon.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/87wo1ltaxz.fsf@nanos.tec.linutronix.de
      e027ffff
    • x86/hotplug: Silence APIC only after all interrupts are migrated · 52d6b926
      Ashok Raj authored
      There is a race when taking a CPU offline. The current code looks like this:
      
      native_cpu_disable()
      {
      	...
      	apic_soft_disable();
      	/*
      	 * Any existing set bits for pending interrupt to
      	 * this CPU are preserved and will be sent via IPI
      	 * to another CPU by fixup_irqs().
      	 */
      	cpu_disable_common();
      	{
      		....
      		/*
      		 * Race window happens here. Once local APIC has been
      		 * disabled any new interrupts from the device to
      		 * the old CPU are lost
      		 */
      		fixup_irqs(); // Too late to capture anything in IRR.
      		...
      	}
      }
      
      The fix is to disable the APIC *after* cpu_disable_common().
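      
      A hedged sketch of the reordering (details of the real
      native_cpu_disable() are elided):
      
      	int native_cpu_disable(void)
      	{
      		...
      		/* Migrate interrupts while the local APIC can still
      		 * latch anything pending in the IRR.
      		 */
      		cpu_disable_common();   /* runs fixup_irqs() */
      
      		/* Only now silence the APIC; nothing new can be lost. */
      		apic_soft_disable();
      		return 0;
      	}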
      
      Testing was done with a USB NIC that provided a source of frequent
      interrupts. A script migrated interrupts to a specific CPU and
      then took that CPU offline.
      
      Fixes: 60dcaad5 ("x86/hotplug: Silence APIC and NMI when CPU is dead")
      Reported-by: Evan Green <evgreen@chromium.org>
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Mathias Nyman <mathias.nyman@linux.intel.com>
      Tested-by: Evan Green <evgreen@chromium.org>
      Reviewed-by: Evan Green <evgreen@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/
      Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok.raj@intel.com
      52d6b926
    • s390/vmem: fix vmem_add_range for 4-level paging · bffc2f7a
      Vasily Gorbik authored
      The kernel currently crashes if 4-level paging is used. Add the
      missing p4d_populate() for the just-allocated pud entry.
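      
      A hedged sketch of the missing step in the vmem_add_range() page
      table walk (the allocator name is an assumption):
      
      	p4d = p4d_offset(pgd, addr);
      	if (p4d_none(*p4d)) {
      		pud = vmem_crst_alloc(_REGION3_ENTRY_EMPTY);
      		if (!pud)
      			goto out;
      		p4d_populate(&init_mm, p4d, pud); /* the missing call */
      	}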
      
      Fixes: 3e0d3e40 ("s390/vmem: consolidate vmem_add_range() and vmem_remove_range()")
      Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      bffc2f7a
    • s390: don't trace preemption in percpu macros · 1196f12a
      Sven Schnelle authored
      Since commit a21ee605 ("lockdep: Change hardirq{s_enabled,_context}
      to per-cpu variables") the lockdep code itself uses percpu variables.
      This leads to recursion, because the percpu macros call
      preempt_enable(), which might call trace_preempt_on().
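      
      A hedged sketch of the shape of the fix: switch the percpu access
      macros to the non-tracing preempt helpers so lockdep's own percpu
      reads cannot recurse into the preemption tracer (the macro body is
      illustrative, not the real s390 definition):
      
      	#define arch_this_cpu_read_example(pcp)			\
      	({							\
      		typeof(pcp) __val;				\
      		preempt_disable_notrace(); /* was preempt_disable() */ \
      		__val = *raw_cpu_ptr(&(pcp));			\
      		preempt_enable_notrace();  /* was preempt_enable() */ \
      		__val;						\
      	})
      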
      Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
      Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      1196f12a
  5. 26 Aug 2020, 7 commits