提交 · b65399f6111b03df870d8daa75b8d140c0de93f4 · openeuler / Kernel

11 9月, 2020 1 次提交

arm64/mm: Change THP helpers to comply with generic MM semantics · b65399f6

由 Anshuman Khandual 提交于 9月 09, 2020

pmd_present() and pmd_trans_huge() are expected to behave in the following
manner during various phases of a given PMD. It is derived from a previous
detailed discussion on this topic [1] and present THP documentation [2].

pmd_present(pmd):

- Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
- Returns false if pmd refers to a migration or swap entry

pmd_trans_huge(pmd):

- Returns true if pmd refers to system RAM and is a trans huge mapping

-------------------------------------------------------------------------
|	PMD states	|	pmd_present	|	pmd_trans_huge	|
-------------------------------------------------------------------------
|	Mapped		|	Yes		|	Yes		|
-------------------------------------------------------------------------
|	Splitting	|	Yes		|	Yes		|
-------------------------------------------------------------------------
|	Migration/Swap	|	No		|	No		|
-------------------------------------------------------------------------

The problem:

PMD is first invalidated with pmdp_invalidate() before it's splitting. This
invalidation clears PMD_SECT_VALID as below.

PMD Split -> pmdp_invalidate() -> pmd_mkinvalid -> Clears PMD_SECT_VALID

Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
on the PMD entry. It will need another bit apart from PMD_SECT_VALID to re-
affirm pmd_present() as true during the THP split process. To comply with
above mentioned semantics, pmd_trans_huge() should also check pmd_present()
first before testing presence of an actual transparent huge mapping.

The solution:

Ideally PMD_TYPE_SECT should have been used here instead. But it shares the
bit position with PMD_SECT_VALID which is used for THP invalidation. Hence
it will not be there for pmd_present() check after pmdp_invalidate().

A new software defined PMD_PRESENT_INVALID (bit 59) can be set on the PMD
entry during invalidation which can help pmd_present() return true and in
recognizing the fact that it still points to memory.

This bit is transient. During the split process it will be overridden by a
page table page representing normal pages in place of erstwhile huge page.
Other pmdp_invalidate() callers always write a fresh PMD value on the entry
overriding this transient PMD_PRESENT_INVALID bit, which makes it safe.

[1]: https://lkml.org/lkml/2018/10/17/231
[2]: https://www.kernel.org/doc/Documentation/vm/transhuge.txtSigned-off-by: NAnshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki Poulose <suzuki.poulose@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/r/1599627183-14453-2-git-send-email-anshuman.khandual@arm.comSigned-off-by: NWill Deacon <will@kernel.org>

b65399f6

08 9月, 2020 1 次提交

arm64/mm/ptdump: Add address markers for BPF regions · c048ddf8

由 Anshuman Khandual 提交于 9月 04, 2020

Kernel virtual region [BPF_JIT_REGION_START..BPF_JIT_REGION_END] is missing
from address_markers[], hence relevant page table entries are not displayed
with /sys/kernel/debug/kernel_page_tables. This adds those missing markers.
While here, also rename arch/arm64/mm/dump.c which sounds bit ambiguous, as
arch/arm64/mm/ptdump.c instead.
Signed-off-by: NAnshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: NSteven Price <steven.price@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/r/1599208259-11191-1-git-send-email-anshuman.khandual@arm.comSigned-off-by: NWill Deacon <will@kernel.org>

c048ddf8

28 8月, 2020 7 次提交

KVM: arm64: Set HCR_EL2.PTW to prevent AT taking synchronous exception · 71a7f8cb

由 James Morse 提交于 8月 21, 2020

AT instructions do a translation table walk and return the result, or
the fault in PAR_EL1. KVM uses these to find the IPA when the value is
not provided by the CPU in HPFAR_EL1.

If a translation table walk causes an external abort it is taken as an
exception, even if it was due to an AT instruction. (DDI0487F.a's D5.2.11
"Synchronous faults generated by address translation instructions")

While we previously made KVM resilient to exceptions taken due to AT
instructions, the device access causes mismatched attributes, and may
occur speculatively. Prevent this, by forbidding a walk through memory
described as device at stage2. Now such AT instructions will report a
stage2 fault.

Such a fault will cause KVM to restart the guest. If the AT instructions
always walk the page tables, but guest execution uses the translation cached
in the TLB, the guest can't make forward progress until the TLB entry is
evicted. This isn't a problem, as since commit 5dcd0fdb ("KVM: arm64:
Defer guest entry when an asynchronous exception is pending"), KVM will
return to the host to process IRQs allowing the rest of the system to keep
running.

Cc: stable@vger.kernel.org # <v5.3: 5dcd0fdb ("KVM: arm64: Defer guest entry when an asynchronous exception is pending")
Signed-off-by: NJames Morse <james.morse@arm.com>
Reviewed-by: NMarc Zyngier <maz@kernel.org>
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>

71a7f8cb

KVM: arm64: Survive synchronous exceptions caused by AT instructions · 88a84ccc

由 James Morse 提交于 8月 21, 2020

KVM doesn't expect any synchronous exceptions when executing, any such
exception leads to a panic(). AT instructions access the guest page
tables, and can cause a synchronous external abort to be taken.

The arm-arm is unclear on what should happen if the guest has configured
the hardware update of the access-flag, and a memory type in TCR_EL1 that
does not support atomic operations. B2.2.6 "Possible implementation
restrictions on using atomic instructions" from DDI0487F.a lists
synchronous external abort as a possible behaviour of atomic instructions
that target memory that isn't writeback cacheable, but the page table
walker may behave differently.

Make KVM robust to synchronous exceptions caused by AT instructions.
Add a get_user() style helper for AT instructions that returns -EFAULT
if an exception was generated.

While KVM's version of the exception table mixes synchronous and
asynchronous exceptions, only one of these can occur at each location.

Re-enter the guest when the AT instructions take an exception on the
assumption the guest will take the same exception. This isn't guaranteed
to make forward progress, as the AT instructions may always walk the page
tables, but guest execution may use the translation cached in the TLB.

This isn't a problem, as since commit 5dcd0fdb ("KVM: arm64: Defer guest
entry when an asynchronous exception is pending"), KVM will return to the
host to process IRQs allowing the rest of the system to keep running.

88a84ccc

KVM: arm64: Add kvm_extable for vaxorcism code · e9ee186b

由 James Morse 提交于 8月 21, 2020

KVM has a one instruction window where it will allow an SError exception
to be consumed by the hypervisor without treating it as a hypervisor bug.
This is used to consume asynchronous external abort that were caused by
the guest.

As we are about to add another location that survives unexpected exceptions,
generalise this code to make it behave like the host's extable.

KVM's version has to be mapped to EL2 to be accessible on nVHE systems.

The SError vaxorcism code is a one instruction window, so has two entries
in the extable. Because the KVM code is copied for VHE and nVHE, we end up
with four entries, half of which correspond with code that isn't mapped.
Signed-off-by: NJames Morse <james.morse@arm.com>
Reviewed-by: NMarc Zyngier <maz@kernel.org>
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>

e9ee186b

arm64: vdso32: make vdso32 install conditional · 5d28ba5f

由 Frank van der Linden 提交于 8月 27, 2020

vdso32 should only be installed if CONFIG_COMPAT_VDSO is enabled,
since it's not even supposed to be compiled otherwise, and arm64
builds without a 32bit crosscompiler will fail.

Fixes: 8d75785a ("ARM64: vdso32: Install vdso32 from vdso_install")
Signed-off-by: NFrank van der Linden <fllinden@amazon.com>
Cc: stable@vger.kernel.org [5.4+]
Link: https://lore.kernel.org/r/20200827234012.19757-1-fllinden@amazon.comSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com>

5d28ba5f

arm64: use a common .arch preamble for inline assembly · 1764c3ed

由 Sami Tolvanen 提交于 8月 27, 2020

Commit 7c78f67e ("arm64: enable tlbi range instructions") breaks
LLVM's integrated assembler, because -Wa,-march is only passed to
external assemblers and therefore, the new instructions are not enabled
when IAS is used.

This change adds a common architecture version preamble, which can be
used in inline assembly blocks that contain instructions that require
a newer architecture version, and uses it to fix __TLBI_0 and __TLBI_1
with ARM64_TLB_RANGE.

Fixes: 7c78f67e ("arm64: enable tlbi range instructions")
Signed-off-by: NSami Tolvanen <samitolvanen@google.com>
Tested-by: NNathan Chancellor <natechancellor@gmail.com>
Reviewed-by: NNathan Chancellor <natechancellor@gmail.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1106
Link: https://lore.kernel.org/r/20200827203608.1225689-1-samitolvanen@google.comSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com>

1764c3ed

powerpc/32s: Disable VMAP stack which CONFIG_ADB_PMU · 4a133eb3

由 Christophe Leroy 提交于 8月 27, 2020

low_sleep_handler() can't restore the context from virtual
stack because the stack can hardly be accessed with MMU OFF.

For now, disable VMAP stack when CONFIG_ADB_PMU is selected.

Fixes: cd08f109 ("powerpc/32s: Enable CONFIG_VMAP_STACK")
Cc: stable@vger.kernel.org # v5.6+
Reported-by: NGiuseppe Sacco <giuseppe@sguazz.it>
Signed-off-by: NChristophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ec96c15bfa1a7415ab604ee1c98cd45779c08be0.1598553015.git.christophe.leroy@csgroup.eu

4a133eb3

arm64/cpuinfo: Remove unnecessary fallthrough annotation · c165a08d

由 Gustavo A. R. Silva 提交于 8月 27, 2020

Fallthrough annotations for consecutive default and case labels
are not necessary.
Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

c165a08d

27 8月, 2020 9 次提交

Revert "powerpc/powernv/idle: Replace CPU feature check with PVR check" · 16d83a54

由 Pratik Rajesh Sampat 提交于 8月 26, 2020

cpuidle stop state implementation has minor optimizations for P10
where hardware preserves more SPR registers compared to P9. The
current P9 driver works for P10, although does few extra
save-restores. P9 driver can provide the required power management
features like SMT thread folding and core level power savings on a P10
platform.

Until the P10 stop driver is available, revert the commit which allows
for only P9 systems to utilize cpuidle and blocks all idle stop states
for P10. CPU idle states are enabled and tested on the P10 platform
with this fix.

This reverts commit 8747bf36.

Fixes: 8747bf36 ("powerpc/powernv/idle: Replace CPU feature check with PVR check")
Signed-off-by: NPratik Rajesh Sampat <psampat@linux.ibm.com>
Reviewed-by: NVaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200826082918.89306-1-psampat@linux.ibm.com

16d83a54

powerpc/perf: Fix reading of MSR[HV/PR] bits in trace-imc · 82715a0f

由 Athira Rajeev 提交于 8月 26, 2020

IMC trace-mode uses MSR[HV/PR] bits to set the cpumode for the
instruction pointer captured in each sample. The bits are fetched from
the third double word of the trace record. Reading third double word
from IMC trace record should use be64_to_cpu() along with READ_ONCE
inorder to fetch correct MSR[HV/PR] bits. Patch addresses this change.

Currently we are using PERF_RECORD_MISC_HYPERVISOR as cpumode if MSR
HV is 1 and PR is 0 which means the address is from host counter. But
using PERF_RECORD_MISC_HYPERVISOR for host counter data will fail to
resolve the address -> symbol during "perf report" because perf tools
side uses PERF_RECORD_MISC_KERNEL to represent the host counter data.
Therefore, fix the trace imc sample data to use
PERF_RECORD_MISC_KERNEL as cpumode for host kernel information.

Fixes: 77ca3951 ("powerpc/perf: Add kernel support for new MSR[HV PR] bits in trace-imc")
Signed-off-by: NAthira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1598424029-1662-1-git-send-email-atrajeev@linux.vnet.ibm.com

82715a0f

powerpc/perf: Fix crashes with generic_compat_pmu & BHRB · b460b512

由 Alexey Kardashevskiy 提交于 6月 02, 2020

The bhrb_filter_map ("The Branch History Rolling Buffer") callback is
only defined in raw CPUs' power_pmu structs. The "architected" CPUs
use generic_compat_pmu, which does not have this callback, and crashes
occur if a user tries to enable branch stack for an event.

This add a NULL pointer check for bhrb_filter_map() which behaves as
if the callback returned an error.

This does not add the same check for config_bhrb() as the only caller
checks for cpuhw->bhrb_users which remains zero if bhrb_filter_map==0.

Fixes: be80e758 ("powerpc/perf: Add generic compat mode pmu driver")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NMadhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200602025612.62707-1-aik@ozlabs.ru

b460b512

powerpc/64s: Fix crash in load_fp_state() due to fpexc_mode · b91eb518

由 Michael Ellerman 提交于 8月 25, 2020

The recent commit 01eb0187 ("powerpc/64s: Fix restore_math
unnecessarily changing MSR") changed some of the handling of floating
point/vector restore.

In particular it caused current->thread.fpexc_mode to be copied into
the current MSR (via msr_check_and_set()), rather than just into
regs->msr (which is moved into MSR on return to userspace).

This can lead to a crash in the kernel if we take a floating point
exception when restoring FPSCR:

  Oops: Exception in kernel mode, sig: 8 [#1]
  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in:
  CPU: 3 PID: 101213 Comm: ld64.so.2 Not tainted 5.9.0-rc1-00098-g18445bf4-dirty #9
  NIP:  c00000000000fbb4 LR: c00000000001a7ac CTR: c000000000183570
  REGS: c0000016b7cfb3b0 TRAP: 0700   Not tainted  (5.9.0-rc1-00098-g18445bf4-dirty)
  MSR:  900000000290b933 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44002444  XER: 00000000
  CFAR: c00000000001a7a8 IRQMASK: 1
  GPR00: c00000000001ae40 c0000016b7cfb640 c0000000011b7f00 c000001542a0f740
  GPR04: c000001542a0f720 c000001542a0eb00 0000000000000900 c000001542a0eb00
  GPR08: 000000000000000a 0000000000002000 9000000000009033 0000000000000000
  GPR12: 0000000000004000 c0000017ffffd900 0000000000000001 c000000000df5a58
  GPR16: c000000000e19c18 c0000000010e1123 0000000000000001 c000000000e1a638
  GPR20: 0000000000000000 c0000000044b1d00 0000000000000000 c000001542a0f2a0
  GPR24: 00000016c7fe0000 c000001542a0f720 c000000001c93da0 c000000000fe5f28
  GPR28: c000001542a0f720 0000000000800000 c0000016b7cfbe90 0000000002802900
  NIP load_fp_state+0x4/0x214
  LR  restore_math+0x17c/0x1f0
  Call Trace:
    0xc0000016b7cfb680 (unreliable)
    __switch_to+0x330/0x460
    __schedule+0x318/0x920
    schedule+0x74/0x140
    schedule_timeout+0x318/0x3f0
    wait_for_completion+0xc8/0x210
    call_usermodehelper_exec+0x234/0x280
    do_coredump+0xedc/0x13c0
    get_signal+0x1d4/0xbe0
    do_notify_resume+0x1a0/0x490
    interrupt_exit_user_prepare+0x1c4/0x230
    interrupt_return+0x14/0x1c0
  Instruction dump:
  ebe10168 e88101a0 7c8ff120 382101e0 e8010010 7c0803a6 4e800020 790605c4
  782905c4 7c0008a8 7c0008a8 c8030200 <fffe058e> 48000088 c8030000 c8230010

Fix it by only loading the fpexc_mode value into regs->msr.

Also add a comment to explain that although VSX is subject to the
value of fpexc_mode, we don't have to handle that separately because
we only allow VSX to be enabled if FP is also enabled.

Fixes: 01eb0187 ("powerpc/64s: Fix restore_math unnecessarily changing MSR")
Reported-by: NMilton Miller <miltonm@us.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/20200825093424.3967813-1-mpe@ellerman.id.au

b91eb518

powerpc/64s: scv entry should set PPR · e5fe5609

由 Nicholas Piggin 提交于 8月 25, 2020

Kernel entry sets PPR to HMT_MEDIUM by convention. The scv entry
path missed this.

Fixes: 7fa95f9a ("powerpc/64s: system call support for scv/rfscv instructions")
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200825075309.224184-1-npiggin@gmail.com

e5fe5609

x86/irq: Unbreak interrupt affinity setting · e027ffff

由 Thomas Gleixner 提交于 8月 26, 2020

Several people reported that 5.8 broke the interrupt affinity setting
mechanism.

The consolidation of the entry code reused the regular exception entry code
for device interrupts and changed the way how the vector number is conveyed
from ptregs->orig_ax to a function argument.

The low level entry uses the hardware error code slot to push the vector
number onto the stack which is retrieved from there into a function
argument and the slot on stack is set to -1.

The reason for setting it to -1 is that the error code slot is at the
position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax
indicates that the entry came via a syscall. If it's not set to a negative
value then a signal delivery on return to userspace would try to restart a
syscall. But there are other places which rely on pt_regs::orig_ax being a
valid indicator for syscall entry.

But setting pt_regs::orig_ax to -1 has a nasty side effect vs. the
interrupt affinity setting mechanism, which was overlooked when this change
was made.

Moving interrupts on x86 happens in several steps. A new vector on a
different CPU is allocated and the relevant interrupt source is
reprogrammed to that. But that's racy and there might be an interrupt
already in flight to the old vector. So the old vector is preserved until
the first interrupt arrives on the new vector and the new target CPU. Once
that happens the old vector is cleaned up, but this cleanup still depends
on the vector number being stored in pt_regs::orig_ax, which is now -1.

That -1 makes the check for cleanup: pt_regs::orig_ax == new_vector
always false. As a consequence the interrupt is moved once, but then it
cannot be moved anymore because the cleanup of the old vector never
happens.

There would be several ways to convey the vector information to that place
in the guts of the interrupt handling, but on deeper inspection it turned
out that this check is pointless and a leftover from the old affinity model
of X86 which supported multi-CPU affinities. Under this model it was
possible that an interrupt had an old and a new vector on the same CPU, so
the vector match was required.

Under the new model the effective affinity of an interrupt is always a
single CPU from the requested affinity mask. If the affinity mask changes
then either the interrupt stays on the CPU and on the same vector when that
CPU is still in the new affinity mask or it is moved to a different CPU, but
it is never moved to a different vector on the same CPU.

Ergo the cleanup check for the matching vector number is not required and
can be removed which makes the dependency on pt_regs:orig_ax go away.

The remaining check for new_cpu == smp_processsor_id() is completely
sufficient. If it matches then the interrupt was successfully migrated and
the cleanup can proceed.

For paranoia sake add a warning into the vector assignment code to
validate that the assumption of never moving to a different vector on
the same CPU holds.

Fixes: 633260fa ("x86/irq: Convey vector as argument and not in ptregs")
Reported-by: NAlex bykov <alex.bykov@scylladb.com>
Reported-by: NAvi Kivity <avi@scylladb.com>
Reported-by: NAlexander Graf <graf@amazon.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NAlexander Graf <graf@amazon.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87wo1ltaxz.fsf@nanos.tec.linutronix.de

e027ffff

x86/hotplug: Silence APIC only after all interrupts are migrated · 52d6b926

由 Ashok Raj 提交于 8月 26, 2020

There is a race when taking a CPU offline. Current code looks like this:

native_cpu_disable()
{
	...
	apic_soft_disable();
	/*
	 * Any existing set bits for pending interrupt to
	 * this CPU are preserved and will be sent via IPI
	 * to another CPU by fixup_irqs().
	 */
	cpu_disable_common();
	{
		....
		/*
		 * Race window happens here. Once local APIC has been
		 * disabled any new interrupts from the device to
		 * the old CPU are lost
		 */
		fixup_irqs(); // Too late to capture anything in IRR.
		...
	}
}

The fix is to disable the APIC *after* cpu_disable_common().

Testing was done with a USB NIC that provided a source of frequent
interrupts. A script migrated interrupts to a specific CPU and
then took that CPU offline.

Fixes: 60dcaad5 ("x86/hotplug: Silence APIC and NMI when CPU is dead")
Reported-by: NEvan Green <evgreen@chromium.org>
Signed-off-by: NAshok Raj <ashok.raj@intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NMathias Nyman <mathias.nyman@linux.intel.com>
Tested-by: NEvan Green <evgreen@chromium.org>
Reviewed-by: NEvan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok.raj@intel.com

52d6b926

s390/vmem: fix vmem_add_range for 4-level paging · bffc2f7a

由 Vasily Gorbik 提交于 8月 21, 2020

The kernel currently crashes if 4-level paging is used. Add missing
p4d_populate for just allocated pud entry.

Fixes: 3e0d3e40 ("s390/vmem: consolidate vmem_add_range() and vmem_remove_range()")
Reviewed-by: NGerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

bffc2f7a

s390: don't trace preemption in percpu macros · 1196f12a

由 Sven Schnelle 提交于 8月 20, 2020

Since commit a21ee605 ("lockdep: Change hardirq{s_enabled,_context}
to per-cpu variables") the lockdep code itself uses percpu variables. This
leads to recursions because the percpu macros are calling preempt_enable()
which might call trace_preempt_on().
Signed-off-by: NSven Schnelle <svens@linux.ibm.com>
Reviewed-by: NVasily Gorbik <gor@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

1196f12a

26 8月, 2020 7 次提交

lockdep: Only trace IRQ edges · 044d0d6d

由 Nicholas Piggin 提交于 7月 23, 2020

Problem:

  raw_local_irq_save(); // software state on
  local_irq_save(); // software state off
  ...
  local_irq_restore(); // software state still off, because we don't enable IRQs
  raw_local_irq_restore(); // software state still off, *whoopsie*

existing instances:

 - lock_acquire()
     raw_local_irq_save()
     __lock_acquire()
       arch_spin_lock(&graph_lock)
         pv_wait() := kvm_wait() (same or worse for Xen/HyperV)
           local_irq_save()

 - trace_clock_global()
     raw_local_irq_save()
     arch_spin_lock()
       pv_wait() := kvm_wait()
	 local_irq_save()

 - apic_retrigger_irq()
     raw_local_irq_save()
     apic->send_IPI() := default_send_IPI_single_phys()
       local_irq_save()

Possible solutions:

 A) make it work by enabling the tracing inside raw_*()
 B) make it work by keeping tracing disabled inside raw_*()
 C) call it broken and clean it up now

Now, given that the only reason to use the raw_* variant is because you don't
want tracing. Therefore A) seems like a weird option (although it can be done).
C) is tempting, but OTOH it ends up converting a _lot_ of code to raw just
because there is one raw user, this strips the validation/tracing off for all
the other users.

So we pick B) and declare any code that ends up doing:

	raw_local_irq_save()
	local_irq_save()
	lockdep_assert_irqs_disabled();

broken. AFAICT this problem has existed forever, the only reason it came
up is because commit: 859d069e ("lockdep: Prepare for NMI IRQ
state tracking") changed IRQ tracing vs lockdep recursion and the
first instance is fairly common, the other cases hardly ever happen.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
[rewrote changelog]
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NMarco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200723105615.1268126-1-npiggin@gmail.com

044d0d6d

mips: Implement arch_irqs_disabled() · 99dc56fe

由 Peter Zijlstra 提交于 8月 22, 2020

Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Paul Burton <paulburton@kernel.org>
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200826101653.GE1362448@hirez.programming.kicks-ass.net

99dc56fe

arm64: Implement arch_irqs_disabled() · 021c1093

由 Peter Zijlstra 提交于 8月 21, 2020

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200821085348.664425120@infradead.org

021c1093

nds32: Implement arch_irqs_disabled() · 36206b58

由 Peter Zijlstra 提交于 8月 20, 2020

Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NMarco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.604899379@infradead.org

36206b58

x86/entry: Remove unused THUNKs · 7da93f37

由 Peter Zijlstra 提交于 8月 12, 2020

Unused remnants
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NMarco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.487040689@infradead.org

7da93f37

cpuidle: Move trace_cpu_idle() into generic code · 9864f5b5

由 Peter Zijlstra 提交于 8月 12, 2020

Remove trace_cpu_idle() from the arch_cpu_idle() implementations and
put it in the generic code, right before disabling RCU. Gets rid of
more trace_*_rcuidle() users.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NMarco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.428433395@infradead.org

9864f5b5

cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic · bf9282dc

由 Peter Zijlstra 提交于 8月 12, 2020

This allows moving the leave_mm() call into generic code before
rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NMarco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20200821085348.369441600@infradead.org

bf9282dc

24 8月, 2020 3 次提交

powerpc/64s: Disallow PROT_SAO in LPARs by default · 9b725a90

由 Shawn Anastasio 提交于 8月 21, 2020

Since migration of guests using SAO to ISA 3.1 hosts may cause issues,
disable PROT_SAO in LPARs by default and introduce a new Kconfig option
PPC_PROT_SAO_LPAR to allow users to enable it if desired.
Signed-off-by: NShawn Anastasio <shawn@anastas.io>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200821185558.35561-3-shawn@anastas.io

9b725a90

Revert "powerpc/64s: Remove PROT_SAO support" · 12564485

由 Shawn Anastasio 提交于 8月 21, 2020

This reverts commit 5c9fa16e.

Since PROT_SAO can still be useful for certain classes of software,
reintroduce it. Concerns about guest migration for LPARs using SAO
will be addressed next.
Signed-off-by: NShawn Anastasio <shawn@anastas.io>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200821185558.35561-2-shawn@anastas.io

12564485

treewide: Use fallthrough pseudo-keyword · df561f66

由 Gustavo A. R. Silva 提交于 8月 23, 2020

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-throughSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

df561f66

22 8月, 2020 3 次提交

KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set · b5331379

由 Will Deacon 提交于 8月 11, 2020

When an MMU notifier call results in unmapping a range that spans multiple
PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary,
since this avoids running into RCU stalls during VM teardown. Unfortunately,
if the VM is destroyed as a result of OOM, then blocking is not permitted
and the call to the scheduler triggers the following BUG():

 | BUG: sleeping function called from invalid context at arch/arm64/kvm/mmu.c:394
 | in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper
 | INFO: lockdep is turned off.
 | CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1
 | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
 | Call trace:
 |  dump_backtrace+0x0/0x284
 |  show_stack+0x1c/0x28
 |  dump_stack+0xf0/0x1a4
 |  ___might_sleep+0x2bc/0x2cc
 |  unmap_stage2_range+0x160/0x1ac
 |  kvm_unmap_hva_range+0x1a0/0x1c8
 |  kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8
 |  __mmu_notifier_invalidate_range_start+0x218/0x31c
 |  mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0
 |  __oom_reap_task_mm+0x128/0x268
 |  oom_reap_task+0xac/0x298
 |  oom_reaper+0x178/0x17c
 |  kthread+0x1e4/0x1fc
 |  ret_from_fork+0x10/0x30

Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we
only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier
flags.

Cc: <stable@vger.kernel.org>
Fixes: 8b3405e3 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd")
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: NWill Deacon <will@kernel.org>
Message-Id: <20200811102725.7121-3-will@kernel.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b5331379

KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() · fdfe7cbd

由 Will Deacon 提交于 8月 11, 2020

The 'flags' field of 'struct mmu_notifier_range' is used to indicate
whether invalidate_range_{start,end}() are permitted to block. In the
case of kvm_mmu_notifier_invalidate_range_start(), this field is not
forwarded on to the architecture-specific implementation of
kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
whether or not to block.

Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
architectures are aware as to whether or not they are permitted to block.

Cc: <stable@vger.kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: NWill Deacon <will@kernel.org>
Message-Id: <20200811102725.7121-2-will@kernel.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fdfe7cbd

ARM64: vdso32: Install vdso32 from vdso_install · 8d75785a

由 Stephen Boyd 提交于 8月 17, 2020

Add the 32-bit vdso Makefile to the vdso_install rule so that 'make
vdso_install' installs the 32-bit compat vdso when it is compiled.

Fixes: a7f71a2c ("arm64: compat: Add vDSO")
Signed-off-by: NStephen Boyd <swboyd@chromium.org>
Reviewed-by: NVincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: NWill Deacon <will@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lore.kernel.org/r/20200818014950.42492-1-swboyd@chromium.orgSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com>

8d75785a

21 8月, 2020 9 次提交

x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM · 6a3ea3e6

由 Sean Christopherson 提交于 8月 21, 2020

KVM has an optmization to avoid expensive MRS read/writes on
VMENTER/EXIT. It caches the MSR values and restores them either when
leaving the run loop, on preemption or when going out to user space.

The affected MSRs are not required for kernel context operations. This
changed with the recently introduced mechanism to handle FSGSBASE in the
paranoid entry code which has to retrieve the kernel GSBASE value by
accessing per CPU memory. The mechanism needs to retrieve the CPU number
and uses either LSL or RDPID if the processor supports it.

Unfortunately RDPID uses MSR_TSC_AUX which is in the list of cached and
lazily restored MSRs, which means between the point where the guest value
is written and the point of restore, MSR_TSC_AUX contains a random number.

If an NMI or any other exception which uses the paranoid entry path happens
in such a context, then RDPID returns the random guest MSR_TSC_AUX value.

As a consequence this reads from the wrong memory location to retrieve the
kernel GSBASE value. Kernel GS is used to for all regular this_cpu_*()
operations. If the GSBASE in the exception handler points to the per CPU
memory of a different CPU then this has the obvious consequences of data
corruption and crashes.

As the paranoid entry path is the only place which accesses MSR_TSX_AUX
(via RDPID) and the fallback via LSL is not significantly slower, remove
the RDPID alternative from the entry path and always use LSL.

The alternative would be to write MSR_TSC_AUX on every VMENTER and VMEXIT
which would be inflicting massive overhead on that code path.

[ tglx: Rewrote changelog ]

Fixes: eaad9812 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Reported-by: NTom Lendacky <thomas.lendacky@amd.com>
Debugged-by: NTom Lendacky <thomas.lendacky@amd.com>
Suggested-by: NAndy Lutomirski <luto@kernel.org>
Suggested-by: NPeter Zijlstra <peterz@infradead.org>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200821105229.18938-1-pbonzini@redhat.com

6a3ea3e6

powerpc/perf/hv-24x7: Move cpumask file to top folder of hv-24x7 driver · 64ef8f2c

由 Kajol Jain 提交于 8月 21, 2020

Commit 792f73f7 ("powerpc/hv-24x7: Add sysfs files inside hv-24x7
device to show cpumask") added cpumask file as part of hv-24x7 driver
inside the interface folder. The cpumask file is supposed to be in the
top folder of the PMU driver in order to make hotplug work.

This patch fixes that issue and creates new group 'cpumask_attr_group'
to add cpumask file and make sure it added in top folder.

  command:# cat /sys/devices/hv_24x7/cpumask
  0

Fixes: 792f73f7 ("powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask")
Signed-off-by: NKajol Jain <kjain@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200821080610.123997-1-kjain@linux.ibm.com

64ef8f2c

powerpc/32s: Fix module loading failure when VMALLOC_END is over 0xf0000000 · 541cebb5

由 Christophe Leroy 提交于 8月 21, 2020

In is_module_segment(), when VMALLOC_END is over 0xf0000000,
ALIGN(VMALLOC_END, SZ_256M) has value 0.

In that case, addr >= ALIGN(VMALLOC_END, SZ_256M) is always
true then is_module_segment() always returns false.

Use (ALIGN(VMALLOC_END, SZ_256M) - 1) which will have
value 0xffffffff and will be suitable for the comparison.

Fixes: c4964331 ("powerpc/32s: Only leave NX unset on segments used for modules")
Reported-by: NAndreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NChristophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: NAndreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/09fc73fe9c7423c6b4cf93f93df9bb0ed8eefab5.1597994047.git.christophe.leroy@csgroup.eu

541cebb5

KVM: arm64: Print warning when cpu erratum can cause guests to deadlock · abf532cc

由 Rob Herring 提交于 8月 03, 2020

If guests don't have certain CPU erratum workarounds implemented, then
there is a possibility a guest can deadlock the system. IOW, only trusted
guests should be used on systems with the erratum.

This is the case for Cortex-A57 erratum 832075.
Signed-off-by: NRob Herring <robh@kernel.org>
Acked-by: NWill Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: kvmarm@lists.cs.columbia.edu
Link: https://lore.kernel.org/r/20200803193127.3012242-2-robh@kernel.orgSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com>

abf532cc

arm64: Allow booting of late CPUs affected by erratum 1418040 · bf87bb08

由 Marc Zyngier 提交于 7月 31, 2020

As we can now switch from a system that isn't affected by 1418040
to a system that globally is affected, let's allow affected CPUs
to come in at a later time.
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Tested-by: NSai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Reviewed-by: NStephen Boyd <swboyd@chromium.org>
Reviewed-by: NSuzuki K Poulose <suzuki.poulose@arm.com>
Acked-by: NWill Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200731173824.107480-3-maz@kernel.orgSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com>

bf87bb08

arm64: Move handling of erratum 1418040 into C code · d49f7d73

由 Marc Zyngier 提交于 7月 31, 2020

Instead of dealing with erratum 1418040 on each entry and exit,
let's move the handling to __switch_to() instead, which has
several advantages:

- It can be applied when it matters (switching between 32 and 64
  bit tasks).
- It is written in C (yay!)
- It can rely on static keys rather than alternatives
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Tested-by: NSai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Reviewed-by: NStephen Boyd <swboyd@chromium.org>
Acked-by: NWill Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200731173824.107480-2-maz@kernel.orgSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com>

d49f7d73

riscv: Add SiFive drivers to rv32_defconfig · fc26f5bb

由 Bin Meng 提交于 7月 15, 2020

This adds SiFive drivers to rv32_defconfig, to keep in sync with the
64-bit config. This is useful when testing 32-bit kernel with QEMU
'sifive_u' 32-bit machine.
Signed-off-by: NBin Meng <bin.meng@windriver.com>
Reviewed-by: NAnup Patel <anup@brainfault.org>
Reviewed-by: NAlistair Francis <alistair.francis@wdc.com>
Signed-off-by: NPalmer Dabbelt <palmerdabbelt@google.com>

fc26f5bb

RISC-V: Remove CLINT related code from timer and arch · 2bc3fc87

由 Anup Patel 提交于 8月 17, 2020

Right now the RISC-V timer driver is convoluted to support:
1. Linux RISC-V S-mode (with MMU) where it will use TIME CSR for
   clocksource and SBI timer calls for clockevent device.
2. Linux RISC-V M-mode (without MMU) where it will use CLINT MMIO
   counter register for clocksource and CLINT MMIO compare register
   for clockevent device.

We now have a separate CLINT timer driver which also provide CLINT
based IPI operations so let's remove CLINT MMIO related code from
arch/riscv directory and RISC-V timer driver.
Signed-off-by: NAnup Patel <anup.patel@wdc.com>
Tested-by: NEmil Renner Berhing <kernel@esmil.dk>
Acked-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: NAtish Patra <atish.patra@wdc.com>
Reviewed-by: NPalmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: NPalmer Dabbelt <palmerdabbelt@google.com>

2bc3fc87

RISC-V: Add mechanism to provide custom IPI operations · cc7f3f72

由 Anup Patel 提交于 8月 17, 2020

We add mechanism to set custom IPI operations so that CLINT driver
from drivers directory can provide custom IPI operations.
Signed-off-by: NAnup Patel <anup.patel@wdc.com>
Tested-by: NEmil Renner Berhing <kernel@esmil.dk>
Reviewed-by: NAtish Patra <atish.patra@wdc.com>
Reviewed-by: NPalmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: NPalmer Dabbelt <palmerdabbelt@google.com>

cc7f3f72

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功