Commits · b1ee8a3de5790777f325416ad97340428d8ae25f · openeuler / Kernel

28 Apr, 2017 5 commits

powerpc/64s: Dedicated system reset interrupt stack · b1ee8a3d

Committed by Nicholas Piggin on Dec 20, 2016

The system reset interrupt is used for crash/debug situations, so it is
desirable to have as little impact on the normal state of the system as
possible.

Currently it uses the current kernel stack to process the exception.
This stores into the stack which may be involved with the crash. The
stack pointer may be corrupted, or it may have overflowed.

Avoid or minimise these problems by creating a dedicated NMI stack for
the system reset interrupt to use.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Disallow system reset vs system reset reentrancy · c4f3b52c

Committed by Nicholas Piggin on Dec 20, 2016

In preparation for using a dedicated stack for system reset interrupts,
prevent a nested system reset from recovering, in order to simplify
code that is called in crash/debug path. This allows a system reset
interrupt to just use the base stack pointer.

Keep an in_nmi nesting counter similarly to the in_mce counter. Consider
the interrupt non-recoverable if it is taken inside another system
reset.

Interrupt nesting could be allowed similarly to MCE, but system reset
is a special case that's not for normal operation, so simplicity wins
until there is a requirement for nested system reset interrupts.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
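
The nesting rule described in this commit can be illustrated with a minimal userspace sketch. It is not the kernel code: struct cpu_state, nmi_enter_recoverable() and nmi_exit() are hypothetical names, and only the in_nmi counter comes from the commit message. A system reset taken while the counter is already non-zero is reported as non-recoverable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-CPU state; only the counter matters here. */
struct cpu_state {
    unsigned int in_nmi;   /* nesting depth of system reset interrupts */
};

/*
 * Called on entry to the system reset handler. Returns true if the
 * interrupt may be treated as recoverable, false if it arrived while
 * another system reset was already being handled.
 */
static bool nmi_enter_recoverable(struct cpu_state *cpu)
{
    bool recoverable = (cpu->in_nmi == 0);

    cpu->in_nmi++;
    return recoverable;
}

/* Called on exit from the handler; undoes the increment made on entry. */
static void nmi_exit(struct cpu_state *cpu)
{
    assert(cpu->in_nmi > 0);
    cpu->in_nmi--;
}
```

Because a nested system reset never returns to the interrupted handler in this scheme, the handler can always start from the base of its stack, which is the simplification the commit message is after.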

powerpc/64s: Fix system reset vs general interrupt reentrancy · a3d96f70

Committed by Nicholas Piggin on Dec 20, 2016

The system reset interrupt can occur when MSR_EE=0, and it currently
uses the PACA_EXGEN save area.

Some PACA_EXGEN interrupts have a window where MSR_RI=1 and MSR_EE=0
when the save area is still in use. A system reset interrupt in this
window can lead to undetected corruption when the save area gets
overwritten.

This patch introduces PACA_EXNMI save area for system reset exceptions,
which closes this corruption window. It's also helpful to retain the
EXGEN state for debugging situations, even if not considering the
recoverability aspect.

This patch also moves the PACA_EXMC area down to a less frequently used
part of the paca with the new save area.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Exception macro for stack frame and initial register save · a4087a4d

Committed by Nicholas Piggin on Dec 20, 2016

This code is common to a few exceptions, and another user will be added.
This causes a trivial change to generated code:

-     604: std     r9,416(r1)
-     608: mfspr   r11,314
-     60c: std     r11,368(r1)
-     610: mfspr   r12,315
+     604: mfspr   r11,314
+     608: mfspr   r12,315
+     60c: std     r9,416(r1)
+     610: std     r11,368(r1)

machine_check_powernv_early could also use this, but that requires non
trivial changes to generated code, so that's for another patch.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Add exception macro that does not enable RI · 83a980f7

Committed by Nicholas Piggin on Dec 20, 2016

Subsequent patches will add more non-RI variant exceptions, so
create a macro for it rather than open-code it.

This does not change generated instructions.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

27 Apr, 2017 1 commit

powerpc/mm: Fix missing page attributes in page table dump · fd893fe5

Committed by Christophe Leroy on Apr 14, 2017

On some targets, _PAGE_RW is 0 and _PAGE_RO is used instead.
_PAGE_SHARED is also missing from the dump.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

24 Apr, 2017 3 commits

powerpc/mm: Ensure IRQs are off in switch_mm() · 9765ad13

Committed by David Gibson on Apr 19, 2017

powerpc expects IRQs to already be (soft) disabled when switch_mm() is
called, as made clear in the commit message of 9c1e1052 ("powerpc: Allow
perf_counters to access user memory at interrupt time").

Aside from any race conditions that might exist between switch_mm() and an IRQ,
there is also an unconditional hard_irq_disable() in switch_slb(). If that isn't
followed at some point by an IRQ enable then interrupts will remain disabled
until we return to userspace.

It is true that when switch_mm() is called from the scheduler IRQs are off, but
not when it's called by use_mm(). Looking closer we see that last year in commit
f98db601 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
this was made more explicit by the addition of switch_mm_irqs_off() which is now
called by the scheduler, vs switch_mm() which is used by use_mm().

Arguably it is a bug in use_mm() to call switch_mm() in a different context than
it expects, but fixing that will take time.

This was discovered recently when vhost started throwing warnings such as:

  BUG: sleeping function called from invalid context at kernel/mutex.c:578
  in_atomic(): 0, irqs_disabled(): 1, pid: 10768, name: vhost-10760
  no locks held by vhost-10760/10768.
  irq event stamp: 10
  hardirqs last  enabled at (9):  _raw_spin_unlock_irq+0x40/0x80
  hardirqs last disabled at (10): switch_slb+0x2e4/0x490
  softirqs last  enabled at (0):  copy_process+0x5e8/0x1260
  softirqs last disabled at (0):  (null)
  Call Trace:
    show_stack+0x88/0x390 (unreliable)
    dump_stack+0x30/0x44
    __might_sleep+0x1c4/0x2d0
    mutex_lock_nested+0x74/0x5c0
    cgroup_attach_task_all+0x5c/0x180
    vhost_attach_cgroups_work+0x58/0x80 [vhost]
    vhost_worker+0x24c/0x3d0 [vhost]
    kthread+0xec/0x100
    ret_from_kernel_thread+0x5c/0xd4

Prior to commit 04b96e55 ("vhost: lockless enqueuing") (Aug 2016) the
vhost_worker() would do a spin_unlock_irq() not long after calling use_mm(),
which had the effect of reenabling IRQs. Since that commit removed the locking
in vhost_worker() the body of the vhost_worker() loop now runs with interrupts
off causing the warnings.

This patch addresses the problem by making the powerpc code mirror the x86 code,
ie. we disable interrupts in switch_mm(), and optimise the scheduler case by
defining switch_mm_irqs_off().

Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Flesh out/rewrite change log, add stable]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
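
The split between switch_mm() and switch_mm_irqs_off() described above can be sketched in userspace as follows. This is only an illustration of the wrapper pattern: the real functions take mm and task arguments, which are omitted here, and irqs_enabled, irq_save() and irq_restore() are hypothetical stand-ins for the kernel's interrupt state and its save/restore helpers.

```c
#include <stdbool.h>

/* Userspace stand-ins (assumptions) for the CPU's interrupt state. */
static bool irqs_enabled = true;
static int switches_with_irqs_on;   /* counts calls that broke the contract */

/* Disable "interrupts" and return the previous state. */
static unsigned long irq_save(void)
{
    unsigned long flags = irqs_enabled;

    irqs_enabled = false;
    return flags;
}

/* Put the "interrupt" state back to what irq_save() observed. */
static void irq_restore(unsigned long flags)
{
    irqs_enabled = (flags != 0);
}

/* Fast path: the caller (e.g. the scheduler) guarantees IRQs are off. */
static void switch_mm_irqs_off(void)
{
    if (irqs_enabled)
        switches_with_irqs_on++;
    /* ... the actual context switch work would go here ... */
}

/* General entry point: callable with IRQs on, and leaves them as found. */
static void switch_mm(void)
{
    unsigned long flags = irq_save();

    switch_mm_irqs_off();
    irq_restore(flags);
}
```

A caller such as use_mm() goes through switch_mm() and gets its interrupt state back afterwards, while the scheduler, which already runs with IRQs off, can call switch_mm_irqs_off() directly and skip the save/restore.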

powerpc: Introduce a new helper to obtain function entry points · 1b32cd17

Committed by Naveen N. Rao on Apr 19, 2017

kprobe_lookup_name() is specific to the kprobe subsystem and may not always
return the function entry point (in a subsequent patch for KPROBES_ON_FTRACE).
For looking up function entry points, introduce a separate helper and use it
in optprobes.c.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/kprobes: Add support for KPROBES_ON_FTRACE · ead514d5

Committed by Naveen N. Rao on Apr 19, 2017

Allow kprobes to be placed on ftrace _mcount() call sites. This optimization
avoids the use of a trap, by riding on ftrace infrastructure.

This depends on HAVE_DYNAMIC_FTRACE_WITH_REGS which depends on MPROFILE_KERNEL,
which is only currently enabled on powerpc64le with newer toolchains.

Based on the x86 code by Masami.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

23 Apr, 2017 8 commits

powerpc/kprobes: Blacklist common exception handlers · 9a914aa6

Committed by Naveen N. Rao on Apr 19, 2017

Blacklist all the exception common/OOL handlers as the kernel stack is not yet
set up, which means we can't take a trap at this point.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/kprobes: Blacklist exception handlers · 7aa5b018

Committed by Naveen N. Rao on Apr 19, 2017

Introduce __head_end to mark end of the early fixed sections and use it to
blacklist all exception handlers from kprobes.

mpe: We do not need to do anything special for relocatable kernels, where the
exception vectors are split from the main kernel, as the split vectors are
already excluded by the check for kernel_text_address().

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
[mpe: Move __head_end outside #ifdef 64-bit to unbreak the 32-bit build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Simplify POWER9 DD1 idle workaround code · 9cba253d

Committed by Nicholas Piggin on Apr 19, 2017

The idle workaround does not need to load PACATOC, and it does not
need to be called within a nested function that requires LR to be
saved.

Load the PACATOC at entry to the idle wakeup. It does not matter which
PACA this comes from, so it's okay to call before the workaround. Then
apply the workaround to get the right PACA.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Idle POWER8 avoid full state loss recovery where possible · 0d7720a2

Committed by Nicholas Piggin on Apr 19, 2017

If not all threads were in winkle, full state loss recovery is not
necessary and can be avoided. A previous patch removed this optimisation
due to some complexity with the implementation. Re-implement it by
counting the number of threads in winkle with the per-core idle state.
Only restore full state loss if all threads were in winkle.

This has a small window of false positives right before threads execute
winkle and just after they wake up, when the winkle count does not
reflect the true number of threads in winkle. This is not a significant
problem in comparison with even the minimum winkle duration. For
correctness, a false positive is not a problem (only false negatives
would be).

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
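
A rough sketch of the counting idea follows. It is single-threaded userspace C with hypothetical names (struct core_idle, winkle_enter(), winkle_exit_needs_full_restore()) and an assumed four threads per core. The per-thread "state lost" bitmap follows the suggestion made in commit 544686ca's message; the kernel packs this information into the per-core idle state word and updates it atomically, which this sketch does not attempt.

```c
#include <stdbool.h>

#define THREADS_PER_CORE 4u   /* assumption for illustration */

/* Hypothetical per-core record, not the kernel's packed idle state word. */
struct core_idle {
    unsigned int winkle_count;     /* threads currently counted as in winkle */
    unsigned int state_lost_mask;  /* one bit per thread that must fully restore */
};

/* A thread bumps the count just before it would execute winkle. */
static void winkle_enter(struct core_idle *core)
{
    core->winkle_count++;
    /* Only when every thread is counted can the core have lost full state. */
    if (core->winkle_count == THREADS_PER_CORE)
        core->state_lost_mask = (1u << THREADS_PER_CORE) - 1;
}

/*
 * On wakeup, report whether this thread needs full state loss recovery,
 * then drop it from the count and clear its bit.
 */
static bool winkle_exit_needs_full_restore(struct core_idle *core,
                                           unsigned int thread)
{
    bool lost = ((core->state_lost_mask >> thread) & 1u) != 0;

    core->winkle_count--;
    core->state_lost_mask &= ~(1u << thread);
    return lost;
}
```

Since the count is raised before winkle executes and lowered only after wakeup, it can overstate the number of threads actually in winkle. That yields the false positives mentioned above (an unnecessary restore) but never a false negative.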

powerpc/64s: Expand core idle state bits · adbcf8d7

Committed by Nicholas Piggin on Apr 19, 2017

In preparation for adding more bits to the core idle state word, move
the lock bit up, and unlock by flipping the lock bit rather than masking
off all but the thread bits.

Add branch hints for atomic operations while we're here.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
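
The difference between the two unlock styles can be shown with plain bit operations. The bit positions below are illustrative assumptions, not the kernel's actual layout, and the kernel performs these updates in assembly with atomic sequences rather than through C helpers like these.

```c
#include <stdint.h>

/* Illustrative layout (assumption): low 8 bits are thread bits, lock bit is high. */
#define THREAD_BITS 0x000000ffu
#define LOCK_BIT    0x10000000u

/* Old style: keep only the thread bits, which also wipes any other field. */
static uint32_t unlock_by_mask(uint32_t state)
{
    return state & THREAD_BITS;
}

/* New style: flip just the lock bit; the caller must currently hold the lock. */
static uint32_t unlock_by_flip(uint32_t state)
{
    return state ^ LOCK_BIT;
}
```

With a state word of 0x1003000f (lock held, some extra field set, four thread bits set), unlock_by_mask() returns 0x0000000f and loses the extra field, while unlock_by_flip() returns 0x0003000f and preserves it. That preservation is what leaves room for more bits in the word.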

powerpc/64s: Fix POWER9 machine check handler from stop state · 1945bc45

Committed by Nicholas Piggin on Apr 19, 2017

The ISA specifies power save wakeup due to a machine check exception can
cause a machine check interrupt (rather than the usual system reset
interrupt).

The machine check handler copes with this by doing low level machine
check recovery without restoring full state from idle, then queues up a
machine check event for logging, then directly executes the same idle
instruction it woke from. This minimises the work done before recovery
is performed.

The problem is that it requires machine specific instructions and
knowledge of the book3s idle code. Currently it only has code to handle
POWER8 idle, so POWER9 crashes when trying to execute the P8 idle
instructions which don't exist in ISAv3.0B.

cpu 0x0: Vector: e40 (Emulation Assist) at [c0000000008f3810]
    pc: c000000000008380: machine_check_handle_early+0x130/0x2f0
    lr: c00000000053a098: stop_loop+0x68/0xd0
    sp: c0000000008f3a90
   msr: 9000000000081001
  current = 0xc0000000008a1080
  paca    = 0xc00000000ffd0000   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/0

Instead of going to sleep after recovery, perform the usual idle wakeup
and state restoration by reusing the normal idle wakeup path.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Stop using bit in HSPRG0 to test winkle · 544686ca

Committed by Nicholas Piggin on Apr 19, 2017

The POWER8 idle code has a neat trick of programming the power on engine
to restore a low bit into HSPRG0, so idle wakeup code can test and see
if it has been programmed this way and therefore lost all state. Restore
time can be reduced if winkle has not been reached.

However this messes with our r13 PACA pointer, and requires HSPRG0 to be
written to. It also optimizes the slowest and most uncommon case at the
expense of another SPR write in the common nap state wakeup.

Remove this complexity and assume winkle sleeps always require a state
restore. This speedup could be made entirely contained within the winkle
idle code by counting per-core winkles and setting a thread bitmap when
all have gone to winkle.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Remove unnecessary relocation branch from idle handler · 2563a70c

Committed by Nicholas Piggin on Apr 19, 2017

The system reset idle handler system_reset_idle_common is relocated, so
relocation is not required to branch to kvm_start_guest. The superfluous
relocation does not result in incorrect code, but it does not compile
outside of exception-64s.S (with fixed section definitions).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

21 Apr, 2017 1 commit

powerpc/mm: Wire up ioremap_cache() · f855b2f5

Committed by Oliver O'Halloran on Apr 12, 2017

The default implementation of ioremap_cache() is aliased to ioremap().
On powerpc ioremap() creates cache-inhibited mappings by default which
is almost certainly not what you wanted.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

20 Apr, 2017 2 commits

kprobes: Convert kprobe_lookup_name() to a function · 49e0b465

Committed by Naveen N. Rao on Apr 19, 2017

The macro is now pretty long and ugly on powerpc. In the light of further
changes needed here, convert it to a __weak variant to be over-ridden with a
nicer looking function.

Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Use relon prolog for EXC_VIRT_OOL_MASKABLE_HV handlers · a050d20d

Committed by Nicholas Piggin on Apr 13, 2017

Hypervisor Virtualization and Directed Hypervisor Doorbell interrupt handlers
use the macro EXC_VIRT_OOL_MASKABLE_HV for their relocation-on handlers, which
calls MASKABLE_RELON_EXCEPTION_HV_OOL, which uses the *real mode* interrupt
prolog. This means we needlessly rfid from virtual mode to virtual mode.

For POWER8 it only affects doorbell IPIs. Context switch microbenchmark between
threads with snooze disabled (which causes IPI) gets about 3% faster, about 370
cycles. Should be more important on POWER9 with global doorbells and HVI for
host interrupts.

Use the RELON variant instead to reduce overhead.

Fixes: 1707dd16 ("powerpc: Save CFAR before branching in interrupt entry paths")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fold some more detail into the change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

19 Apr, 2017 4 commits

powerpc/64s: Remove SAO feature from Power9 DD1 · ca80d5d0

Committed by Nicholas Piggin on Apr 19, 2017

Power9 DD1 does not implement SAO. Although it's not widely used, its presence
or absence is visible to user space via arch_validate_prot() so it's moderately
important that we get the value right.

Fixes: 7dccfbc3 ("powerpc/book3s: Add a cpu table entry for different POWER9 revs")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Remove ICSWX feature from Power9 · 2384d2d7

Committed by Nicholas Piggin on Apr 19, 2017

Power9 does not implement the icswx instruction. This CPU feature is not visible
to userspace and is only used in the CONFIG_PPC_ICSWX code, which is generally
not enabled, and can only be triggered by other code using icswx, which should
not happen on Power9 systems in the first place. So impact should be minimal.

Fixes: c3ab300e ("powerpc: Add POWER9 cputable entry")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/perf: Support to export MMCRA[TEC*] field to userspace · 170a315f

Committed by Madhavan Srinivasan on Apr 11, 2017

When the threshold feature is used with MMCRA[Threshold Event Counter
Event], MMCRA[Threshold Start Event] and MMCRA[Threshold End Event], it
updates MMCRA[Threshold Event Counter Exponent] and MMCRA[Threshold Event
Counter Multiplier] with the corresponding threshold event count values.
This patch exports MMCRA[TECX/TECM] to userspace in the 'weight' field of
struct perf_sample_data.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/perf: Export memory hierarchy info to user space · 79e96f8f

Committed by Madhavan Srinivasan on Apr 11, 2017

The LDST and DATA_SRC fields in SIER identify the memory hierarchy level
(e.g. L1, L2, etc.) from which a data-cache miss for a marked instruction
was satisfied. Use the 'perf_mem_data_src' object to export this
hierarchy level to user space.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

13 Apr, 2017 6 commits

powerpc: Drop include of linux/io.h from asm/io.h · 590c369e

Committed by Michael Ellerman on Apr 13, 2017

Currently powerpc's asm/io.h includes linux/io.h, and linux/io.h
includes asm/io.h.

This can cause problems because depending on which is included first the
order of definitions between the two files will change.

The include of linux/io.h was added back in 2008 in commit b41e5fff
("[POWERPC] devres: Add devm_ioremap_prot()"). It's not entirely clear
it was needed then, but devm_ioremap_prot() has since been removed
entirely as unused, in dedd24a1 ("powerpc: Remove unused
devm_ioremap_prot()").

The include therefore seems to be unnecessary and can potentially cause
problems, so remove the include of linux/io.h from asm/io.h.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/powernv: POWER9 support for msgsnd/doorbell IPI · 6b3edefe

Committed by Nicholas Piggin on Apr 13, 2017

POWER9 requires msgsync for receiver-side synchronization, and a DD1
workaround restricts IPIs to core-local.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Drop no longer needed asm feature macro changes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Avoid a branch for ppc_msgsnd · a5adf282

Committed by Nicholas Piggin on Apr 13, 2017

IPIs are a pretty hot path and we already have the ability to do asm feature
patching, so use it.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Change log detail]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc: Introduce msgsnd/doorbell barrier primitives · b87ac021

Committed by Nicholas Piggin on Apr 13, 2017

POWER9 changes requirements and adds new instructions for
synchronization.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc: Change the doorbell IPI calling convention · b866cc21

Committed by Nicholas Piggin on Apr 13, 2017

Change the doorbell callers to know about their msgsnd addressing,
rather than have them set a per-cpu target data tag at boot that gets
sent to the cause_ipi functions. The data is only used for doorbell IPI
functions, no other IPI types, so it makes sense to keep that detail
local to doorbell.

Have the platform code understand doorbell IPIs, rather than the
interrupt controller code understand them. Platform code can look at
capabilities it has available and decide which to use.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Add SCV FSCR bit for ISA v3.0 · 9b7ff0c6

Committed by Nicholas Piggin on Apr 07, 2017

Add the bit definition and use it in facility_unavailable_exception() so we can
intelligently report the cause if we take a fault for SCV. This doesn't actually
enable SCV.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Drop whitespace changes to the existing entries, flesh out change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

12 Apr, 2017 2 commits

powerpc/tracing: Allow tracing of mmap syscalls · 9c355917

Committed by Balbir Singh on Apr 12, 2017

Currently sys_mmap() and sys_mmap2() (32-bit only), are not visible to the
syscall tracing machinery. This means users are not able to see the execution of
mmap() syscalls using the syscall tracer.

Fix that by using SYSCALL_DEFINE6 for sys_mmap() and sys_mmap2() so that the
meta-data associated with these syscalls is visible to the syscall tracer.

A side-effect of this change is that the return type has changed from unsigned
long to long. However this should have no effect, the only code in the kernel
which uses the result of these syscalls is in the syscall return path, which is
written in asm and treats the result as unsigned regardless.

Example output:
  cat-3399  [001] ....   196.542410: sys_mmap(addr: 7fff922a0000, len: 20000, prot: 3, flags: 812, fd: 3, offset: 1b0000)
  cat-3399  [001] ....   196.542443: sys_mmap -> 0x7fff922a0000
  cat-3399  [001] ....   196.542668: sys_munmap(addr: 7fff922c0000, len: 6d2c)
  cat-3399  [001] ....   196.542677: sys_munmap -> 0x0

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Massage change log, add detail on return type change]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/mm: Fix swapper_pg_dir size on 64-bit hash w/64K pages · 03dfee6d

Committed by Michael Ellerman on Apr 12, 2017

Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB"),
we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This
makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to
calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong.

The end result is that the PGD (Page Global Directory, ie top level page table)
of the kernel (aka. swapper_pg_dir), is too small.

This generally doesn't lead to a crash, as we don't use the full range in normal
operation. However if we try to dump the kernel pagetables we can trigger a
crash because we walk off the end of the pgd into other memory and eventually
try to dereference something bogus:

  $ cat /sys/kernel/debug/kernel_pagetables
  Unable to handle kernel paging request for data at address 0xe8fece0000000000
  Faulting instruction address: 0xc000000000072314
  cpu 0xc: Vector: 380 (Data SLB Access) at [c0000000daa13890]
      pc: c000000000072314: ptdump_show+0x164/0x430
      lr: c000000000072550: ptdump_show+0x3a0/0x430
     dar: e802cf0000000000
  seq_read+0xf8/0x560
  full_proxy_read+0x84/0xc0
  __vfs_read+0x6c/0x1d0
  vfs_read+0xbc/0x1b0
  SyS_read+0x6c/0x110
  system_call+0x38/0xfc

The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to be
the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that move
the calculation into asm-offsets.c where we can do it easily using
max().

Fixes: f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
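
The arithmetic behind the undersized PGD can be checked with a small sketch. The two index sizes come from the commit message; the MAX_OF() macro and pgd_entries() helper are illustrative names, not the kernel's asm-offsets.c code.

```c
#include <stddef.h>

#define H_PGD_INDEX_SIZE     15   /* hash MMU with 64K pages (from the text) */
#define RADIX_PGD_INDEX_SIZE 13   /* radix MMU (from the text) */

/* Illustrative max helper; the kernel uses its own max() macro. */
#define MAX_OF(a, b) ((a) > (b) ? (a) : (b))
#define MAX_PGD_INDEX_SIZE MAX_OF(H_PGD_INDEX_SIZE, RADIX_PGD_INDEX_SIZE)

/* Number of entries in a PGD with the given index size. */
static size_t pgd_entries(unsigned int index_size)
{
    return (size_t)1 << index_size;
}
```

Here pgd_entries(MAX_PGD_INDEX_SIZE) is 32768, whereas a table sized for the radix value would hold only 8192 entries. A walker that assumes the hash layout would then run off the end of a table that is four times too small, which matches the ptdump crash shown above.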

11 Apr, 2017 3 commits

powerpc/powernv: Recover correct PACA on wakeup from a stop on P9 DD1 · 17ed4c8f

Committed by Gautham R. Shenoy on Mar 22, 2017

POWER9 DD1.0 hardware has a bug where the SPRs of a thread waking up
from stop 0,1,2 with ESL=1 can end up being misplaced in the core. Thus
the HSPRG0 of a thread waking up from stop can contain the paca pointer
of its sibling.

This patch implements a context recovery framework within threads of a
core, by provisioning space in paca_struct for saving every sibling
thread's paca pointer. Basically, we should be able to arrive at the
right paca pointer from any of the threads' existing paca pointers.

At bootup, during powernv idle-init, we save the paca address of every
CPU in each of its siblings' paca_struct, in the slot corresponding to
this CPU's index in the core.

On wakeup from a stop, the thread will determine its index in the core
from the TIR register and recover its PACA pointer by indexing into
the correct slot in the provisioned space in the current PACA.

Furthermore, ensure that the NVGPRs are restored from the stack on the
way out by setting the NAPSTATELOST in paca.

[Changelog written with inputs from svaidy@linux.vnet.ibm.com]
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Call it a bug]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
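
The recovery scheme can be sketched with a reduced data structure. This is an illustration only: struct paca here is a hypothetical, stripped-down stand-in for paca_struct, the field and function names are invented for the example, and four threads per core is an assumption.

```c
#include <stddef.h>

#define THREADS_PER_CORE 4   /* assumption for illustration */

/* Hypothetical, heavily reduced stand-in for paca_struct. */
struct paca {
    int cpu;                                       /* logical CPU number */
    struct paca *sibling_pacas[THREADS_PER_CORE];  /* indexed by thread index in core */
};

/* Boot time: record every thread's paca in each sibling's slot table. */
static void init_sibling_pacas(struct paca core[THREADS_PER_CORE])
{
    for (size_t i = 0; i < THREADS_PER_CORE; i++)
        for (size_t j = 0; j < THREADS_PER_CORE; j++)
            core[i].sibling_pacas[j] = &core[j];
}

/*
 * Wakeup: `found` is whatever paca pointer the thread ended up with
 * (possibly a sibling's); `tir` is the thread's own index in the core.
 */
static struct paca *recover_paca(struct paca *found, unsigned int tir)
{
    return found->sibling_pacas[tir];
}
```

Because every paca in the core carries the full slot table, the lookup gives the right answer no matter which sibling's pointer the waking thread happened to find.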

powerpc/powernv: Move CPU-Offline idle state invocation from smp.c to idle.c · a7cd88da

Committed by Gautham R. Shenoy on Mar 22, 2017

Move the piece of code in powernv/smp.c::pnv_smp_cpu_kill_self() which
transitions the CPU to the deepest available platform idle state to a
new function named pnv_cpu_offline() in powernv/idle.c. The rationale
behind this code movement is that the data required to determine the
deepest available platform state resides in powernv/idle.c.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there · 7644d581

Committed by Michael Ellerman on Feb 10, 2017

powerpc_debugfs_root is the dentry representing the root of the
"powerpc" directory tree in debugfs.

Currently it sits in asm/debug.h, along with some other things that
have "debug" in the name, but are otherwise unrelated.

Pull it out into a separate header, which also includes linux/debugfs.h,
and convert all the users to include debugfs.h instead of debug.h.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

10 Apr, 2017 5 commits

powerpc: Fixup LPCR:PECE and HEIC setting on POWER9 · 08a1e650

Committed by Benjamin Herrenschmidt on Apr 05, 2017

We need to set LPES in order for normal external interrupts (0x500)
to be directed to the guest while running in guest state.

We also need HEIC set to prevent them from being sent to the host while
in host state.

With XIVE the host never gets one of these and wouldn't know how to
handle it. All host external interrupts come in via the new
hypervisor virtualization interrupts vector.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc: Consolidate variants of real-mode MMIOs · d381d7ca

Committed by Benjamin Herrenschmidt on Apr 05, 2017

We have all sorts of variants of MMIO accessors for the real mode
instructions. This creates a clean set of accessors based on
Linux normal naming conventions, replacing all occurrences of
the old ones in the tree.

I have purposefully removed the "out/in" variants in favor of
only including __raw variants. Any code using these is already
pretty much hand tuned to operate in a very specific environment.
I've fixed up the 2 users (only one of them actually needed
a barrier in the first place).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/kvm: Remove obsolete kvm_vm_ioctl_xics_irq declaration · f50d6bd3

Committed by Benjamin Herrenschmidt on Apr 05, 2017

The function doesn't exist anymore.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/kvm: Make kvmppc_xics_create_icp static · 936774cd

Committed by Benjamin Herrenschmidt on Apr 05, 2017

It's only used within the file in which it's defined.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/xive: Native exploitation of the XIVE interrupt controller · 243e2511

Committed by Benjamin Herrenschmidt on Apr 05, 2017

The XIVE interrupt controller is the new interrupt controller
found in POWER9. It supports advanced virtualization capabilities
among other things.

Currently we use a set of firmware calls that simulate the old
"XICS" interrupt controller but this is fairly inefficient.

This adds the framework for using XIVE along with a native
backend which uses OPAL for configuration. Later, a backend allowing
the use in a KVM or PowerVM guest will also be provided.

This disables some fast path for interrupts in KVM when XIVE is
enabled as these rely on the firmware emulation code which is no
longer available when the XIVE is used natively by Linux.

A later patch will make KVM also directly exploit the XIVE, thus
recovering the lost performance (and more).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Fixup pr_xxx("XIVE:"...), don't split pr_xxx() strings,
 tweak Kconfig so XIVE_NATIVE selects XIVE and depends on POWERNV,
 fix build errors when SMP=n, fold in fixes from Ben:
   Don't call cpu_online() on an invalid CPU number
   Fix irq target selection returning out of bounds cpu#
   Extra sanity checks on cpu numbers
 ]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
