Commits · 8e3f1b1d8255105f31556aacf8aeb6071b00d469 · openanolis / cloud-kernel

27 Jun, 2017 8 commits

powerpc/powernv/pci: Enable 64-bit devices to access >4GB DMA space · 8e3f1b1d

Committed by Russell Currey on Jun 21, 2017

On PHB3/POWER8 systems, devices can select between two different sections
of address space, TVE#0 and TVE#1. TVE#0 is intended for 32bit devices
that aren't capable of addressing more than 4GB. Selecting TVE#1 instead,
with the capability of addressing over 4GB, is performed by setting bit 59
of a PCI address.

However, some devices aren't capable of addressing at least 59 bits, but
still want more than 4GB of DMA space. In order to enable this, reconfigure
TVE#0 to be suitable for 64-bit devices by allocating memory past the
initial 4GB that is inaccessible by 64-bit DMAs.

This bypass mode is only enabled if a device requests 4GB or more of DMA
address space, if the system has PHB3 (POWER8 systems), and if the device
does not share a PE with any devices from different vendors.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
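As a rough illustration of the three conditions listed in the commit above, the decision can be sketched in C. Every name here (`struct dma_request`, `enum phb_model`, `wants_tve0_bypass`) is hypothetical and not taken from the patch; the real kernel code operates on its own PE and PHB structures and may apply additional checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel state; not the patch's own types. */
enum phb_model { PHB_MODEL_P7IOC, PHB_MODEL_PHB3, PHB_MODEL_PHB4 };

#define TVE1_SELECT (1ULL << 59) /* bit 59 of the PCI address selects TVE#1 */

struct dma_request {
    uint64_t dma_mask;       /* highest address the device can generate */
    enum phb_model phb;      /* PHB generation the device sits behind */
    bool pe_single_vendor;   /* all devices in the PE share one vendor */
};

/* A device that can drive bit 59 can reach TVE#1 on its own. */
static inline bool can_select_tve1(uint64_t dma_mask)
{
    return (dma_mask & TVE1_SELECT) != 0;
}

/* True when reconfiguring TVE#0 is the only way to give the device >4GB. */
static inline bool wants_tve0_bypass(const struct dma_request *req)
{
    assert(req != NULL);
    if (can_select_tve1(req->dma_mask))
        return false;                    /* the normal TVE#1 path works */
    return (req->dma_mask >> 32) != 0    /* device asks for more than 4GB */
        && req->phb == PHB_MODEL_PHB3    /* POWER8 only */
        && req->pe_single_vendor;        /* no foreign devices in the PE */
}
```

With a 40-bit mask behind PHB3 and a single-vendor PE, `wants_tve0_bypass()` returns true; a 32-bit mask, a different PHB model, or a mixed-vendor PE each make it return false.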

powerpc/powernv/pci: Add helper to check if a PE has a single vendor · a0f98629

Committed by Russell Currey on Jun 21, 2017

Add a helper that determines whether all the devices contained in a given
PE are from the same vendor.  This can be useful in determining
if it's okay to make PE-wide changes that may be suitable for some
devices but not for others.

This is used later in the series.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
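The idea behind such a helper can be sketched as below. This is a simplified model, not the kernel function: `struct fake_dev` is a hypothetical record holding only a vendor ID, whereas the kernel walks real PCI device structures attached to the PE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified device record used only for this sketch. */
struct fake_dev {
    uint16_t vendor;
};

/* Returns true if every device in the array has the same vendor ID. */
static inline bool pe_single_vendor(const struct fake_dev *devs, size_t count)
{
    assert(devs != NULL || count == 0);
    for (size_t i = 1; i < count; i++) {
        if (devs[i].vendor != devs[0].vendor)
            return false;
    }
    /* Empty and single-device PEs count as single-vendor in this sketch. */
    return true;
}
```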

powerpc/powernv/pci: Add support for PHB4 diagnostics · a4b48ba9

Committed by Russell Currey on Jun 14, 2017

As with P7IOC and PHB3, add kernel-side support for decoding and printing
diagnostic data for PHB4.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/powernv/pci: Dynamically allocate PHB diag data · 5cb1f8fd

Committed by Russell Currey on Jun 14, 2017

Diagnostic data for PHBs currently works by allocating a fixed-size buffer.
This is simple, but either wastes memory (though only a few kilobytes) or
in the case of PHB4 isn't enough to fit the whole data blob.

For machines that don't describe the diagnostic data size in the device
tree, use the hardcoded buffer size as before.  For those that do, only
allocate exactly what's needed.

In the special case of P7IOC (which has two types of diag data), the larger
should be specified in the device tree.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
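The sizing rule described above (use the device tree value when present, otherwise fall back to the old fixed size) can be sketched as follows. The fallback constant, the convention that `0` means "property absent", and all names are assumptions made for this illustration; they are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fallback size; the kernel's hardcoded value may differ. */
#define DIAG_FALLBACK_SIZE 8192u

struct diag_buf {
    void *data;
    uint32_t size;
};

/* dt_size: size read from the device tree, or 0 if absent (assumption). */
static inline uint32_t diag_buf_size(uint32_t dt_size)
{
    return dt_size != 0 ? dt_size : DIAG_FALLBACK_SIZE;
}

/* Allocates exactly the chosen size; returns 0 on success, -1 on failure. */
static inline int diag_buf_alloc(struct diag_buf *buf, uint32_t dt_size)
{
    assert(buf != NULL);
    buf->size = diag_buf_size(dt_size);
    buf->data = calloc(1, buf->size);
    if (buf->data == NULL) {
        buf->size = 0;
        return -1;
    }
    return 0;
}
```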

powerpc/powernv/pci: Reduce spam when dumping PEST · 31bbd45a

Committed by Russell Currey on Jun 14, 2017

Dumping the PE State Tables (PEST) can be highly verbose if a number of PEs
are affected, especially in the case where the whole PHB is frozen and 512
lines get printed.  Check for duplicates when dumping the PEST to reduce
useless output.

For example:

    PE[0f8] A/B: 9700002600000000 80000080d00000f8
    PE[0f9] A/B: 8000000000000000 0000000000000000
    PE[..0fe] A/B: as above
    PE[0ff] A/B: 8440002b00000000 0000000000000000

instead of:

    PE[0f8] A/B: 9700002600000000 80000080d00000f8
    PE[0f9] A/B: 8000000000000000 0000000000000000
    PE[0fa] A/B: 8000000000000000 0000000000000000
    PE[0fb] A/B: 8000000000000000 0000000000000000
    PE[0fc] A/B: 8000000000000000 0000000000000000
    PE[0fd] A/B: 8000000000000000 0000000000000000
    PE[0fe] A/B: 8000000000000000 0000000000000000
    PE[0ff] A/B: 8440002b00000000 0000000000000000

and you can imagine how much worse it can get for 512 PEs.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
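The duplicate suppression shown in the example output can be modelled with a small loop. This is a simplified sketch with hypothetical names; it prints every distinct entry and collapses runs of identical ones, and it omits any filtering the kernel applies to decide which entries are worth printing at all.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Print PEST A/B pairs, collapsing each run of identical entries into a
 * single "as above" line. Returns the number of lines written.
 */
static inline int dump_pest(FILE *out, const uint64_t *a, const uint64_t *b,
                            unsigned first_pe, int count)
{
    int lines = 0;
    bool dup = false;

    assert(out != NULL && a != NULL && b != NULL && count >= 0);
    for (int i = 0; i < count; i++) {
        bool same = i > 0 && a[i] == a[i - 1] && b[i] == b[i - 1];

        if (same) {
            dup = true;          /* stay silent while the run continues */
            continue;
        }
        if (dup) {
            /* Close the run, naming the last PE that matched. */
            fprintf(out, "PE[..%03x] A/B: as above\n", first_pe + (unsigned)i - 1);
            lines++;
            dup = false;
        }
        fprintf(out, "PE[%03x] A/B: %016" PRIx64 " %016" PRIx64 "\n",
                first_pe + (unsigned)i, a[i], b[i]);
        lines++;
    }
    if (dup) {
        /* The table ended in the middle of a run. */
        fprintf(out, "PE[..%03x] A/B: as above\n", first_pe + (unsigned)count - 1);
        lines++;
    }
    return lines;
}
```

Fed the eight entries from the "instead of" listing above with `first_pe` set to `0xf8`, this writes the four-line form shown in the first listing.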

powerpc/tm: Fix comment · 2bafb7ff

Committed by Michael Neuling on May 08, 2017

Update to real function name.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc: Fix asm offsets to point to actual FP and VMX regs · aa9a9516

Committed by Michael Neuling on May 08, 2017

The asm code assumes the FP regs are at the start of fp_state. While
this is true now, it may not always be the case and there is nothing
enforcing it.

This fixes the asm-offsets to point to the actual FP registers inside
the fp_state.  Similarly for VMX.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc: Fix /proc/cpuinfo revision for POWER9 DD2 · 64ebb9a2

Committed by Michael Neuling on Jun 15, 2017

The P9 PVR bits 12-15 don't indicate a revision but instead different
chip configurations.  From BookIV we have:
   Bits      Configuration
    0 :    Scale out 12 cores
    1 :    Scale out 24 cores
    2 :    Scale up  12 cores
    3 :    Scale up  24 cores

DD1 doesn't use this but DD2 does. Linux will mostly use the "Scale
out 24 core" configuration (ie. SMT4 not SMT8) which results in a PVR
of 0x004e1200. The revision in /proc/cpuinfo is hence
reported incorrectly as "18.0".

This patch fixes this to mask off only the relevant bits for the major
revision (ie. bits 8-11) for POWER9.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
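The arithmetic behind the "18.0" report can be checked with a short sketch. The function and struct names are hypothetical; only the PVR value and the bit ranges come from the commit message. Taking bits 8-15 of 0x004e1200 as the major revision yields 0x12, which is 18 in decimal, while restricting it to bits 8-11 yields 2.

```c
#include <assert.h>
#include <stdint.h>

struct cpu_rev {
    unsigned maj;
    unsigned min;
};

/* Generic decoding: major revision taken from bits 8-15 of the PVR. */
static inline struct cpu_rev rev_generic(uint32_t pvr)
{
    struct cpu_rev r = { (pvr >> 8) & 0xFF, pvr & 0xFF };
    return r;
}

/* POWER9 decoding: only bits 8-11 hold the major revision. */
static inline struct cpu_rev rev_power9(uint32_t pvr)
{
    struct cpu_rev r = { (pvr >> 8) & 0x0F, pvr & 0xFF };
    return r;
}

/* Bits 12-15 select the chip configuration on POWER9 DD2. */
static inline unsigned power9_config(uint32_t pvr)
{
    return (pvr >> 12) & 0x0F;
}

/* Worked example using the PVR quoted in the commit message. */
static inline void pvr_example(void)
{
    assert(rev_generic(0x004e1200u).maj == 18); /* the misleading "18.0" */
    assert(rev_power9(0x004e1200u).maj == 2);   /* DD2 */
    assert(power9_config(0x004e1200u) == 1);    /* scale out, 24 cores */
}
```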

23 Jun, 2017 1 commit

powerpc/mm: Trace tlbie(l) instructions · 0428491c

Committed by Balbir Singh on Apr 11, 2017

Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate
Entry (Local)) instructions.

The tlbie instruction has changed over the years, so not all versions
accept the same operands. Use the ISA v3 field operands because they are
the most verbose; we may change them in future.

Example output:

  qemu-system-ppc-5371  [016]  1412.369519: tlbie:
  	tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Add some missing trace_tlbie()s, reword change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

22 Jun, 2017 1 commit

powerpc: Convert VDSO update function to use new update_vsyscall interface · d4cfb113

Committed by Paul Mackerras on May 27, 2017

This converts the powerpc VDSO time update function to use the new
interface introduced in commit 576094b7 ("time: Introduce new
GENERIC_TIME_VSYSCALL", 2012-09-11).  Where the old interface gave
us the time as of the last update in seconds and whole nanoseconds,
with the new interface we get the nanoseconds part effectively in
a binary fixed-point format with tk->tkr_mono.shift bits to the
right of the binary point.

With the old interface, the fractional nanoseconds got truncated,
meaning that the value returned by the VDSO clock_gettime function
would have about 1ns of jitter in it compared to the value computed
by the generic timekeeping code in the kernel.

The powerpc VDSO time functions (clock_gettime and gettimeofday)
already work in units of 2^-32 seconds, or 0.23283 ns, because that
makes it simple to split the result into seconds and fractional
seconds, and represent the fractional seconds in either microseconds
or nanoseconds.  This is good enough accuracy for now, so this patch
avoids changing how the VDSO works or the interface in the VDSO data
page.

This patch converts the powerpc update_vsyscall_old to be called
update_vsyscall and use the new interface.  We convert the fractional
second to units of 2^-32 seconds without truncating to whole nanoseconds.
(There is still a conversion to whole nanoseconds for any legacy users
of the vdso_data/systemcfg stamp_xtime field.)

In addition, this improves the accuracy of the computation of tb_to_xs
for those systems with high-frequency timebase clocks (>= 268.5 MHz)
by doing the right shift in two parts, one before the multiplication and
one after, rather than doing the right shift before the multiplication.
(We can't do all of the right shift after the multiplication unless we
use 128-bit arithmetic.)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
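To illustrate why keeping the fractional nanoseconds matters, the sketch below converts a fixed-point nanosecond value (with `shift` fractional bits) into units of 2^-32 seconds in two ways: directly, and after first truncating to whole nanoseconds. The function names and the bit-by-bit division are choices made for this example, not the kernel's implementation; the long division is used only to stay within 64-bit arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * ns_fixed holds nanoseconds with `shift` fractional bits and must be
 * less than one second. Returns floor(ns_fixed * 2^32 / (1e9 << shift)),
 * i.e. the fraction of a second in units of 2^-32 s, computed by binary
 * long division so no 128-bit intermediate is needed.
 */
static inline uint32_t frac_sec_keep_fraction(uint64_t ns_fixed, unsigned shift)
{
    assert(shift < 32);
    uint64_t one_sec = NSEC_PER_SEC << shift;
    assert(ns_fixed < one_sec);

    uint64_t rem = ns_fixed;
    uint32_t q = 0;
    for (int bit = 0; bit < 32; bit++) {
        rem <<= 1;               /* rem < one_sec < 2^61, so no overflow */
        q <<= 1;
        if (rem >= one_sec) {
            rem -= one_sec;
            q |= 1u;
        }
    }
    return q;
}

/* Old-style path: drop the fractional nanoseconds before converting. */
static inline uint32_t frac_sec_whole_ns(uint64_t ns_fixed, unsigned shift)
{
    assert(shift < 32);
    uint64_t ns = ns_fixed >> shift;     /* truncation happens here */
    return (uint32_t)((ns << 32) / NSEC_PER_SEC);
}
```

For an input of 511 with `shift` equal to 8 (just under 2 ns), the first function returns 8 units while the truncating one returns 4, which is the roughly 1 ns discrepancy the commit message describes.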

21 Jun, 2017 4 commits

powerpc/time: Fix tracing in time.c · 6b847d79

Committed by Santosh Sivaraj on Jun 20, 2017

Since trace_clock is in a different file and already marked with notrace,
enable tracing in time.c by removing it from the disabled list in Makefile.
Also annotate clocksource read functions and sched_clock with notrace.

Testing: Timer and ftrace selftests run with different trace clocks.
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Rename slb_allocate_realmode() to slb_allocate() · fd88b945

Committed by Michael Ellerman on Jun 19, 2017

As for slb_miss_realmode(), rename slb_allocate_realmode() to avoid
confusion over whether it runs in real or virtual mode - it runs in
both.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

powerpc/64s: Rename slb_miss_realmode() to slb_miss_common() · 442b6e8e

Committed by Michael Ellerman on Jun 19, 2017

slb_miss_realmode() doesn't always run in real mode, which is what the
name implies. So rename it to avoid confusing people.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

powerpc/64s: Use BRANCH_TO_COMMON() for slb_miss_realmode · b102063b

Committed by Michael Ellerman on Jun 19, 2017

All the callers of slb_miss_realmode currently open code the #ifndef
CONFIG_RELOCATABLE check and the branch via CTR in the RELOCATABLE case.
We have a macro to do this, BRANCH_TO_COMMON(), so use it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

20 Jun, 2017 9 commits

powerpc/64s/paca: EX_CTR is not used with RELOCATABLE=n, remove it · 8568f1e0

Committed by Nicholas Piggin on May 21, 2017

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s/paca: EX_R3 can be merged with EX_DAR · 635942ae

Committed by Nicholas Piggin on May 21, 2017

EX_R3 is used only for a small section of the bad stack handler.
Merge it with EX_DAR.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s/paca: EX_LR can be merged with EX_DAR · dbeea1d6

Committed by Nicholas Piggin on May 21, 2017

EX_LR is used only for a small section of the SLB miss handler.
Merge it with EX_DAR.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s/paca: EX_SRR0 is unused, remove it · 36670fcf

Committed by Nicholas Piggin on May 21, 2017

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Add EX_SIZE definition for paca exception save areas · 8c388514

Committed by Nicholas Piggin on May 21, 2017

Rather than open-coding it 4 times.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Move __ASSEMBLY__ guards into head-64.h where they're really needed]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Avoid r3 save/restore in SLB miss handler · 4d7cd3b9

Committed by Nicholas Piggin on May 21, 2017

The SLB miss handler uses r3 for the faulting address but r12 is
mostly able to be freed up to save r3 in. It just requires SRR1
be reloaded again on error.

It would be more conventional to use r12 for SRR1 (and use r11 to
save r3), but slb_allocate_realmode clobbers r11 and not r12.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: SLB miss already has CTR saved for relocatable kernel · fe5482c0

Committed by Nicholas Piggin on May 21, 2017

The EXCEPTION_PROLOG_1 used by SLB miss already saves CTR when the
kernel is built with CONFIG_RELOCATABLE. So it does not have to be
saved and reloaded when branching to slb_miss_realmode. It can be
restored from the PACA as usual.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Avoid saving faulting address into EX_DAR in SLB miss · 7c28f048

Committed by Nicholas Piggin on May 21, 2017

The EX_DAR save area is only used in exceptional cases. With r3 no
longer clobbered by slb_allocate_realmode, saving the faulting address to
EX_DAR can be deferred to those cases.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Preserve r3 in slb_allocate_realmode() · d59afffd

Committed by Nicholas Piggin on May 21, 2017

One fewer register clobbered by this function means the SLB miss
handler can save one fewer.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

19 Jun, 2017 9 commits

powerpc/64s/idle: Run latch switch is done with MSR[EE]=0 · 40d24343

Committed by Nicholas Piggin on Jun 13, 2017

In the idle sleep/wake code we know that MSR[EE] is clear, so we can
avoid 2 x mfmsr and 2 x mtmsr by calling the double-underscore
versions of the run latch routines which assume interrupts are already
disabled.
Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s/idle: Predict HMI wakeup as unlikely · 95acdc07

Committed by Nicholas Piggin on Jun 13, 2017

In a busy system, idle wakeups can be expected from IPIs and device
interrupts.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s/idle: Avoid SRR usage in idle sleep/wake paths · 9d292501

Committed by Nicholas Piggin on Jun 13, 2017

Idle code now always runs at the 0xc... effective address whether
in real or virtual mode. This means rfid can be ditched, along
with a lot of SRR manipulations.

In the wakeup path, carry SRR1 around in r12. Use mtmsrd to change
MSR states as required.

This also balances the return prediction for the idle call, by
doing blr rather than rfid to return to the idle caller.

On POWER9, 2-process context switch on different cores, with snooze
disabled, increases performance by 2%.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Incorporate v2 fixes from Nick]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s/idle: Branch to handler with virtual mode offset · b51351e2

Committed by Nicholas Piggin on Jun 13, 2017

Branch to the system reset idle wakeup handlers in real mode with the
0xc... kernel address applied. This allows a simplification: the wakeup
handler can avoid rfid when switching to virtual mode.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Don't unbalance the return branch predictor in __replay_interrupt() · b48bbb82

Committed by Nicholas Piggin on Jun 13, 2017

The __replay_interrupt() code is branched to with bl, but the caller is
returned to directly with rfid from the interrupt.

Instead, rfid to a stub that returns to the caller with blr, which
should keep the return branch predictor balanced.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: msgclr when handling doorbell exceptions from system reset · a9af97aa

Committed by Nicholas Piggin on Jun 13, 2017

msgsnd doorbell exceptions are cleared when the doorbell interrupt is
taken. However if a doorbell exception causes a system reset interrupt
wake from power saving state, the message is not cleared. Processing
the doorbell from the system reset interrupt requires msgclr to avoid
taking the exception again.

Testing this plus the previous wakeup direct patch gives:

                                original         wakeup direct     msgclr
Different threads, same core:   315k/s           264k/s            345k/s
Different cores:                235k/s           242k/s            242k/s

Net speedup is +10% for same core, and +3% for different core.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s/idle: Process interrupts from system reset wakeup · 771d4304

Committed by Nicholas Piggin on Jun 13, 2017

When the CPU wakes from low power state, it begins at the system reset
interrupt with the exception that caused the wakeup encoded in SRR1.

Today, powernv idle wakeup ignores the wakeup reason (except a special
case for HMI), and the regular interrupt corresponding to the
exception will fire after the idle wakeup exits.

Change this to replay the interrupt from the idle wakeup before
interrupts are hard-enabled.

Test on POWER8 of context_switch selftests benchmark with polling idle
disabled (e.g., always nap, giving cross-CPU IPIs) gives the following
results:

                                original         wakeup direct
Different threads, same core:   315k/s           264k/s
Different cores:                235k/s           242k/s

There is a slowdown for doorbell IPI (same core) case because system
reset wakeup does not clear the message and the doorbell interrupt
fires again needlessly.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/powernv: Simplify lazy IRQ handling in CPU offline · 2525db04

Committed by Nicholas Piggin on Jun 13, 2017

Rather than concern ourselves with any soft-mask logic in the CPU
hotplug handler, just hard disable interrupts. This ensures there
are no lazy-irqs pending, which means we can call directly to idle
instruction in order to sleep.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s/idle: Move soft interrupt mask logic into C code · 2201f994

Committed by Nicholas Piggin on Jun 13, 2017

This simplifies the asm and fixes irq-off tracing over sleep
instructions.

Also move powersave_nap check for POWER8 into C code, and move
PSSCR register value calculation for POWER9 into C.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

15 Jun, 2017 8 commits

drivers/watchdog/Kconfig: Update CONFIG_WATCHDOG_RTAS dependencies · 42bed042

Committed by Murilo Opsfelder Araujo on May 29, 2017

drivers/watchdog/wdrtas.c uses symbols defined in arch/powerpc/kernel/rtas.c,
which are exported iff CONFIG_PPC_RTAS is selected. Building wdrtas.c without
setting CONFIG_PPC_RTAS throws the following errors:

ERROR: ".rtas_token" [drivers/watchdog/wdrtas.ko] undefined!
ERROR: "rtas_data_buf" [drivers/watchdog/wdrtas.ko] undefined!
ERROR: "rtas_data_buf_lock" [drivers/watchdog/wdrtas.ko] undefined!
ERROR: ".rtas_get_sensor" [drivers/watchdog/wdrtas.ko] undefined!
ERROR: ".rtas_call" [drivers/watchdog/wdrtas.ko] undefined!

This was identified during a randconfig build where CONFIG_WATCHDOG_RTAS=m and
CONFIG_PPC_RTAS was not set. Logs are here:

http://kisskb.ellerman.id.au/kisskb/buildresult/12982152/

This patch fixes the issue by updating CONFIG_WATCHDOG_RTAS to depend on just
CONFIG_PPC_RTAS, removing COMPILE_TEST entirely.
Signed-off-by: Murilo Opsfelder Araujo <mopsfelder@gmail.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Avoid cpabort in context switch when possible · 07d2a628

Committed by Nicholas Piggin on Jun 09, 2017

The ISA v3.0B copy-paste facility only requires cpabort when switching
to a process that has foreign real addresses mapped (direct access to
accelerators), to clear a potential copy buffer filled by a previous
thread. There is no accelerator driver implemented yet, so cpabort can
be removed. It can be re-added when a driver is implemented.

POWER9 DD1 requires the copy buffer to always be cleared on context
switch, but if accelerators are not in use, then an unpaired copy from
a dummy region is sufficient to clear data out of the copy buffer.

This increases context switch performance by about 5% on POWER9.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64: Drop explicit hwsync in context switch · 9145effd

Committed by Nicholas Piggin on Jun 09, 2017

The sync (aka. hwsync, aka. heavyweight sync) in the context switch
code to prevent MMIO access being reordered from the point of view of
a single process if it gets migrated to a different CPU is not
required because there is an hwsync performed earlier in the context
switch path.

Comment this so it's clear enough if anything changes on the scheduler
or the powerpc sides. Remove the hwsync from _switch.

This improves context switch performance by 2-3% on POWER8.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64: Drop reservation-clearing ldarx in context switch · 837e72f7

Committed by Nicholas Piggin on Jun 09, 2017

There is no need to explicitly break the reservation in _switch,
because we are guaranteed that the context switch path will include a
larx/stcx.

Comment the guarantee and remove the reservation clear from _switch.

This is worth 1-2% in context switch performance.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Leave interrupts hard enabled in context switch for radix · e4c0fc5f

Committed by Nicholas Piggin on Jun 09, 2017

Commit 4387e9ff25 ("[POWERPC] Fix PMU + soft interrupt disable bug")
hard disabled interrupts over the low level context switch, because
the SLB management can't cope with a PMU interrupt accessing the stack
in that window.

Radix based kernel mapping does not use the SLB so it does not require
interrupts hard disabled here.

This is worth 1-2% in context switch performance on POWER9.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64: Avoid restore_math call if possible in syscall exit · bc4f65e4

Committed by Nicholas Piggin on Jun 09, 2017

The syscall exit code that branches to restore_math is quite heavy on
Book3S, consisting of 2 mtmsr instructions. Threads that don't use both
FP and vector can get caught here if the kernel ever uses FP or vector.
Lazy-FP/vec context switching also trips this case.

So check for lazy FP and vector before switching RI for restore_math.
Move most of this case out of line.

For threads that do want to restore math registers, the MSR switches are
still suboptimal. Future direction may be to use a soft-RI bit to avoid
MSR switches in kernel (similar to soft-EE), but for now at least the
no-restore

POWER9 context switch rate increases by about 5% due to sched_yield(2)
return performance. I haven't constructed a test to measure the syscall
cost.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/64s: Optimize hypercall/syscall entry · acd7d8ce

Committed by Nicholas Piggin on Jun 09, 2017

After bc355125 ("powerpc/64: Allow for relocation-on interrupts from
guest to host"), a getppid() system call goes from 307 cycles to 358
cycles (+17%) on POWER8. This is due significantly to the scratch SPR
used by the hypercall check.

It turns out there are some volatile registers common to both system
call and hypercall (in particular, r12, cr0, ctr), which can be used to
avoid the SPR and some other overheads. This brings getppid to 320 cycles
(+4%).

Testing hcall entry performance by running "sc 1" in guest userspace
before this patch is 854 cycles, afterwards is 826. Also a small win
there.

POWER9 syscall is improved by about the same amount, hcall not tested.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/mm/radix: Only add X for pages overlapping kernel text · 9abcc981

Committed by Michael Ellerman on Jun 06, 2017

Currently we map the whole linear mapping with PAGE_KERNEL_X. Instead we
should check if the page overlaps the kernel text and only then add
PAGE_KERNEL_X.

Note that we still use 1G pages if they're available, so this will
typically still result in a 1G executable page at KERNELBASE. So this fix is
primarily useful for catching stray branches to high linear mapping addresses.

Without this patch, we can execute at 1G in xmon using:

  0:mon> m c000000040000000
  c000000040000000  00 l
  c000000040000000  00000000 01006038
  c000000040000004  00000000 2000804e
  c000000040000008  00000000 x
  0:mon> di c000000040000000
  c000000040000000  38600001      li      r3,1
  c000000040000004  4e800020      blr
  0:mon> p c000000040000000
  return value is 0x1

After the patch, we get a 400 exception as expected:

  0:mon> p c000000040000000
  *** 400 exception occurred

Fixes: 2bfd65e4 ("powerpc/mm/radix: Add radix callbacks for early init routines")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
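The per-page decision described in the commit above amounts to a range overlap test. The sketch below uses hypothetical names (`enum page_prot`, `prot_for_page`) and plain integers in place of the kernel's page protection types; the text range in the usage note is likewise an assumed value chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for PAGE_KERNEL and PAGE_KERNEL_X; not the kernel's types. */
enum page_prot { PG_KERNEL, PG_KERNEL_X };

/* Half-open ranges [s1, e1) and [s2, e2) overlap if each starts before the other ends. */
static inline bool ranges_overlap(uint64_t s1, uint64_t e1,
                                  uint64_t s2, uint64_t e2)
{
    return s1 < e2 && s2 < e1;
}

/* Grant execute permission only to pages that touch the kernel text. */
static inline enum page_prot prot_for_page(uint64_t vaddr, uint64_t page_size,
                                           uint64_t text_start, uint64_t text_end)
{
    assert(page_size != 0);
    return ranges_overlap(vaddr, vaddr + page_size, text_start, text_end)
               ? PG_KERNEL_X
               : PG_KERNEL;
}
```

Assuming kernel text occupies the first 16MB at 0xc000000000000000, the 1G page at that base is mapped executable, while the next 1G page at 0xc000000040000000 (the address used in the xmon session above) is not.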
