提交 · bd00a240dc52e28706fbbe3aceda63e6c291b433 · openanolis / cloud-kernel

12 9月, 2016 1 次提交

powerpc/powernv: Fix restore of SPRs upon wake up from hypervisor state loss · bd00a240

由 Gautham R. Shenoy 提交于 9月 07, 2016

pnv_wakeup_tb_loss() currently expects cr4 to be "eq" if the CPU is
waking up from a complete hypervisor state loss. Hence, it currently
restores the SPR contents only if cr4 is "eq".

However, after commit bcef83a0 ("powerpc/powernv: Add platform
support for stop instruction"), on ISA v3.0 CPUs, the function
pnv_restore_hyp_resource() sets cr4 to contain the result of the
comparison between the state the CPU has woken up from and the first
deep stop state before calling pnv_wakeup_tb_loss().

Thus if the CPU woke up from a state that is deeper than the first
deep stop state, cr4 will have "gt" set and hence, pnv_wakeup_tb_loss()
will fail to restore the SPRs on waking up from such a state.

Fix the code in pnv_wakeup_tb_loss() to restore the SPR states when cr4
is "eq" or "gt".

Fixes: bcef83a0 ("powerpc/powernv: Add platform support for stop instruction")
Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: NShreyas B. Prabhu <shreyasbp@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

bd00a240

09 8月, 2016 2 次提交

powerpc/powernv: Move IDLE_STATE_ENTER_SEQ macro to cpuidle.h · 98d8821a

由 Mahesh Salgaonkar 提交于 8月 05, 2016

Move IDLE_STATE_ENTER_SEQ macro to cpuidle.h so that MCE handler changes
in subsequent patch can use it.

No functionality change.
Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

98d8821a

powerpc/powernv: Load correct TOC pointer while waking up from winkle. · e325d76f

由 Mahesh Salgaonkar 提交于 8月 05, 2016

The function pnv_restore_hyp_resource() loads the TOC into r2 from
the invalid PACA pointer before fixing r13 value. This do not affect
POWER ISA 3.0 but it does have an impact on POWER ISA 2.07 or less
leading CPU to get stuck forever.

	login: [  471.830433] Processor 120 is stuck.

This can be easily reproducible using following steps:
- Turn off SMT
	$ ppc64_cpu --smt=off
- offline/online any online cpu (Thread 0 of any core which is online)
	$ echo 0 > /sys/devices/system/cpu/cpu<num>/online
	$ echo 1 > /sys/devices/system/cpu/cpu<num>/online

For POWER ISA 2.07 or less, the last bit of HSPRG0 is set indicating
that thread is waking up from winkle. Hence, the last bit of HSPRG0(r13)
needs to be clear before accessing it as PACA to avoid loading invalid
values from invalid PACA pointer.

Fix this by loading TOC after r13 register is corrected.

Fixes: bcef83a0 ("powerpc/powernv: Add platform support for stop instruction")
Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: NVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

e325d76f

01 8月, 2016 1 次提交

powerpc/mm: Make MMU_FTR_RADIX a MMU family feature · 5a25b6f5

由 Aneesh Kumar K.V 提交于 7月 27, 2016

MMU feature bits are defined such that we use the lower half to
present MMU family features. Remove the strict split of half and
also move Radix to a mmu family feature. Radix introduce a new MMU
model and strictly speaking it is a new MMU family. This also free
up bits which can be used for individual features later.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

5a25b6f5

17 7月, 2016 1 次提交

powerpc/opal: Add real mode call wrappers · 69c592ed

由 Benjamin Herrenschmidt 提交于 7月 08, 2016

Replace the old generic opal_call_realmode() with proper per-call
wrappers similar to the normal ones and convert callers.
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

69c592ed

15 7月, 2016 7 次提交

powerpc/powernv: Add platform support for stop instruction · bcef83a0

由 Shreyas B. Prabhu 提交于 7月 08, 2016

POWER ISA v3 defines a new idle processor core mechanism. In summary,
 a) new instruction named stop is added. This instruction replaces
	instructions like nap, sleep, rvwinkle.
 b) new per thread SPR named Processor Stop Status and Control Register
	(PSSCR) is added which controls the behavior of stop instruction.

PSSCR layout:
----------------------------------------------------------
| PLS | /// | SD | ESL | EC | PSLL | /// | TR | MTL | RL |
----------------------------------------------------------
0      4     41   42    43   44     48    54   56    60

PSSCR key fields:
	Bits 0:3  - Power-Saving Level Status. This field indicates the lowest
	power-saving state the thread entered since stop instruction was last
	executed.

	Bit 42 - Enable State Loss
	0 - No state is lost irrespective of other fields
	1 - Allows state loss

	Bits 44:47 - Power-Saving Level Limit
	This limits the power-saving level that can be entered into.

	Bits 60:63 - Requested Level
	Used to specify which power-saving level must be entered on executing
	stop instruction

This patch adds support for stop instruction and PSSCR handling.
Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

bcef83a0

powerpc/powernv: abstraction for saving SPRs before entering deep idle states · 0dfffb48

由 Shreyas B. Prabhu 提交于 7月 08, 2016

Create a function for saving SPRs before entering deep idle states.
This function can be reused for POWER9 deep idle states.
Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

0dfffb48

powerpc/powernv: Make pnv_powersave_common more generic · 4eae2c9a

由 Shreyas B. Prabhu 提交于 7月 08, 2016

pnv_powersave_common does common steps needed before entering idle
state and eventually changes MSR to MSR_IDLE and does rfid to
pnv_enter_arch207_idle_mode.

Move the updation of HSTATE_HWTHREAD_STATE to pnv_powersave_common
from pnv_enter_arch207_idle_mode and make it more generic by passing the
rfid address as a function parameter.
Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

4eae2c9a

powerpc/powernv: Rename reusable idle functions to hardware agnostic names · 5fa6b6bd

由 Shreyas B. Prabhu 提交于 7月 08, 2016

Functions like power7_wakeup_loss, power7_wakeup_noloss,
power7_wakeup_tb_loss are used by POWER7 and POWER8 hardware. They can
also be used by POWER9. Hence rename these functions hardware agnostic
names.
Suggested-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

5fa6b6bd

powerpc/powernv: Rename idle_power7.S to idle_book3s.S · 83289f90

由 Shreyas B. Prabhu 提交于 7月 08, 2016

idle_power7.S handles idle entry/exit for POWER7, POWER8 and in next
patch for POWER9. Rename the file to a non-hardware specific
name.
Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

83289f90

powerpc/kvm: make hypervisor state restore a function · 17065671

由 Shreyas B. Prabhu 提交于 7月 08, 2016

In the current code, when the thread wakes up in reset vector, some
of the state restore code and check for whether a thread needs to
branch to kvm is duplicated. Reorder the code such that this
duplication is avoided.

At a higher level this is what the change looks like-

Before this patch -
power7_wakeup_tb_loss:
	restore hypervisor state
	if (thread needed by kvm)
		goto kvm_start_guest
	restore nvgprs, cr, pc
	rfid to process context

power7_wakeup_loss:
	restore nvgprs, cr, pc
	rfid to process context

reset vector:
	if (waking from deep idle states)
		goto power7_wakeup_tb_loss
	else
		if (thread needed by kvm)
			goto kvm_start_guest
		goto power7_wakeup_loss

After this patch -
power7_wakeup_tb_loss:
	restore hypervisor state
	return

power7_restore_hyp_resource():
	if (waking from deep idle states)
		goto power7_wakeup_tb_loss
	return

power7_wakeup_loss:
	restore nvgprs, cr, pc
	rfid to process context

reset vector:
	power7_restore_hyp_resource()
	if (thread needed by kvm)
                goto kvm_start_guest
	goto power7_wakeup_loss
Reviewed-by: NPaul Mackerras <paulus@samba.org>
Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

17065671

S
powerpc/powernv: Use PNV_THREAD_WINKLE macro while requesting for winkle · bfd1b7ae
由 Shreyas B. Prabhu 提交于 7月 08, 2016
```
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
```
bfd1b7ae

20 6月, 2016 1 次提交

KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt · fd7bacbc

由 Mahesh Salgaonkar 提交于 5月 15, 2016

When a guest is assigned to a core it converts the host Timebase (TB)
into guest TB by adding guest timebase offset before entering into
guest. During guest exit it restores the guest TB to host TB. This means
under certain conditions (Guest migration) host TB and guest TB can differ.

When we get an HMI for TB related issues the opal HMI handler would
try fixing errors and restore the correct host TB value. With no guest
running, we don't have any issues. But with guest running on the core
we run into TB corruption issues.

If we get an HMI while in the guest, the current HMI handler invokes opal
hmi handler before forcing guest to exit. The guest exit path subtracts
the guest TB offset from the current TB value which may have already
been restored with host value by opal hmi handler. This leads to incorrect
host and guest TB values.

With split-core, things become more complex. With split-core, TB also gets
split and each subcore gets its own TB register. When a hmi handler fixes
a TB error and restores the TB value, it affects all the TB values of
sibling subcores on the same core. On TB errors all the thread in the core
gets HMI. With existing code, the individual threads call opal hmi handle
independently which can easily throw TB out of sync if we have guest
running on subcores. Hence we will need to co-ordinate with all the
threads before making opal hmi handler call followed by TB resync.

This patch introduces a sibling subcore state structure (shared by all
threads in the core) in paca which holds information about whether sibling
subcores are in Guest mode or host mode. An array in_guest[] of size
MAX_SUBCORE_PER_CORE=4 is used to maintain the state of each subcore.
The subcore id is used as index into in_guest[] array. Only primary
thread entering/exiting the guest is responsible to set/unset its
designated array element.

On TB error, we get HMI interrupt on every thread on the core. Upon HMI,
this patch will now force guest to vacate the core/subcore. Primary
thread from each subcore will then turn off its respective bit
from the above bitmap during the guest exit path just after the
guest->host partition switch is complete.

All other threads that have just exited the guest OR were already in host
will wait until all other subcores clears their respective bit.
Once all the subcores turn off their respective bit, all threads will
will make call to opal hmi handler.

It is not necessary that opal hmi handler would resync the TB value for
every HMI interrupts. It would do so only for the HMI caused due to
TB errors. For rest, it would not touch TB value. Hence to make things
simpler, primary thread would call TB resync explicitly once for each
core immediately after opal hmi handler instead of subtracting guest
offset from TB. TB resync call will restore the TB with host value.
Thus we can be sure about the TB state.

One of the primary threads exiting the guest will take up the
responsibility of calling TB resync. It will use one of the top bits
(bit 63) from subcore state flags bitmap to make the decision. The first
primary thread (among the subcores) that is able to set the bit will
have to call the TB resync. Rest all other threads will wait until TB
resync is complete. Once TB resync is complete all threads will then
proceed.
Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>

fd7bacbc

03 3月, 2016 1 次提交

powerpc/mm: Move hash related mmu-*.h headers to book3s/ · f64e8084

由 Aneesh Kumar K.V 提交于 3月 01, 2016

No code changes.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

f64e8084

01 12月, 2015 1 次提交

powerpc: Remove UP only lazy floating point and vector optimisations · af1bbc3d

由 Anton Blanchard 提交于 10月 29, 2015

The UP only lazy floating point and vector optimisations were written
back when SMP was not common, and neither glibc nor gcc used vector
instructions. Now SMP is very common, glibc aggressively uses vector
instructions and gcc autovectorises.

We want to add new optimisations that apply to both UP and SMP, but
in preparation for that remove these UP only optimisations.
Signed-off-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

af1bbc3d

07 7月, 2015 1 次提交

powerpc/powernv: Fix race in updating core_idle_state · b32aadc1

由 Shreyas B. Prabhu 提交于 7月 07, 2015

core_idle_state is maintained for each core. It uses 0-7 bits to track
whether a thread in the core has entered fastsleep or winkle. 8th bit is
used as a lock bit.
The lock bit is set in these 2 scenarios-
 - The thread is first in subcore to wakeup from sleep/winkle.
 - If its the last thread in the core about to enter sleep/winkle

While the lock bit is set, if any other thread in the core wakes up, it
loops until the lock bit is cleared before proceeding in the wakeup
path. This helps prevent race conditions w.r.t fastsleep workaround and
prevents threads from switching to process context before core/subcore
resources are restored.

But, in the path to sleep/winkle entry, we currently don't check for
lock-bit. This exposes us to following race when running with subcore
on-

First thread in the subcorea		Another thread in the same
waking up		   		core entering sleep/winkle

lwarx   r15,0,r14
ori     r15,r15,PNV_CORE_IDLE_LOCK_BIT
stwcx.  r15,0,r14
[Code to restore subcore state]

						lwarx   r15,0,r14
						[clear thread bit]
						stwcx.  r15,0,r14

andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
stw     r15,0(r14)

Here, after the thread entering sleep clears its thread bit in
core_idle_state, the value is overwritten by the thread waking up.
In such cases when the core enters fastsleep, code mistakes an idle
thread as running. Because of this, the first thread waking up from
fastsleep which is supposed to resync timebase skips it. So we can
end up having a core with stale timebase value.

This patch fixes the above race by looping on the lock bit even while
entering the idle states.
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus'
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

b32aadc1

01 5月, 2015 1 次提交

powerpc/powernv: Restore non-volatile CRs after nap · 0aab3747

由 Sam Bobroff 提交于 5月 01, 2015

Patches 7cba160a "powernv/cpuidle: Redesign idle states management"
and 77b54e9f "powernv/powerpc: Add winkle support for offline cpus"
use non-volatile condition registers (cr2, cr3 and cr4) early in the system
reset interrupt handler (system_reset_pSeries()) before it has been determined
if state loss has occurred. If state loss has not occurred, control returns via
the power7_wakeup_noloss() path which does not restore those condition
registers, leaving them corrupted.

Fix this by restoring the condition registers in the power7_wakeup_noloss()
case.

This is apparent when running a KVM guest on hardware that does not
support winkle or sleep and the guest makes use of secondary threads. In
practice this means Power7 machines, though some early unreleased Power8
machines may also be susceptible.

The secondary CPUs are taken off line before the guest is started and
they call pnv_smp_cpu_kill_self(). This checks support for sleep
states (in this case there is no support) and power7_nap() is called.

When the CPU is woken, power7_nap() returns and because the CPU is
still off line, the main while loop executes again. The sleep states
support test is executed again, but because the tested values cannot
have changed, the compiler has optimized the test away and instead we
rely on the result of the first test, which has been left in cr3
and/or cr4. With the result overwritten, the wrong branch is taken and
power7_winkle() is called on a CPU that does not support it, leading
to it stalling.

Fixes: 7cba160a ("powernv/cpuidle: Redesign idle states management")
Fixes: 77b54e9f ("powernv/powerpc: Add winkle support for offline cpus")
[mpe: Massage change log a bit more]
Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

0aab3747

23 3月, 2015 1 次提交

powerpc/powernv: Fix return value from power7_nap() et al. · f57333a7

由 Paul Mackerras 提交于 3月 20, 2015

The power7_nap(), power7_sleep() and power7_winkle() functions are
called from pnv_smp_cpu_kill_self(), which expects them to return the
SRR1 value set by the hardware on wakeup, or 0 if no nap/sleep/winkle
occurred. However, in the case where an interrupt needs to be
replayed, the logic in power7_powersave_common (the common code for
power7_nap et al.) doesn't set r3 to 0 in this case. Instead what we
get as the return value is the selector for the type of power-saving
mode requested (1, 2 or 3). In fact this should not affect the
operation of pnv_smp_cpu_kill_self(), but it is better to get this
correct, so this adds an instruction to set r3 to 0 in this case.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

f57333a7

15 12月, 2014 3 次提交

powernv/powerpc: Add winkle support for offline cpus · 77b54e9f

由 Shreyas B. Prabhu 提交于 12月 10, 2014

Winkle is a deep idle state supported in power8 chips. A core enters
winkle when all the threads of the core enter winkle. In this state
power supply to the entire chiplet i.e core, private L2 and private L3
is turned off. As a result it gives higher powersavings compared to
sleep.

But entering winkle results in a total hypervisor state loss. Hence the
hypervisor context has to be preserved before entering winkle and
restored upon wake up.

Power-on Reset Engine (PORE) is a dedicated engine which is responsible
for powering on the chiplet during wake up. It can be programmed to
restore the register contests of a few specific registers. This patch
uses PORE to restore register state wherever possible and uses stack to
save and restore rest of the necessary registers.

With hypervisor state restore things fall under three categories-
per-core state, per-subcore state and per-thread state. To manage this,
extend the infrastructure introduced for sleep. Mainly we add a paca
variable subcore_sibling_mask. Using this and the core_idle_state we can
distingush first thread in core and subcore.
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

77b54e9f

powernv/cpuidle: Redesign idle states management · 7cba160a

由 Shreyas B. Prabhu 提交于 12月 10, 2014

Deep idle states like sleep and winkle are per core idle states. A core
enters these states only when all the threads enter either the
particular idle state or a deeper one. There are tasks like fastsleep
hardware bug workaround and hypervisor core state save which have to be
done only by the last thread of the core entering deep idle state and
similarly tasks like timebase resync, hypervisor core register restore
that have to be done only by the first thread waking up from these
state.

The current idle state management does not have a way to distinguish the
first/last thread of the core waking/entering idle states. Tasks like
timebase resync are done for all the threads. This is not only is
suboptimal, but can cause functionality issues when subcores and kvm is
involved.

This patch adds the necessary infrastructure to track idle states of
threads in a per-core structure. It uses this info to perform tasks like
fastsleep workaround and timebase resync only once per core.
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Originally-by: NPreeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: linux-pm@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

7cba160a

powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode · 8117ac6a

由 Paul Mackerras 提交于 12月 10, 2014

Currently, when going idle, we set the flag indicating that we are in
nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
(or sleep or rvwinkle) instruction, all with the MMU on.  This is bad
for two reasons: (a) the architecture specifies that those instructions
must be executed with the MMU off, and in fact with only the SF, HV, ME
and possibly RI bits set, and (b) this introduces a race, because as
soon as we set the flag, another thread can switch the MMU to a guest
context.  If the race is lost, this thread will typically start looping
on relocation-on ISIs at 0xc...4400.

This fixes it by setting the MSR as required by the architecture before
setting the flag or executing the nap/sleep/rvwinkle instruction.

Cc: stable@vger.kernel.org
[ shreyas@linux.vnet.ibm.com: Edited to handle LE ]
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

8117ac6a

08 12月, 2014 1 次提交

powerpc/powernv: Return to cpu offline loop when finished in KVM guest · 56548fc0

由 Paul Mackerras 提交于 12月 03, 2014

When a secondary hardware thread has finished running a KVM guest, we
currently put that thread into nap mode using a nap instruction in
the KVM code.  This changes the code so that instead of doing a nap
instruction directly, we instead cause the call to power7_nap() that
put the thread into nap mode to return.  The reason for doing this is
to avoid having the KVM code having to know what low-power mode to
put the thread into.

In the case of a secondary thread used to run a KVM guest, the thread
will be offline from the point of view of the host kernel, and the
relevant power7_nap() call is the one in pnv_smp_cpu_disable().
In this case we don't want to clear pending IPIs in the offline loop
in that function, since that might cause us to miss the wakeup for
the next time the thread needs to run a guest.  To tell whether or
not to clear the interrupt, we use the SRR1 value returned from
power7_nap(), and check if it indicates an external interrupt.  We
arrange that the return from power7_nap() when we have finished running
a guest returns 0, so pending interrupts don't get flushed in that
case.

Note that it is important a secondary thread that has finished
executing in the guest, or that didn't have a guest to run, should
not return to power7_nap's caller while the kvm_hstate.hwthread_req
flag in the PACA is non-zero, because the return from power7_nap
will reenable the MMU, and the MMU might still be in guest context.
In this situation we spin at low priority in real mode waiting for
hwthread_req to become zero.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

56548fc0

25 9月, 2014 1 次提交

powerpc/powernv: Don't call generic code on offline cpus · d6a4f709

由 Paul Mackerras 提交于 9月 02, 2014

On PowerNV platforms, when a CPU is offline, we put it into nap mode.
It's possible that the CPU wakes up from nap mode while it is still
offline due to a stray IPI. A misdirected device interrupt could also
potentially cause it to wake up. In that circumstance, we need to clear
the interrupt so that the CPU can go back to nap mode.

In the past the clearing of the interrupt was accomplished by briefly
enabling interrupts and allowing the normal interrupt handling code
(do_IRQ() etc.) to handle the interrupt. This has the problem that
this code calls irq_enter() and irq_exit(), which call functions such
as account_system_vtime() which use RCU internally. Use of RCU is not
permitted on offline CPUs and will trigger errors if RCU checking is
enabled.

To avoid calling into any generic code which might use RCU, we adopt
a different method of clearing interrupts on offline CPUs. Since we
are on the PowerNV platform, we know that the system interrupt
controller is a XICS being driven directly (i.e. not via hcalls) by
the kernel. Hence this adds a new icp_native_flush_interrupt()
function to the native-mode XICS driver and arranges to call that
when an offline CPU is woken from nap. This new function reads the
interrupt from the XICS. If it is an IPI, it clears the IPI; if it
is a device interrupt, it prints a warning and disables the source.
Then it does the end-of-interrupt processing for the interrupt.

The other thing that briefly enabling interrupts did was to check and
clear the irq_happened flag in this CPU's PACA. Therefore, after
flushing the interrupt from the XICS, we also clear all bits except
the PACA_IRQ_HARD_DIS (interrupts are hard disabled) bit from the
irq_happened flag. The PACA_IRQ_HARD_DIS flag is set by power7_nap()
and is left set to indicate that interrupts are hard disabled. This
means we then have to ignore that flag in power7_nap(), which is
reasonable since it doesn't indicate that any interrupt event needs
servicing.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

d6a4f709

05 8月, 2014 2 次提交

powerpc/book3s: Fix endianess issue for HMI handling on napping cpus. · db97efff

由 Mahesh Salgaonkar 提交于 7月 31, 2014

(NOTE: This patch depends on upstream HMI handling patchset at
 https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-July/119731.html)

The current HMI handling on napping cpus does not take care of endianess
issue. On LE host kernel when we wake up from nap due to HMI interrupt we
would checkstop while jumping into opal call. There is a similar issue in
case of fast sleep wakeup where the code invokes opal_resync_tb opal call
without handling LE issue. This patch fixes that as well.

With this patch applied, HMIs handling on LE host kernel works fine.
Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

db97efff

powerpc/book3s: handle HMIs for cpus in nap mode. · bbdb760d

由 Mahesh Salgaonkar 提交于 7月 29, 2014

HMIs are thread specific and can come while thread is in sleep/nap mode.
Hence with SMT=off mode we can receive HMIs on sleeping threads. For
interrupt received in nap mode, cpu wakes up at system reset vector, clears
the interrupt and go back to nap mode again. But HMIs are sticky and they
keep happening until we clear reason bits from HMER. Hence add a special
check for HMI in reset vector (through power7_wakeup_* functions) and
invoke opal call to handle HMI.
Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

bbdb760d

11 7月, 2014 1 次提交

powerpc/powernv: Check for IRQHAPPENED before sleeping · c733cf83

由 Preeti U Murthy 提交于 7月 02, 2014

Commit 8d6f7c5a: "powerpc/powernv: Make it possible to skip the IRQHAPPENED
check in power7_nap()" added code that prevents cpus from checking for
pending interrupts just before entering sleep state, which is wrong. These
interrupts are delivered during the soft irq disabled state of the cpu.

A cpu cannot enter any idle state with pending interrupts because they will
never be serviced until the next time the cpu is woken up by some other
interrupt. Its only then that the pending interrupts are replayed. This can result
in device timeouts or warnings about this cpu being stuck.

This patch fixes ths issue by ensuring that cpus check for pending interrupts
just before entering any idle state as long as they are not in the path of split
core operations.
Signed-off-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

c733cf83

28 5月, 2014 1 次提交

powerpc/powernv: Make it possible to skip the IRQHAPPENED check in power7_nap() · 8d6f7c5a

由 Michael Ellerman 提交于 5月 23, 2014

To support split core we need to be able to force all secondaries into
nap, so the core can detect they are idle and do an unsplit.

Currently power7_nap() will return without napping if there is an irq
pending. We want to ignore the pending irq and nap anyway, we will deal
with the interrupt later.
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Signed-off-by: NMichael Neuling <mikey@neuling.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

8d6f7c5a

23 4月, 2014 1 次提交

powerpc: No need to use dot symbols when branching to a function · b1576fec

由 Anton Blanchard 提交于 2月 04, 2014

binutils is smart enough to know that a branch to a function
descriptor is actually a branch to the functions text address.

Alan tells me that binutils has been doing this for 9 years.
Signed-off-by: NAnton Blanchard <anton@samba.org>

b1576fec

05 3月, 2014 2 次提交

powerpc/powernv: Add OPAL call to resync timebase on wakeup · 97eb001f

由 Vaidyanathan Srinivasan 提交于 2月 26, 2014

During "Fast-sleep" and deeper power savings state, decrementer and
timebase could be stopped making it out of sync with rest
of the cores in the system.

Add a firmware call to request platform to resync timebase
using low level platform methods.
Signed-off-by: NVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: NPreeti U. Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

97eb001f

powerpc/powernv: Add context management for Fast Sleep · aca79d2b

由 Vaidyanathan Srinivasan 提交于 2月 26, 2014

Before adding Fast-Sleep into the cpuidle framework, some low level
support needs to be added to enable it. This includes saving and
restoring of certain registers at entry and exit time of this state
respectively just like we do in the NAP idle state.
Signed-off-by: NVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
[Changelog modified by Preeti U. Murthy <preeti@linux.vnet.ibm.com>]
Signed-off-by: NPreeti U. Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

aca79d2b

05 12月, 2013 1 次提交

powerpc/book3s: Return from interrupt if coming from evil context. · 1c51089f

由 Mahesh Salgaonkar 提交于 10月 30, 2013

We can get machine checks from any context. We need to make sure that
we handle all of them correctly. If we are coming from hypervisor user-space,
we can continue in host kernel in virtual mode to deliver the MC event.
If we got woken up from power-saving mode then we may come in with one of
the following state:
 a. No state loss
 b. Supervisor state loss
 c. Hypervisor state loss
For (a) and (b), we go back to nap again. State (c) is fatal, keep spinning.

For all other context which we not sure of queue up the MCE event and return
from the interrupt.
Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

1c51089f

17 10月, 2013 1 次提交

kvm: powerpc: book3s: Add a new config variable CONFIG_KVM_BOOK3S_HV_POSSIBLE · 9975f5e3

由 Aneesh Kumar K.V 提交于 10月 07, 2013

This help ups to select the relevant code in the kernel code
when we later move HV and PR bits as seperate modules. The patch
also makes the config options for PR KVM selectable
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NAlexander Graf <agraf@suse.de>

9975f5e3

05 9月, 2012 1 次提交

powerpc/powernv: Always go into nap mode when CPU is offline · 375f561a

由 Paul Mackerras 提交于 7月 26, 2012

The CPU hotplug code for the powernv platform currently only puts
offline CPUs into nap mode if the powersave_nap variable is set.
However, HV-style KVM on this platform requires secondary CPU threads
to be offline and in nap mode.  Since we know nap mode works just
fine on all POWER7 machines, and the only machines that support the
powernv platform are POWER7 machines, this changes the code to
always put offline CPUs into nap mode, regardless of powersave_nap.
Powersave_nap still controls whether or not CPUs go into nap mode
when idle, as before.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

375f561a

08 4月, 2012 1 次提交

KVM: PPC: Book3S HV: Make secondary threads more robust against stray IPIs · f0888f70

由 Paul Mackerras 提交于 2月 03, 2012

Currently on POWER7, if we are running the guest on a core and we don't
need all the hardware threads, we do nothing to ensure that the unused
threads aren't executing in the kernel (other than checking that they
are offline). We just assume they're napping and we don't do anything
to stop them trying to enter the kernel while the guest is running.
This means that a stray IPI can wake up the hardware thread and it will
then try to enter the kernel, but since the core is in guest context,
it will execute code from the guest in hypervisor mode once it turns the
MMU on, which tends to lead to crashes or hangs in the host.

This fixes the problem by adding two new one-byte flags in the
kvmppc_host_state structure in the PACA which are used to interlock
between the primary thread and the unused secondary threads when entering
the guest. With these flags, the primary thread can ensure that the
unused secondaries are not already in kernel mode (i.e. handling a stray
IPI) and then indicate that they should not try to enter the kernel
if they do get woken for any reason. Instead they will go into KVM code,
find that there is no vcpu to run, acknowledge and clear the IPI and go
back to nap mode.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

f0888f70

09 3月, 2012 1 次提交

powerpc: Rework lazy-interrupt handling · 7230c564

由 Benjamin Herrenschmidt 提交于 3月 06, 2012

The current implementation of lazy interrupts handling has some
issues that this tries to address.

We don't do the various workarounds we need to do when re-enabling
interrupts in some cases such as when returning from an interrupt
and thus we may still lose or get delayed decrementer or doorbell
interrupts.

The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.

Additionally, we tend to keep interrupts hard disabled in a number
of cases, such as decrementer interrupts, external interrupts, or
when a masked decrementer interrupt is pending. This is sub-optimal.

This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling from the ground up.

The base idea is to replace the "hard_enabled" field with a
"irq_happened" field in which we store a bit mask of what interrupt
occurred while soft-disabled.

When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field.

We then implement replaying of the missed interrupts either by
re-using the existing exception frame (in exception exit case) or via
the creation of a new one from an assembly trampoline (in the
arch_local_irq_enable case).

This removes the need to play with the decrementer to try to create
fake interrupts, among others.

In addition, this adds a few refinements:

 - We no longer  hard disable decrementer interrupts that occur
while soft-disabled. We now simply bump the decrementer back to max
(on BookS) or leave it stopped (on BookE) and continue with hard interrupts
enabled, which means that we'll potentially get better sample quality from
performance monitor interrupts.

 - Timer, decrementer and doorbell interrupts now hard-enable
shortly after removing the source of the interrupt, which means
they no longer run entirely hard disabled. Again, this will improve
perf sample quality.

 - On Book3E 64-bit, we now make the performance monitor interrupt
act as an NMI like Book3S (the necessary C code for that to work
appear to already be present in the FSL perf code, notably calling
nmi_enter instead of irq_enter). (This also fixes a bug where BookE
perfmon interrupts could clobber r14 ... oops)

 - We could make "masked" decrementer interrupts act as NMIs when doing
timer-based perf sampling to improve the sample quality.

Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

v2:

- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
  to retrigger an interrupt without preventing hard-enable

v3:

 - Fix or vs. ori bug on Book3E
 - Fix enabling of interrupts for some exceptions on Book3E

v4:

 - Fix resend of doorbells on return from interrupt on Book3E

v5:

 - Rebased on top of my latest series, which involves some significant
rework of some aspects of the patch.

v6:
 - 32-bit compile fix
 - more compile fixes with various .config combos
 - factor out the asm code to soft-disable interrupts
 - remove the C wrapper around preempt_schedule_irq

v7:
 - Fix a bug with hard irq state tracking on native power7

7230c564

08 12月, 2011 1 次提交

powerpc: Provide a way for KVM to indicate that NV GPR values are lost · 2fde6d20

由 Paul Mackerras 提交于 12月 05, 2011

This fixes a problem where a CPU thread coming out of nap mode can
think it has valid values in the nonvolatile GPRs (r14 - r31) as saved
away in power7_idle, but in fact the values have been trashed because
the thread was used for KVM in the mean time.  The result is that the
thread crashes because code that called power7_idle (e.g.,
pnv_smp_cpu_kill_self()) goes to use values in registers that have
been trashed.

The bit field in SRR1 that tells whether state was lost only reflects
the most recent nap, which may not have been the nap instruction in
power7_idle.  So we need an extra PACA field to indicate that state
has been lost even if SRR1 indicates that the most recent nap didn't
lose state.  We clear this field when saving the state in power7_idle,
we set it to a non-zero value when we use the thread for KVM, and we
test it in power7_wakeup_noloss.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

2fde6d20

12 7月, 2011 1 次提交

KVM: PPC: Allow book3s_hv guests to use SMT processor modes · 371fefd6

由 Paul Mackerras 提交于 6月 29, 2011

This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.

This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.

To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.

When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.

We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.

When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).

It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.

Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.
Signed-off-by: NPaul Mackerras <paulus@samba.org>
Signed-off-by: NAlexander Graf <agraf@suse.de>

371fefd6

20 4月, 2011 1 次提交

powerpc: Add NAP mode support on Power7 in HV mode · 948cf67c

由 Benjamin Herrenschmidt 提交于 1月 24, 2011

Wakeup comes from the system reset handler with a potential loss of
the non-hypervisor CPU state. We save the non-volatile state on the
stack and a pointer to it in the PACA, which the system reset handler
uses to restore things
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

948cf67c

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功