1. 12 Nov 2017, 13 commits
  2. 10 Nov 2017, 8 commits
    • powerpc/64: Set DSCR default initially from SPR · 1696d0fb
      Committed by Nicholas Piggin
      Take the DSCR value set by firmware as the dscr_default value,
      rather than zero.
      
      POWER9 recommends DSCR default to a non-zero value.
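      
      A minimal sketch of the idea, assuming the variable name
      spr_default_dscr and the __init hook mentioned below; mfspr() and
      SPRN_DSCR are the real accessors:
      
        /* Sketch: record the firmware-provided DSCR as the default,
         * rather than leaving the default at zero. */
        static u64 spr_default_dscr;
        
        static void __init record_spr_defaults(void)
        {
                if (early_cpu_has_feature(CPU_FTR_DSCR))
                        spr_default_dscr = mfspr(SPRN_DSCR);
        }
      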
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Make record_spr_defaults() __init]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Avoid waiting for secondary hold spinloop with OPAL · 339a3293
      Committed by Nicholas Piggin
      OPAL boot does not insert secondaries at 0x60 to wait at the secondary
      hold spinloop. Instead they are started later, and inserted at
      generic_secondary_smp_init(), which is after the secondary hold
      spinloop.
      
      Avoid waiting on this spinloop when booting with OPAL firmware. This
      wait always times out in that case.
      
      This saves 100ms boot time on powernv, and tens of seconds of real time
      when booting on the simulator in SMP.
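      
      A sketch of the check, assuming a use_spinloop() helper wrapped around
      the existing smp_release_cpus() wait loop (the helper name is an
      assumption):
      
        static bool use_spinloop(void)
        {
                /* OPAL-started secondaries are inserted at
                 * generic_secondary_smp_init(), after the spinloop. */
                return !firmware_has_feature(FW_FEATURE_OPAL);
        }
        
        void smp_release_cpus(void)
        {
                if (!use_spinloop())
                        return; /* skip the wait that would only time out */
                /* existing code: release and wait on spinning secondaries */
        }
      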
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/radix: Improve TLB flushing for page table freeing · 0b2f5a8a
      Committed by Nicholas Piggin
      Unmaps that free page tables always flush the entire PID, which is
      sub-optimal. Provide TLB range flushing with an additional PWC flush
      that can be used for va range invalidations that also need a PWC flush.
      
           Time to munmap N pages of memory including last level page table
           teardown (after mmap, touch), local invalidate:
           N           1       2      4      8     16     32     64
           vanilla  3.2us  3.3us  3.4us  3.6us  4.1us  5.2us  7.2us
           patched  1.4us  1.5us  1.7us  1.9us  2.6us  3.7us  6.2us
      
           Global invalidate:
           N           1       2      4      8     16      32     64
           vanilla  2.2us  2.3us  2.4us  2.6us  3.2us   4.1us  6.2us
           patched  2.1us  2.5us  3.4us  5.2us  8.7us  15.7us  6.2us
      
      Local invalidates get much better across the board. Global ones have
      the same issue, where multiple tlbies for a va flush become slower than
      the single tlbie that invalidates the entire PID. None of these tests
      capture the TLB benefits of avoiding killing everything.
      
      Global gets worse, but it is brought into line with the global
      invalidate for munmap()s that do not free page tables.
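      
      A sketch of the shape of such a flush, assuming the existing
      __tlbie_va()/__tlbie_pid() helpers and RIC_* flush codes; the function
      name and exact signature are assumptions:
      
        static inline void _tlbie_va_range_pwc(unsigned long start,
                                unsigned long end, unsigned long pid,
                                unsigned long page_size, unsigned long psize)
        {
                unsigned long addr, ap = mmu_get_ap(psize);
        
                asm volatile("ptesync" : : : "memory");
                for (addr = start; addr < end; addr += page_size)
                        __tlbie_va(addr, pid, ap, RIC_FLUSH_TLB);
                /* one PWC-only flush so freed page tables also leave
                 * the page walk cache */
                __tlbie_pid(pid, RIC_FLUSH_PWC);
                asm volatile("eieio; tlbsync; ptesync" : : : "memory");
        }
      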
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/radix: Introduce local single page ceiling for TLB range flush · f6f27951
      Committed by Nicholas Piggin
      The single page flush ceiling is the cut-off point at which we switch
      from invalidating individual pages, to invalidating the entire process
      address space in response to a range flush.
      
      Introduce a local variant of this heuristic because local and global
      tlbie have significantly different properties:
      - Local tlbiel requires 128 instructions to invalidate a PID, global
        tlbie only 1 instruction.
      - Global tlbie instructions are expensive broadcast operations.
      
      The local ceiling has been made much higher, 2x the number of
      instructions required to invalidate the entire PID (i.e., 256 pages).
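      
      A sketch of the resulting heuristic. The helper name is an assumption;
      the local ceiling value is the one given above, and the global default
      of 33 is an assumption consistent with the 34-page switch point noted
      in the Optimize flush_tlb_range commit below:
      
        static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
        static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = 256;
        
        /* Sketch: choose between per-page and full-PID invalidation. */
        static bool want_full_pid_flush(unsigned long nr_pages, bool local)
        {
                if (local)
                        return nr_pages > tlb_local_single_page_flush_ceiling;
                return nr_pages > tlb_single_page_flush_ceiling;
        }
      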
      
           Time to mprotect N pages of memory (after mmap, touch), local invalidate:
           N           32     34      64     128     256     512
           vanilla  7.4us  9.0us  14.6us  26.4us  50.2us  98.3us
           patched  7.4us  7.8us  13.8us  26.4us  51.9us  98.3us
      
      The behaviour of both is identical at N=32 and N=512. Between there,
      the vanilla kernel does a PID invalidate and the patched kernel does
      a va range invalidate.
      
      At N=128, these require the same number of tlbiel instructions, so
      the patched version can be seen to be cheaper when < 128, and more
      expensive when > 128. However, this does not fully capture the cost
      of the invalidated TLB.
      
      The additional cost at 256 pages does not seem prohibitive. It may
      be the case that increasing the limit further would continue to be
      beneficial to avoid invalidating all of the process's TLB entries.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/radix: Optimize flush_tlb_range · cbf09c83
      Committed by Nicholas Piggin
      Currently for radix, flush_tlb_range flushes the entire PID, because
      the Linux mm code does not tell us about page size here for THP vs
      regular pages. This is quite sub-optimal for small mremap / mprotect
      / change_protection.
      
      So implement va range flushes with two flush passes, one for each
      page size (regular and THP). The second flush has an order of magnitude
      fewer tlbie instructions than the first, so it is a relatively small
      additional cost.
      
      There is still room for improvement here with some changes to generic
      APIs: in particular, if there are mostly THP pages to be invalidated,
      the small page flushes could be reduced.
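      
      A sketch of the two-pass structure; __radix_flush_range() is a
      hypothetical helper standing in for the va range flush primitives:
      
        void radix__flush_tlb_range(struct vm_area_struct *vma,
                                    unsigned long start, unsigned long end)
        {
                unsigned long pid = vma->vm_mm->context.id;
        
                /* pass 1: regular page size */
                __radix_flush_range(pid, start, end, PAGE_SIZE);
                /* pass 2: THP size, an order of magnitude fewer tlbies */
                __radix_flush_range(pid, start & PMD_MASK, end, PMD_SIZE);
        }
      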
      
      Time to mprotect 1 page of memory (after mmap, touch):
               local   global
      vanilla  2.9us   1.8us
      patched  1.2us   1.6us
      
      Time to mprotect 30 pages of memory (after mmap, touch):
               local   global
      vanilla  8.2us   7.2us
      patched  6.9us  17.9us
      
      Time to mprotect 34 pages of memory (after mmap, touch):
               local   global
      vanilla  9.1us   8.0us
      patched  9.0us   8.0us
      
      34 pages is the point at which the invalidation switches from va
      to entire PID, which tlbie can do in a single instruction. This is
      why in the case of 30 pages, the new code runs slower for this test.
      This is a deliberate tradeoff already present in the unmap and THP
      promotion code, the idea being that the benefit of avoiding flushing
      the entire TLB for this PID on all threads in the system outweighs
      the cost here.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/radix: Implement _tlbie(l)_va_range flush functions · d665767e
      Committed by Nicholas Piggin
      Move the barriers and range iteration down into the _tlbie* level,
      which improves readability.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/radix: Optimize TLB range flush barriers · 14001c60
      Committed by Nicholas Piggin
      Short range flushes issue a sequence of tlbie(l) instructions for
      individual effective addresses. These do not all require individual
      barrier sequences, only one covering all tlbie(l) instructions.
      
      Commit f7327e0b ("powerpc/mm/radix: Remove unnecessary ptesync")
      made a similar optimization for tlbiel for PID flushing.
      
      For tlbie, the ISA says:
      
          The tlbsync instruction provides an ordering function for the
          effects of all tlbie instructions executed by the thread executing
          the tlbsync instruction, with respect to the memory barrier
          created by a subsequent ptesync instruction executed by the same
          thread.
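      
      A sketch of the local (tlbiel) case after the change: a single barrier
      pair brackets the whole loop, where previously each __tlbiel_va()
      carried its own. The global tlbie case additionally needs
      eieio; tlbsync; ptesync at the end, per the rule quoted above. The
      signature is an assumption:
      
        static inline void _tlbiel_va_range(unsigned long start,
                                unsigned long end, unsigned long pid,
                                unsigned long page_size, unsigned long ap)
        {
                unsigned long addr;
        
                asm volatile("ptesync" : : : "memory");
                for (addr = start; addr < end; addr += page_size)
                        __tlbiel_va(addr, pid, ap, RIC_FLUSH_TLB);
                asm volatile("ptesync" : : : "memory");
        }
      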
      
      Time to munmap 30 pages of memory (after mmap, touch):
               local   global
      vanilla  10.9us  22.3us
      patched   3.4us  14.4us
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • Merge branch 'fixes' into next · a54c61f4
      Committed by Michael Ellerman
      We have some dependencies & conflicts between patches in fixes and
      things to go in next, both in the radix TLB flush code and the IMC PMU
      driver. So merge fixes into next.
  3. 09 Nov 2017, 1 commit
  4. 08 Nov 2017, 1 commit
    • powerpc/xmon: Support dumping software pagetables · 80eff6c4
      Committed by Balbir Singh
      It would be nice to be able to dump page tables in a particular
      context.
      
      eg: dumping vmalloc space:
      
        0:mon> dv 0xd00037fffff00000
        pgd  @ 0xc0000000017c0000
        pgdp @ 0xc0000000017c00d8 = 0x00000000f10b1000
        pudp @ 0xc0000000f10b13f8 = 0x00000000f10d0000
        pmdp @ 0xc0000000f10d1ff8 = 0x00000000f1102000
        ptep @ 0xc0000000f1102780 = 0xc0000000f1ba018e
        Maps physical address = 0x00000000f1ba0000
        Flags = Accessed Dirty Read Write
      
      This patch does not replicate the complex code of dump_pagetable and
      has no support for the bolted linear mapping; that's why it's called
      dump virtual page table support. The format of the PTE can be expanded
      even further to add more useful information about the flags in the PTE
      if required.
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [mpe: Bike shed the output format, show the pgdir, fix build failures]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  5. 07 Nov 2017, 3 commits
  6. 06 Nov 2017, 14 commits
    • powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 PMU workaround on DD2.1 · e3646330
      Committed by Nicholas Piggin
      DD2.1 does not have to save MMCR0 for all state-loss idle states,
      only after deep idle states (like other PMU registers).
      Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/idle: avoid POWER9 DD1 and DD2.0 ERAT workaround on DD2.1 · 9d2f510a
      Committed by Nicholas Piggin
      DD2.1 does not have to flush the ERAT after a state-loss idle.
      
      Performance testing was done on a DD2.1 using only the stop0 idle state
      (the shallowest state which supports state loss), using the
      context_switch selftest configured to ping-pong between two threads on
      the same core, and then on two different cores.
      
      Performance improvement for same core is 7.0%, different cores is 14.8%.
      Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: add POWER9_DD20 feature · b6b3755e
      Committed by Nicholas Piggin
      Cc: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Remove facility loadups on transactional {fp, vec, vsx} unavailable · 6f700d38
      Committed by Cyril Bur
      After handling a transactional FP, Altivec or VSX unavailable
      exception, the return-to-userspace code will detect that the
      TIF_RESTORE_TM bit is set and call restore_tm_state(), which in turn
      calls restore_math() to ensure that the correct facilities are loaded.
      
      This means that all the loadup code in {fp,altivec,vsx}_unavailable_tm()
      is doing pointless work and can simply be removed.
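      
      A rough sketch of that return path (not the exact function body):
      
        /* On return to userspace, with TIF_RESTORE_TM set: */
        void restore_tm_state(struct pt_regs *regs)
        {
                clear_thread_flag(TIF_RESTORE_TM);
                if (!MSR_TM_ACTIVE(regs->msr))
                        return;
        
                /* ... MSR bookkeeping elided ... */
                restore_math(regs); /* reloads FP/Altivec/VSX as required */
        }
      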
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Always save/restore checkpointed regs during treclaim/trecheckpoint · eb5c3f1c
      Committed by Cyril Bur
      Lazy save and restore of FP/Altivec means that a userspace process can
      be sent to userspace with FP or Altivec disabled and loaded only as
      required (by way of an FP/Altivec unavailable exception). Transactional
      Memory complicates this situation as a transaction could be started
      without FP/Altivec being loaded up. This causes the hardware to
      checkpoint incorrect registers. Handling FP/Altivec unavailable
      exceptions while a thread is transactional requires a reclaim and
      recheckpoint to ensure the CPU has correct state for both sets of
      registers.
      
      tm_reclaim() has optimisations to not always save the FP/Altivec
      registers to the checkpointed save area. This was originally done
      because the caller might have information that the checkpointed
      registers aren't valid due to lazy save and restore. We've also been a
      little vague as to how tm_reclaim() leaves the FP/Altivec state since it
      doesn't necessarily always save it to the thread struct. This has led
      to an (incorrect) assumption that it leaves the checkpointed state on
      the CPU.
      
      tm_recheckpoint() has similar optimisations in reverse. It may not
      always reload the checkpointed FP/Altivec registers from the thread
      struct before the trecheckpoint. It is therefore quite unclear where it
      expects to get the state from. This didn't help with the assumption
      made about tm_reclaim().
      
      These optimisations sit in what is by definition a slow path. If a
      process has to go through a reclaim/recheckpoint then its transaction
      will be doomed on returning to userspace. This means that the process
      will be unable to complete its transaction and be forced to its failure
      handler. This is already an out of line case for userspace. Furthermore,
      copying 64 times 128 bits from registers doesn't take very long[0]
      (at all) on modern processors. As such it appears these optimisations
      have only served to increase code complexity and are unlikely to have
      had a measurable performance impact.
      
      Our transactional memory handling has been riddled with bugs. A cause
      of this has been difficulty in following the code flow; code complexity
      has not been our friend here. It makes sense to remove these
      optimisations in favour of a (hopefully) more stable implementation.
      
      This patch does mean that sometimes the assembly will needlessly save
      'junk' registers which will subsequently get overwritten with the
      correct value by the C code which calls the assembly function. This
      small inefficiency is far outweighed by the reduction in complexity for
      general TM code, context switching paths, and the transactional facility
      unavailable exception handler.
      
      0: I tried to measure it once for other work and found that it was
      hiding in the noise of everything else I was working with. I find it
      exceedingly likely this will be the case here.
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Force reload for recheckpoint during tm {fp, vec, vsx} unavailable exception · 91381b9c
      Committed by Cyril Bur
      Lazy save and restore of FP/Altivec means that a userspace process can
      be sent to userspace with FP or Altivec disabled and loaded only as
      required (by way of an FP/Altivec unavailable exception). Transactional
      Memory complicates this situation as a transaction could be started
      without FP/Altivec being loaded up. This causes the hardware to
      checkpoint incorrect registers. Handling FP/Altivec unavailable
      exceptions while a thread is transactional requires a reclaim and
      recheckpoint to ensure the CPU has correct state for both sets of
      registers.
      
      tm_reclaim() has optimisations to not always save the FP/Altivec
      registers to the checkpointed save area. This was originally done
      because the caller might have information that the checkpointed
      registers aren't valid due to lazy save and restore. We've also been a
      little vague as to how tm_reclaim() leaves the FP/Altivec state since it
      doesn't necessarily always save it to the thread struct. This has led
      to an (incorrect) assumption that it leaves the checkpointed state on
      the CPU.
      
      tm_recheckpoint() has similar optimisations in reverse. It may not
      always reload the checkpointed FP/Altivec registers from the thread
      struct before the trecheckpoint. It is therefore quite unclear where it
      expects to get the state from. This didn't help with the assumption
      made about tm_reclaim().
      
      This patch is a minimal fix for ease of backporting. A more correct fix
      which removes the msr parameter to tm_reclaim() and tm_recheckpoint()
      altogether has been upstreamed to apply on top of this patch.
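      
      A sketch of the shape of the minimal fix in, for example,
      fp_unavailable_tm(), assuming orig_msr holds the thread's checkpointed
      MSR:
      
        /* Force the checkpointed FP state to be reloaded from the
         * thread struct at recheckpoint time, rather than assuming it
         * is still live on the CPU: */
        tm_recheckpoint(&current->thread, orig_msr | MSR_FP);
      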
      
      Fixes: dc310669 ("powerpc: tm: Always use fp_state and vr_state to store live registers")
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Don't enable FP/Altivec if not checkpointed · a7771176
      Committed by Cyril Bur
      Lazy save and restore of FP/Altivec means that a userspace process can
      be sent to userspace with FP or Altivec disabled and loaded only as
      required (by way of an FP/Altivec unavailable exception). Transactional
      Memory complicates this situation as a transaction could be started
      without FP/Altivec being loaded up. This causes the hardware to
      checkpoint incorrect registers. Handling FP/Altivec unavailable
      exceptions while a thread is transactional requires a reclaim and
      recheckpoint to ensure the CPU has correct state for both sets of
      registers.
      
      Lazy save and restore of FP/Altivec cannot be done if a process is
      transactional. If a facility was enabled it must remain enabled whenever
      a thread is transactional.
      
      Commit dc16b553 ("powerpc: Always restore FPU/VEC/VSX if hardware
      transactional memory in use") ensures that the facilities are always
      enabled if a thread is transactional. A bug in the introduced code may
      cause it to inadvertently enable a facility that was (and should remain)
      disabled. The problem with this extraneous enablement is that the
      registers for the erroneously enabled facility have not been correctly
      recheckpointed - the recheckpointing code assumed the facility would
      remain disabled.
      
      Further compounding the issue, the transactional {fp,altivec,vsx}
      unavailable code has been incorrectly using the MSR to enable
      facilities. The presence of the {FP,VEC,VSX} bit in regs->msr simply
      indicates whether the registers are live on the CPU, not whether the
      kernel should load them before returning to userspace. This has worked
      due to the bug
      mentioned above.
      
      This causes transactional threads which return to their failure handler
      to observe incorrect checkpointed registers. Perhaps an example will
      help illustrate the problem:
      
      A userspace process is running and uses both FP and Altivec registers.
      This process then continues to run for some time without touching
      either sets of registers. The kernel subsequently disables the
      facilities as part of lazy save and restore. The userspace process then
      performs a tbegin and the CPU checkpoints 'junk' FP and Altivec
      registers. The process then performs a floating point instruction
      triggering a fp unavailable exception in the kernel.
      
      The kernel then loads the FP registers - and only the FP registers.
      Since the thread is transactional it must perform a reclaim and
      recheckpoint to ensure both the checkpointed registers and the
      transactional registers are correct. It then (correctly) enables
      MSR[FP] for the process. Later (on exception exit) the kernel also
      (inadvertently) enables MSR[VEC]. The process is then returned to
      userspace.
      
      Since the act of loading the FP registers doomed the transaction, we
      know the CPU will fail the transaction, restore its checkpointed
      registers, and
      return the process to its failure handler. The problem is that we're
      now running with Altivec enabled and the 'junk' checkpointed registers
      are restored. The kernel had only recheckpointed FP.
      
      This patch solves this by only activating FP/Altivec if userspace was
      using them when it entered the kernel and not simply if the process is
      transactional.
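      
      A sketch of the check (the helper name is an assumption): a facility
      counts as in use only if the thread is transactional and the facility
      bit was set in the checkpointed MSR:
      
        static inline bool tm_active_with_fp(struct task_struct *tsk)
        {
                return MSR_TM_ACTIVE(tsk->thread.regs->msr) &&
                       (tsk->thread.ckpt_regs.msr & MSR_FP);
        }
      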
      
      Fixes: dc16b553 ("powerpc: Always restore FPU/VEC/VSX if hardware transactional memory in use")
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • mtd: powernv_flash: Use opal_async_wait_response_interruptible() · 6f469b67
      Committed by Cyril Bur
      The OPAL calls performed in this driver shouldn't be using
      opal_async_wait_response(), as this performs a wait_event() which, on
      long-running OPAL calls, could result in hung task warnings.
      wait_event() also prevents timely signal delivery, which is
      undesirable.
      
      This patch also quietens the use of dev_err() when errors haven't
      actually occurred, and returns better information up the stack rather
      than always -EIO.
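      
      A sketch of the resulting call site in the driver; the error mapping is
      an assumption based on the description above:
      
        rc = opal_async_wait_response_interruptible(token, &msg);
        if (rc) {
                if (rc == -ERESTARTSYS)
                        rc = -EINTR; /* a signal arrived; let userspace decide */
                goto out;
        }
        /* map the OPAL return code rather than returning a blanket -EIO */
        rc = opal_error_code(opal_get_async_rc(msg));
      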
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Add OPAL_BUSY to opal_error_code() · 77adbd22
      Committed by Cyril Bur
      Also export opal_error_code() so that it can be used in modules.
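      
      A sketch of the mapping with OPAL_BUSY added; the kernel's actual case
      list is longer than shown:
      
        int opal_error_code(int rc)
        {
                switch (rc) {
                case OPAL_SUCCESS:              return 0;
                case OPAL_PARAMETER:            return -EINVAL;
                case OPAL_BUSY:
                case OPAL_BUSY_EVENT:           return -EBUSY;
                case OPAL_NO_MEM:               return -ENOMEM;
                default:                        return -EIO;
                }
        }
        EXPORT_SYMBOL_GPL(opal_error_code);
      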
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/opal: Add opal_async_wait_response_interruptible() to opal-async · 9aab2449
      Committed by Cyril Bur
      This patch adds an _interruptible version of opal_async_wait_response().
      This is useful when a long running OPAL call is performed on behalf of
      a userspace thread, for example, the opal_flash_{read,write,erase}
      functions performed by the powernv-flash MTD driver.
      
      It is foreseeable that these functions could take upwards of two
      minutes, causing the wait_event() to block long enough to trigger hung
      task warnings. Furthermore, wait_event_interruptible() is preferable,
      as otherwise there is no way for signals to stop the process, which is
      going to be confusing in userspace.
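      
      A sketch of the new function, assuming the per-token state introduced
      by the opal-async rework later in this list:
      
        int opal_async_wait_response_interruptible(uint64_t token,
                                                   struct opal_msg *msg)
        {
                if (token >= opal_max_async_tokens || !msg)
                        return -EINVAL;
        
                if (wait_event_interruptible(opal_async_wait,
                                opal_async_tokens[token].state ==
                                ASYNC_TOKEN_COMPLETED))
                        return -ERESTARTSYS; /* interrupted by a signal */
        
                memcpy(msg, &opal_async_tokens[token].response, sizeof(*msg));
                return 0;
        }
        EXPORT_SYMBOL_GPL(opal_async_wait_response_interruptible);
      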
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powernv/opal-sensor: remove not needed lock · 95e1bc1d
      Committed by Stewart Smith
      Parallel sensor reads could run out of async tokens due to
      opal_get_sensor_data grabbing tokens but then doing the sensor
      read behind a mutex, essentially serializing the (possibly
      asynchronous and relatively slow) sensor read.
      
      It turns out that the mutex isn't needed at all. Not only should the
      OPAL interface allow concurrent reads, the implementation is certainly
      safe for that; and if any sensor we were reading from isn't, the
      kernel is the wrong place to do the mutual exclusion anyway, as OPAL
      should be doing it for the kernel.
      
      So, remove the mutex.
      
      Additionally, we shouldn't be printing out an error when we don't
      get a token, as the only way this should happen is if we've been
      interrupted in down_interruptible() on the semaphore.
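      
      A sketch of the resulting flow, with no mutex held and no error print
      on an interrupted token grab; the real driver's error handling has more
      detail:
      
        int opal_get_sensor_data(u32 sensor_hndl, u32 *sensor_data)
        {
                struct opal_msg msg;
                __be32 data;
                int token, ret;
        
                token = opal_async_get_token_interruptible();
                if (token < 0)
                        return token; /* interrupted; nothing to print */
        
                ret = opal_sensor_read(sensor_hndl, token, &data);
                if (ret == OPAL_ASYNC_COMPLETION) {
                        if (opal_async_wait_response(token, &msg))
                                ret = OPAL_INTERNAL_ERROR;
                        else
                                ret = opal_get_async_rc(msg);
                }
                if (ret == OPAL_SUCCESS)
                        *sensor_data = be32_to_cpu(data);
        
                opal_async_release_token(token);
                return opal_error_code(ret);
        }
      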
      Reported-by: Robert Lippert <rlippert@google.com>
      Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/opal: Rework the opal-async interface · 86cd6d98
      Committed by Cyril Bur
      Future work will add an opal_async_wait_response_interruptible()
      which will call wait_event_interruptible(). This work requires extra
      token state to be tracked as wait_event_interruptible() can return and
      the caller could release the token before OPAL responds.
      
      Currently token state is tracked with two bitfields, each 64 bits big,
      though they may not need to be, since OPAL informs Linux how many async
      tokens there are. An array indexed by token also stores the response
      message for each token.
      
      The bitfields make it difficult to add more state, and also impose a
      hard maximum on how many tokens there can be; it is possible that
      OPAL will inform Linux that there are more than 64 tokens.
      
      Rather than add a bitfield to track the extra state, rework the
      internals slightly.
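      
      A sketch of the reworked state: a small state machine per token in
      place of the bitfields (the exact state names are assumptions):
      
        enum opal_async_token_state {
                ASYNC_TOKEN_UNALLOCATED = 0,
                ASYNC_TOKEN_ALLOCATED,
                ASYNC_TOKEN_DISPATCHED,
                ASYNC_TOKEN_ABANDONED,  /* released before OPAL responded */
                ASYNC_TOKEN_COMPLETED,
        };
        
        struct opal_async_token {
                enum opal_async_token_state state;
                struct opal_msg response;
        };
        
        /* sized at boot from the async token count OPAL reports */
        static struct opal_async_token *opal_async_tokens;
      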
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      [mpe: Fix __opal_async_get_token() when no tokens are free]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/opal: Make __opal_async_{get, release}_token() static · 59cf9a1c
      Committed by Cyril Bur
      There are no callers of __opal_async_get_token() or
      __opal_async_release_token() outside of opal-async.c, so make them
      static.
      
      This patch also removes the possibility of an "emergency" synchronous
      call to __opal_async_get_token(); as such, it makes more sense to
      initialise opal_sync_sem for the maximum number of async tokens.
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • mtd: powernv_flash: Don't return -ERESTARTSYS on interrupted token acquisition · efe69414
      Committed by Cyril Bur
      Because the MTD core might split up a read() or write() from userspace
      into several calls to the driver, we may fail to get a token despite
      having already done some work. In that case it is best to return -EINTR
      to userspace and let it decide what to do.
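      
      A sketch of the token-acquisition path in the driver's
      read/write/erase helper:
      
        token = opal_async_get_token_interruptible();
        if (token < 0) {
                if (token != -ERESTARTSYS)
                        dev_err(dev, "Failed to get an async token\n");
                else
                        token = -EINTR; /* earlier chunks may have completed */
                return token;
        }
      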
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>