Commits · ec531d027ab29b0cfa1c80c8af561b0e74bd4283 · openeuler / Kernel

18 May, 2018 · 13 commits

KVM: PPC: Book3S PR: Enable use on POWER9 inside HPT-mode guests · ec531d02

Authored by Paul Mackerras on May 18, 2018

This relaxes the restriction on using PR KVM on POWER9. The existing
code does work inside a guest partition running in HPT mode, because
hypercalls such as H_ENTER use the old HPTE format, not the new
format used by POWER9, and so no change to PR KVM's HPT manipulation
code is required. PR KVM will still refuse to run if the kernel is
using radix translation or if it is running bare-metal.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Send kvmppc_bad_interrupt NMIs to Linux handlers · 7c1bd80c

Authored by Nicholas Piggin on May 18, 2018

It's possible to take a SRESET or MCE in these paths due to a bug
in the host code, an NMI IPI, etc. A recent bug that attempted to load
a virtual address from real mode gave this complete but cryptic error
(abridged):

      Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
      LE SMP NR_CPUS=2048 NUMA PowerNV
      CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted
      NIP:  c0000000000155ac LR: c0000000000c2430 CTR: c000000000015580
      REGS: c000000fff76dd80 TRAP: 0200   Not tainted
      MSR:  9000000000201003 <SF,HV,ME,RI,LE>  CR: 48082222  XER: 00000000
      CFAR: 0000000102900ef0 DAR: d00017fffd941a28 DSISR: 00000040 SOFTE: 3
      NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
      LR [c0000000000c2430] do_tlbies+0x230/0x2f0

Sending the NMIs through the Linux handlers gives a nicer output:

      Severe Machine check interrupt [Not recovered]
        NIP [c0000000000155ac]: perf_trace_tlbie+0x2c/0x1a0
        Initiator: CPU
        Error type: Real address [Load (bad)]
          Effective address: d00017fffcc01a28
      opal: Machine check interrupt unrecoverable: MSR(RI=0)
      opal: Hardware platform error: Unrecoverable Machine Check exception
      CPU: 0 PID: 6700 Comm: qemu-system-ppc Tainted: G   M
      NIP:  c0000000000155ac LR: c0000000000c23c0 CTR: c000000000015580
      REGS: c000000fff9e9d80 TRAP: 0200   Tainted: G   M
      MSR:  9000000000201001 <SF,HV,ME,LE>  CR: 48082222  XER: 00000000
      CFAR: 000000010cbc1a30 DAR: d00017fffcc01a28 DSISR: 00000040 SOFTE: 3
      NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
      LR [c0000000000c23c0] do_tlbies+0x1c0/0x280
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Fix kvmppc_bad_host_intr for real mode interrupts · eadce3b4

Authored by Nicholas Piggin on May 18, 2018

When CONFIG_RELOCATABLE=n, the Linux real mode interrupt handlers call
into KVM using a real address. This needs to be translated to the kernel
linear effective address before the MMU is switched on.

kvmppc_bad_host_intr misses adding these bits, so when it is used to
handle a system reset interrupt (that always gets delivered in real
mode), it results in an instruction access fault immediately after
the MMU is turned on.

Fix this by ensuring the top 2 address bits are set when the MMU is
turned on.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
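
The address arithmetic behind "the top 2 address bits" can be sketched as follows. This is only a model of the calculation, not the assembly change in the patch; it assumes the usual 64-bit PowerPC layout in which the kernel linear mapping starts at 0xc000000000000000, and the constant and helper names are made up for illustration.

```python
# Sketch only: models the address arithmetic, not the actual fix.
# Assumption: the kernel linear map starts at 0xc000000000000000,
# i.e. the top two bits of the 64-bit effective address are set.
LINEAR_MAP_BASE = 0xC000000000000000
MASK_64 = 0xFFFFFFFFFFFFFFFF


def real_to_linear(real_addr: int) -> int:
    """Return the linear-map effective address for a real address."""
    return (real_addr | LINEAR_MAP_BASE) & MASK_64


# A handler entered at real address 0x155ac would continue at this
# effective address once the MMU is on.
print(hex(real_to_linear(0x155AC)))  # prints 0xc0000000000155ac
```

Applying the helper twice gives the same result, which is why setting the bits unconditionally is harmless when the address is already an effective address.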

KVM: PPC: Book3S HV: radix: Do not clear partition PTE when RC or write bits do not match · 878cf2bb

Authored by Nicholas Piggin on May 17, 2018

Adding the write bit and RC bits to pte permissions does not require a
pte clear and flush. There should not be other bits changed here,
because restricting access or changing the PFN must have already
invalidated any existing ptes (otherwise the race is already lost).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: radix: Refine IO region partition scope attributes · bc64dd0e

Authored by Nicholas Piggin on May 17, 2018

When the radix fault handler has no page from the process address
space (e.g., for IO memory), it looks up the process pte and uses it to
set the partition table pte, so that attributes like CI and guarded
carry over. If the process table entry is to be writable, set
_PAGE_DIRTY as well to avoid an RC update. If not, then ensure
_PAGE_DIRTY does not come across. Set _PAGE_ACCESSED as well to avoid
an RC update.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
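
The flag handling described in this commit can be modelled roughly as below. The bit values and the function name are illustrative assumptions, not copied from the kernel headers or from the patch.

```python
# Sketch only. Bit values are illustrative stand-ins for the kernel's
# _PAGE_WRITE, _PAGE_DIRTY and _PAGE_ACCESSED definitions.
PAGE_WRITE = 0x002
PAGE_DIRTY = 0x080
PAGE_ACCESSED = 0x100


def io_partition_pte_flags(process_pte_flags: int) -> int:
    """Derive partition-scope flags from a process-scope pte (IO mapping)."""
    flags = process_pte_flags | PAGE_ACCESSED  # pre-set A: no RC update later
    if flags & PAGE_WRITE:
        flags |= PAGE_DIRTY                    # writable: pre-set D as well
    else:
        flags &= ~PAGE_DIRTY                   # read-only: D must not carry over
    return flags
```

Other attribute bits in the input (for example cache-inhibited or guarded) pass through unchanged in this model.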

KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on · 9a4506e1

Authored by Nicholas Piggin on May 17, 2018

The radix guest code has fewer restrictions about what context it
can run in, so move this flushing out of assembly and have it use the
Linux TLB flush implementations introduced previously.

This allows powerpc:tlbie trace events to be used.

This changes the tlbiel sequence to only execute RIC=2 flush once on
the first set flushed, then RIC=0 for the rest of the sets. The end
result of the flush should be unchanged. This matches the local PID
flush pattern that was introduced in a5998fcb ("powerpc/mm/radix:
Optimise tlbiel flush all case").
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
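
The changed tlbiel pattern (RIC=2 once, then RIC=0) can be written down as a small generator. This is a model of the ordering only; the function name is hypothetical and the number of TLB sets is left as a parameter because it depends on the processor.

```python
# Sketch only: yields the (set, RIC) pairs in the order described above.
def tlbiel_ric_sequence(num_sets: int):
    """RIC=2 on the first set flushed, RIC=0 on every remaining set."""
    for set_index in range(num_sets):
        yield set_index, (2 if set_index == 0 else 0)


# Small example with 4 sets.
seq = list(tlbiel_ric_sequence(4))
```

The idea is that the first RIC=2 operation already invalidates the non-TLB cached translation state, so repeating it for every set would be redundant work.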

KVM: PPC: Book3S HV: Make radix use the Linux translation flush functions for partition scope · d91cb39f

Authored by Nicholas Piggin on May 17, 2018

This has the advantage of consolidating TLB flush code in fewer
places, and it also implements powerpc:tlbie trace events.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Recursively unmap all page table entries when unmapping · a5704e83

Authored by Nicholas Piggin on May 17, 2018

When partition scope mappings are unmapped with kvm_unmap_radix, the
pte is cleared, but the page table structure is left in place. If the
next page fault requests a different page table geometry (e.g., due to
THP promotion or split), kvmppc_create_pte is responsible for changing
the page tables.

When a page table entry is to be converted to a large pte, the page
table entry is cleared, the PWC flushed, then the page table it points
to freed. This will cause pte page tables to leak when a 1GB page is
to replace a pud entry that points to a pmd table with pte tables under
it: the pmd table will be freed, but its pte tables will be missed.

Fix this by replacing the simple clear and free code with one that
walks down the page tables and frees children. Care must be taken to
clear the root entry being unmapped then flushing the PWC before
freeing any page tables, as explained in comments.

This requires PWC flush to logically become a flush-all-PWC (which it
already is in hardware, but the KVM API needs to be changed to avoid
confusion).

This code also checks that no unexpected pte entries exist in any page
table being freed, and unmaps those and emits a WARN. This is an
expensive operation for the pte page level, but partition scope
changes are rare, so it's unconditional for now to iron out bugs. It
can be put under a CONFIG option or removed after some time.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
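
The ordering this commit insists on (clear the root entry, flush the PWC, then free the whole subtree) can be illustrated with a toy page-table model. The `Table` class and function names are hypothetical and the model ignores locking and the WARN on leftover ptes.

```python
# Sketch only: a toy page-table tree, not kernel code.
class Table:
    def __init__(self, name, entries=None):
        self.name = name
        # index -> Table (next level down) or int (leaf pte value)
        self.entries = entries or {}


def free_subtree(table, freed):
    """Free children first, then the table itself."""
    for child in table.entries.values():
        if isinstance(child, Table):
            free_subtree(child, freed)
    freed.append(table.name)


def unmap_entry(parent, index, events):
    entry = parent.entries.pop(index, None)  # 1. clear the root entry
    events.append("clear")
    events.append("flush_pwc")               # 2. flush before freeing anything
    if isinstance(entry, Table):
        freed = []
        free_subtree(entry, freed)            # 3. walk down, free every level
        events.extend("free " + name for name in freed)


# A pud entry pointing at a pmd table that has two pte tables under it.
pud = Table("pud", {0: Table("pmd", {0: Table("pte0", {5: 0x1000}),
                                     1: Table("pte1")})})
events = []
unmap_entry(pud, 0, events)
```

In this model both pte tables are freed along with the pmd table, which is the leak the simple clear-and-free code would have missed.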

KVM: PPC: Book3S HV: Use a helper to unmap ptes in the radix fault path · a5fad1e9

Authored by Nicholas Piggin on May 17, 2018

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Lockless tlbie for HPT hcalls · b7557451

Authored by Nicholas Piggin on May 17, 2018

tlbies to an LPAR do not have to be serialised since POWER4/PPC970,
after which the MMU_FTR_LOCKLESS_TLBIE feature was introduced to
avoid tlbie locking.

Since commit c17b98cf ("KVM: PPC: Book3S HV: Remove code for
PPC970 processors"), KVM no longer supports processors that do not
have this feature, so the tlbie locking can be removed completely.
A sanity check for the feature is put in kvmppc_mmu_hv_init.

Testing was done on a POWER9 system in HPT mode, with a -smp 32 guest
in HPT mode. 32 instances of the powerpc fork benchmark from selftests
were run with --fork, and the results measured.

Without this patch, total throughput was about 13.5K/sec, and this is
the top of the host profile:

   74.52%  [k] do_tlbies
    2.95%  [k] kvmppc_book3s_hv_page_fault
    1.80%  [k] calc_checksum
    1.80%  [k] kvmppc_vcpu_run_hv
    1.49%  [k] kvmppc_run_core

After this patch, throughput was about 51K/sec, with this profile:

   21.28%  [k] do_tlbies
    5.26%  [k] kvmppc_run_core
    4.88%  [k] kvmppc_book3s_hv_page_fault
    3.30%  [k] _raw_spin_lock_irqsave
    3.25%  [k] gup_pgd_range
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Fix a mmio_host_swabbed uninitialized usage issue · f19d1f36

Authored by Simon Guo on May 07, 2018

When KVM emulates a VMX store, it invokes kvmppc_get_vmx_data() to
retrieve the VMX register value. kvmppc_get_vmx_data() checks
mmio_host_swabbed to decide which doubleword of vr[] to use. But
mmio_host_swabbed can be uninitialized during the VMX store procedure:

kvmppc_emulate_loadstore
	\- kvmppc_handle_store128_by2x64
		\- kvmppc_get_vmx_data

So vcpu->arch.mmio_host_swabbed is not meant to be used at all for
emulation of store instructions, and this patch makes that true for
VMX stores. This patch also initializes mmio_host_swabbed to avoid
possible future problems.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Move nip/ctr/lr/xer registers to pt_regs in kvm_vcpu_arch · 173c520a

Authored by Simon Guo on May 07, 2018

This patch moves the nip/ctr/lr/xer registers from scattered places in
kvm_vcpu_arch to the pt_regs structure.

The cr register is "unsigned long" in pt_regs and u32 in vcpu->arch.
It will need more consideration and may move in later patches.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Add pt_regs into kvm_vcpu_arch and move vcpu->arch.gpr[] into it · 1143a706

Authored by Simon Guo on May 07, 2018

The registers are currently scattered around the kvm_vcpu_arch
structure; it is neater to organize them into a pt_regs structure.

This will also enable reimplementation of the MMIO emulation code with
analyse_instr() later.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

17 May, 2018 · 14 commits

KVM: PPC: Book3S: Change return type to vm_fault_t · 16d5c39d

Authored by Souptick Joarder on May 10, 2018

Use the new return type vm_fault_t for the fault handler
in struct vm_operations_struct. For now, this is
just documenting that the function returns a
VM_FAULT value rather than an errno. Once all
instances are converted, vm_fault_t will become
a distinct type.

See commit 1c8f4220 ("mm: change return type to
vm_fault_t").
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S: Check KVM_CREATE_SPAPR_TCE_64 parameters · e45719af

Authored by Alexey Kardashevskiy on May 14, 2018

Although it does not seem possible to break the host by passing bad
parameters when creating a TCE table in KVM, it is still better to get
an early, clear indication of the problem than to debug the weird
effects it might cause.

This adds sanity checks that the page size is in the range 4KB..16GB,
which is what LoPAPR actually supports, and that the window actually
fits in the 64-bit space.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
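
The two sanity checks can be sketched as below. This is a model under stated assumptions, not the kernel's implementation: the function name is hypothetical, and `offset` and `size` are assumed to be counted in IOMMU pages.

```python
# Sketch only: models the two checks named in the commit message.
U64_MAX = (1 << 64) - 1


def tce_window_params_ok(page_shift: int, offset: int, size: int) -> bool:
    """offset and size are in IOMMU pages (an assumption of this sketch)."""
    # Page size must be between 4KB (2**12) and 16GB (2**34).
    if not 12 <= page_shift <= 34:
        return False
    # The window end, converted to bytes, must still fit in 64 bits.
    return offset + size <= (U64_MAX >> page_shift)
```

Comparing against `U64_MAX >> page_shift` rather than shifting the page count left avoids relying on overflow behaviour, which matters in C even though Python integers do not overflow.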

KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages · ca1fc489

Authored by Alexey Kardashevskiy on May 14, 2018

At the moment the host only supports the IOMMU page sizes which
the guest is aware of, which are 4KB/64KB/16MB. However, P9 does not
support 16MB IOMMU pages; 2MB and 1GB pages are supported instead. We
can still emulate bigger guest pages (for example 16MB) with smaller
host pages (4KB/64KB/2MB).

This allows the physical IOMMU pages to use a page size smaller than or
equal to the guest visible IOMMU page size.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
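
To make the page-size relationship concrete, the number of host IOMMU pages needed to back one guest IOMMU page follows directly from the two page shifts. The helper name is made up for this sketch.

```python
# Sketch only: arithmetic on page shifts (log2 of the page size).
def host_pages_per_guest_page(guest_shift: int, host_shift: int) -> int:
    if host_shift > guest_shift:
        raise ValueError("host IOMMU page must not be larger than the guest page")
    return 1 << (guest_shift - host_shift)


# One 16MB guest page (shift 24) backed by 4KB, 64KB and 2MB host pages.
counts = [host_pages_per_guest_page(24, s) for s in (12, 16, 21)]
```

So a single 16MB guest TCE entry would correspond to 4096, 256 or 8 host entries respectively in this model.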

KVM: PPC: Book3S: Use correct page shift in H_STUFF_TCE · c6b61661

Authored by Alexey Kardashevskiy on May 14, 2018

The other TCE handlers use page shift from the guest visible TCE table
(described by kvmppc_spapr_tce_iommu_table) so let's make H_STUFF_TCE
handlers do the same thing.

This should cause no behavioral change now, but soon we will allow
iommu_table::it_page_shift to differ from the emulated
table page size, so this will play a role.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Fix inaccurate comment · 48e70b1c

Authored by Paul Mackerras on April 19, 2018

We now have interrupts hard-disabled when coming back from
kvmppc_hv_entry_trampoline, so this changes the comment to reflect
that.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Set RWMR on POWER8 so PURR/SPURR count correctly · 7aa15842

Authored by Paul Mackerras on April 20, 2018

Although Linux doesn't use PURR and SPURR ((Scaled) Processor
Utilization of Resources Register), other OSes depend on them.
On POWER8 they count at a rate depending on whether the VCPU is
idle or running, the activity of the VCPU, and the value in the
RWMR (Region-Weighting Mode Register).  Hardware expects the
hypervisor to update the RWMR when a core is dispatched to reflect
the number of online VCPUs in the vcore.

This adds code to maintain a count in the vcore struct indicating
how many VCPUs are online.  In kvmppc_run_core we use that count
to set the RWMR register on POWER8.  If the core is split because
of a static or dynamic micro-threading mode, we use the value for
8 threads.  The RWMR value is not relevant when the host is
executing because Linux does not use the PURR or SPURR register,
so we don't bother saving and restoring the host value.

For the sake of old userspace which does not set the KVM_REG_PPC_ONLINE
register, we set online to 1 if it was 0 at the time of a KVM_RUN
ioctl.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Add 'online' register to ONE_REG interface · a1f15826

Authored by Paul Mackerras on April 20, 2018

This adds a new KVM_REG_PPC_ONLINE register which userspace can set
to 0 or 1 via the GET/SET_ONE_REG interface to indicate whether it
considers the VCPU to be offline (0), that is, not currently running,
or online (1). This will be used in a later patch to configure the
register which controls PURR and SPURR accumulation.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book 3S HV: Do ptesync in radix guest exit path · df158189

Authored by Paul Mackerras on May 17, 2018

A radix guest can execute tlbie instructions to invalidate TLB entries.
After a tlbie or a group of tlbies, it must then do the architected
sequence eieio; tlbsync; ptesync to ensure that the TLB invalidation
has been processed by all CPUs in the system before it can rely on
no CPU using any translation that it just invalidated.

In fact it is the ptesync which does the actual synchronization in
this sequence, and hardware has a requirement that the ptesync must
be executed on the same CPU thread as the tlbies which it is expected
to order. Thus, if a vCPU gets moved from one physical CPU to
another after it has done some tlbies but before it can get to do the
ptesync, the ptesync will not have the desired effect when it is
executed on the second physical CPU.

To fix this, we do a ptesync in the exit path for radix guests. If
there are any pending tlbies, this will wait for them to complete.
If there aren't, then ptesync will just do the same as sync.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority change · 9dc81d6b

Authored by Benjamin Herrenschmidt on May 10, 2018

When a vcpu priority (CPPR) is set to a lower value (masking more
interrupts), we stop processing interrupts already in the queue
for the priorities that have now been masked.

If those interrupts were previously re-routed to a different
CPU, they might still be stuck until the older one that has
them in its queue processes them. In the case of guest CPU
unplug, that can be never.

To address that without creating additional overhead for
the normal interrupt processing path, this changes H_CPPR
handling so that when such a priority change occurs, we
scan the interrupt queue for that vCPU, and for any
interrupt in there that has been re-routed, we replace it
with a dummy and force a re-trigger.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
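
The queue scan described in the last paragraph can be modelled in a few lines. Everything here is a simplification: the queue is a plain list, `DUMMY` is a hypothetical placeholder value, and `current_target` is an assumed mapping from interrupt number to the vCPU it is currently routed to.

```python
# Sketch only: a toy model of the scan done on a CPPR change.
DUMMY = None  # hypothetical placeholder left in the queue slot


def scan_queue_on_cppr_change(queue, this_vcpu, current_target):
    """Replace re-routed entries with DUMMY; return the irqs to re-trigger."""
    retrigger = []
    for slot, irq in enumerate(queue):
        if irq is DUMMY:
            continue
        if current_target[irq] != this_vcpu:   # re-routed to another vCPU
            queue[slot] = DUMMY
            retrigger.append(irq)
    return retrigger


queue = [10, 11, 12]
targets = {10: 0, 11: 1, 12: 0}   # irq 11 was re-routed to vCPU 1
to_retrigger = scan_queue_on_cppr_change(queue, 0, targets)
```

The scan only runs on the priority-change path, which is how the commit keeps the cost off the normal interrupt processing path.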

KVM: PPC: Book3S HV: Make radix clear pte when unmapping · 7e3d9a1d

Authored by Nicholas Piggin on May 09, 2018

The current partition table unmap code clears the _PAGE_PRESENT bit
out of the pte, which leaves pud_huge/pmd_huge true and does not
clear pud_present/pmd_present.  This can confuse subsequent page
faults and possibly lead to the guest looping doing continual
hypervisor page faults.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
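
Why clearing only the present bit is not enough can be shown with a toy model. The bit positions are assumed for illustration, and the two predicates are simplified stand-ins for the kernel's huge and present checks (the present check is modelled as a plain non-zero test, which is an assumption of this sketch).

```python
# Sketch only: illustrative bit positions and simplified predicates.
PAGE_PRESENT = 1 << 63
PAGE_PTE = 1 << 62        # marks a leaf (huge) entry at the pmd/pud level


def looks_huge(entry: int) -> bool:
    return bool(entry & PAGE_PTE)


def looks_populated(entry: int) -> bool:
    return entry != 0      # stand-in for the pmd_present/pud_present check


huge_entry = PAGE_PRESENT | PAGE_PTE | 0x200000
old_unmap = huge_entry & ~PAGE_PRESENT   # only the present bit cleared
new_unmap = 0                            # the whole entry cleared
```

In this model `old_unmap` still looks like a populated huge entry to a later fault, while `new_unmap` does not.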

KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in kvmppc_radix_tlbie_page · e2560b10

Authored by Nicholas Piggin on May 09, 2018

The standard eieio ; tlbsync ; ptesync must follow tlbie to ensure it
is ordered with respect to subsequent operations.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry · 57b8daa7

Authored by Paul Mackerras on April 20, 2018

Currently, the HV KVM guest entry/exit code adds the timebase offset
from the vcore struct to the timebase on guest entry, and subtracts
it on guest exit. Which is fine, except that it is possible for
userspace to change the offset using the SET_ONE_REG interface while
the vcore is running, as there is only one timebase offset per vcore
but potentially multiple VCPUs in the vcore. If that were to happen,
KVM would subtract a different offset on guest exit from that which
it had added on guest entry, leading to the timebase being out of sync
between cores in the host, which then leads to bad things happening
such as hangs and spurious watchdog timeouts.

To fix this, we add a new field 'tb_offset_applied' to the vcore struct
which stores the offset that is currently applied to the timebase.
This value is set from the vcore tb_offset field on guest entry, and
is what is subtracted from the timebase on guest exit. Since it is
zero when the timebase offset is not applied, we can simplify the
logic in kvmhv_start_timing and kvmhv_accumulate_time.

In addition, we had secondary threads reading the timebase while
running concurrently with code on the primary thread which would
eventually add or subtract the timebase offset from the timebase.
This occurred while saving or restoring the DEC register value on
the secondary threads. Although no specific incorrect behaviour has
been observed, this is a race which should be fixed. To fix it, we
move the DEC saving code to just before we call kvmhv_commence_exit,
and the DEC restoring code to after the point where we have waited
for the primary thread to switch the MMU context and add the timebase
offset. That way we are sure that the timebase contains the guest
timebase value in both cases.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
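
The snapshot idea behind `tb_offset_applied` can be sketched with a toy model in which the timebase is a plain integer that does not advance. The class and function names are illustrative, not the kernel's.

```python
# Sketch only: the timebase is modelled as a static integer.
class Vcore:
    def __init__(self, tb_offset=0):
        self.tb_offset = tb_offset      # userspace may change this at any time
        self.tb_offset_applied = 0      # what is currently added to the timebase


def guest_entry(vc, timebase):
    vc.tb_offset_applied = vc.tb_offset      # snapshot the offset on entry
    return timebase + vc.tb_offset_applied


def guest_exit(vc, timebase):
    timebase -= vc.tb_offset_applied         # subtract exactly what was added
    vc.tb_offset_applied = 0                 # zero means "no offset applied"
    return timebase


vc = Vcore(tb_offset=1000)
tb = guest_entry(vc, 50_000)
vc.tb_offset = 9999                          # SET_ONE_REG while the vcore runs
tb = guest_exit(vc, tb)
```

Because the exit path uses the snapshot rather than the live field, the host timebase in this model comes back to its original value even though the offset changed mid-run.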

powerpc/kvm: Prefer fault_in_pages_readable function · 9f9eae5c

Authored by Mathieu Malaterre on March 28, 2018

Directly use fault_in_pages_readable instead of manual __get_user code. Fix
warning treated as error with W=1:

arch/powerpc/kernel/kvm.c:675:6: error: variable ‘tmp’ set but not used [-Werror=unused-but-set-variable]
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/mm/radix: implement LPID based TLB flushes to be used by KVM · 0078778a

Authored by Nicholas Piggin on May 09, 2018

Implement a local TLB flush for invalidating an LPID with variants for
process or partition scope. And a global TLB flush for invalidating
a partition scoped page of an LPID.

These will be used by KVM in subsequent patches.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

15 May, 2018 · 1 commit

powerpc/kvm: Switch kvm pmd allocator to custom allocator · 21828c99

Authored by Aneesh Kumar K.V on April 16, 2018

In the next set of patches, we will switch the pmd allocator to use page fragments
and the locking will be updated to split pmd ptlock. We want to avoid using
fragments for partition-scoped tables. Use a slab cache, similar to the level 4 table.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

27 April, 2018 · 2 commits

powerpc/kvm/booke: Fix altivec related build break · b2d7ecbe

Authored by Laurentiu Tudor on April 26, 2018

Add missing "altivec unavailable" interrupt injection helper
thus fixing the linker error below:

arch/powerpc/kvm/emulate_loadstore.o: In function `kvmppc_check_altivec_disabled':
arch/powerpc/kvm/emulate_loadstore.c: undefined reference to `.kvmppc_core_queue_vec_unavail'

Fixes: 09f98496 ("KVM: PPC: Book3S: Add MMIO emulation for VMX instructions")
Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc: Fix deadlock with multiple calls to smp_send_stop · 6029755e

Authored by Nicholas Piggin on April 27, 2018

smp_send_stop can lock up the IPI path for any subsequent calls,
because the receiving CPUs spin in their handler function. This
started becoming a problem with the addition of an smp_send_stop
call in the reboot path, because panics can reboot after doing
their own smp_send_stop.

The NMI IPI variant was fixed with ac61c115 ("powerpc: Fix
smp_send_stop NMI IPI handling"), which leaves the smp_call_function
variant.

This is fixed by having smp_send_stop only ever do the
smp_call_function once. This is a bit less robust than the NMI IPI
fix, because any other call to smp_call_function after smp_send_stop
could deadlock, but that has always been the case, and it has not
been a problem before.

Fixes: f2748bdf ("powerpc/powernv: Always stop secondaries before reboot/shutdown")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

25 April, 2018 · 2 commits

powerpc: Fix smp_send_stop NMI IPI handling · ac61c115

Authored by Nicholas Piggin on April 25, 2018

The NMI IPI handler for a receiving CPU increments nmi_ipi_busy_count
over the handler function call, which causes later smp_send_nmi_ipi()
callers to spin until the call is finished.

The stop_this_cpu() function never returns, so the busy count is never
decremented, which can cause the system to hang in some cases. For
example panic() will call smp_send_stop() early on which calls
stop_this_cpu() on other CPUs, then later in the reboot path,
pnv_restart() will call smp_send_stop() again, which hangs.

Fix this by adding a special case to the stop_this_cpu() handler to
decrement the busy count, because it will never return.

Now that the NMI/non-NMI versions of stop_this_cpu() are different,
split them out into separate functions rather than doing #ifdef tricks
to share the body between the two functions.

Fixes: 6bed3237 ("powerpc: use NMI IPI for smp_send_stop")
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Split out the functions, tweak change log a bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

rtc: opal: Fix OPAL RTC driver OPAL_BUSY loops · 682e6b4d

Authored by Nicholas Piggin on April 10, 2018

The OPAL RTC driver does not sleep when it gets OPAL_BUSY or
OPAL_BUSY_EVENT from firmware, which causes large scheduling
latencies: up to 50 seconds have been observed here when the RTC stops
responding (a BMC reboot can do it).

Fix this by converting it to the standard form OPAL_BUSY loop that
sleeps.

Fixes: 628daa8d ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks")
Cc: stable@vger.kernel.org # v3.2+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
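
A busy loop "that sleeps" has roughly the following shape. This is a model, not the driver code: the return-code values, the delay, and the function names are assumptions made for illustration, and the firmware call, event polling and sleep are passed in so the sketch can run anywhere.

```python
# Sketch only: the shape of a retry loop that sleeps while firmware is busy.
import time

OPAL_SUCCESS = 0
OPAL_BUSY = -2          # illustrative values
OPAL_BUSY_EVENT = -12
BUSY_DELAY_S = 0.01     # assumed delay between retries


def call_with_busy_retry(firmware_call, poll_events=lambda: None,
                         sleep=time.sleep):
    """Retry firmware_call until it stops reporting a busy status."""
    rc = OPAL_BUSY
    while rc in (OPAL_BUSY, OPAL_BUSY_EVENT):
        rc = firmware_call()
        if rc == OPAL_BUSY_EVENT:
            sleep(BUSY_DELAY_S)     # yield the CPU instead of spinning
            poll_events()           # let firmware make progress
        elif rc == OPAL_BUSY:
            sleep(BUSY_DELAY_S)
    return rc
```

Sleeping between retries is what lets the scheduler run other tasks while the firmware is unresponsive, instead of holding the CPU for the whole wait.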

24 April, 2018 · 6 commits

powerpc/mce: Fix a bug where mce loops on memory UE. · 75ecfb49

Authored by Mahesh Salgaonkar on April 23, 2018

The current code extracts the physical address for UE errors and then
hooks it up into memory failure infrastructure. On successful
extraction of physical address it wrongly sets "handled = 1" which
means this UE error has been recovered. Since MCE handler gets return
value as handled = 1, it assumes that error has been recovered and
goes back to same NIP. This causes MCE interrupt again and again in a
loop leading to hard lockup.

Also, initialize phys_addr to ULONG_MAX so that we don't end up
queuing undesired page to hwpoison.

Without this patch we see:
  Severe Machine check interrupt [Recovered]
    NIP: [000000001002588c] PID: 7109 Comm: find
    Initiator: CPU
    Error type: UE [Load/Store]
      Effective address: 00007fffd2755940
      Physical address:  000020181a080000
  ...
  Severe Machine check interrupt [Recovered]
    NIP: [000000001002588c] PID: 7109 Comm: find
    Initiator: CPU
    Error type: UE [Load/Store]
      Effective address: 00007fffd2755940
      Physical address:  000020181a080000
  Severe Machine check interrupt [Recovered]
    NIP: [000000001002588c] PID: 7109 Comm: find
    Initiator: CPU
    Error type: UE [Load/Store]
      Effective address: 00007fffd2755940
      Physical address:  000020181a080000
  Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
  Memory failure: 0x20181a08: already hardware poisoned
  Memory failure: 0x20181a08: already hardware poisoned
  Memory failure: 0x20181a08: already hardware poisoned
  Memory failure: 0x20181a08: already hardware poisoned
  Memory failure: 0x20181a08: already hardware poisoned
  Memory failure: 0x20181a08: already hardware poisoned
  ...
  Watchdog CPU:38 Hard LOCKUP

After this patch we see:

  Severe Machine check interrupt [Not recovered]
    NIP: [00007fffaae585f4] PID: 7168 Comm: find
    Initiator: CPU
    Error type: UE [Load/Store]
      Effective address: 00007fffaafe28ac
      Physical address:  00002017c0bd0000
  find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
  Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered

Fixes: 01eaac2b ("powerpc/mce: Hookup ierror (instruction) UE errors")
Fixes: ba41e1e1 ("powerpc/mce: Hookup derror (load/store) UE errors")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
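
The relationship between the "Physical address" lines and the "Memory failure" lines in the logs above, and the role of the ULONG_MAX sentinel, can be sketched as follows. The helper name is hypothetical and the sketch assumes 64KB pages (a page shift of 16), which is consistent with the numbers in the logs.

```python
# Sketch only. Assumption: 64KB pages, so the page frame number is the
# physical address shifted right by 16.
PAGE_SHIFT = 16
ULONG_MAX = (1 << 64) - 1


def pfn_to_poison(phys_addr: int):
    """Return the page frame number to queue for hwpoison, or None."""
    if phys_addr == ULONG_MAX:       # sentinel: no physical address extracted
        return None
    return phys_addr >> PAGE_SHIFT


print(hex(pfn_to_poison(0x000020181A080000)))  # prints 0x20181a08
```

Starting from the sentinel means a failed extraction queues nothing, rather than queuing whatever page an uninitialized value happens to name.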

powerpc/powernv/npu: Do a PID GPU TLB flush when invalidating a large address range · d0cf9b56

Authored by Alistair Popple on April 17, 2018

The NPU has a limited number of address translation shootdown (ATSD)
registers and the GPU has limited bandwidth to process ATSDs. This can
result in contention of ATSD registers leading to soft lockups on some
threads, particularly when invalidating a large address range in
pnv_npu2_mn_invalidate_range().

At some threshold it becomes more efficient to flush the entire GPU
TLB for the given MM context (PID) than individually flushing each
address in the range. This patch will result in ranges greater than
2MB being converted from 32+ ATSDs into a single ATSD which will flush
the TLB for the given PID on each GPU.

Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Tested-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
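
The threshold decision can be modelled as below. The 2MB threshold comes from the commit message; the 64KB page size is an assumption chosen because it makes 2MB equal to 32 pages, matching the "32+ ATSDs" figure. The function name and return format are made up for the sketch.

```python
# Sketch only: choose between per-page ATSDs and one PID-wide flush.
ATSD_THRESHOLD = 2 * 1024 * 1024   # 2MB, from the commit message
PAGE_SIZE = 64 * 1024              # assumption: 64KB pages


def atsd_plan(start: int, end: int):
    """Return ('pid', 1) for one PID-wide flush, else ('range', n_pages)."""
    length = end - start
    if length > ATSD_THRESHOLD:
        return ("pid", 1)
    pages = -(-length // PAGE_SIZE)    # ceiling division
    return ("range", pages)
```

A range of exactly 2MB still takes the per-page path in this model; anything larger collapses to a single flush for the PID.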

powerpc/powernv/npu: Prevent overwriting of pnv_npu2_init_contex() callback parameters · a1409ada

Authored by Alistair Popple on April 11, 2018

There is a single npu context per set of callback parameters. Callers
should be prevented from overwriting existing callback values so
instead return an error if different parameters are passed.

Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Mark Hairgrove <mhairgrove@nvidia.com>
Tested-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/powernv/npu: Add lock to prevent race in concurrent context init/destroy · 28a5933e

Authored by Alistair Popple on April 11, 2018

The pnv_npu2_init_context() and pnv_npu2_destroy_context() functions
are used to allocate/free contexts to allow address translation and
shootdown by the NPU on a particular GPU. Context initialisation is
implicitly safe as it is protected by the requirement mmap_sem be held
in write mode, however pnv_npu2_destroy_context() does not require
mmap_sem to be held and it is not safe to call with a concurrent
initialisation for a different GPU.

It was assumed the driver would ensure destruction was not called
concurrently with initialisation. However the driver may be simplified
by allowing concurrent initialisation and destruction for different
GPUs. As npu context creation/destruction is not a performance
critical path and the critical section is not large a single spinlock
is used for simplicity.

Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Mark Hairgrove <mhairgrove@nvidia.com>
Tested-by: Mark Hairgrove <mhairgrove@nvidia.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

powerpc/powernv/memtrace: Let the arch hotunplug code flush cache · 7fd6641d

Authored by Balbir Singh on April 06, 2018

Don't do this via custom code, instead now that we have support in the
arch hotplug/hotunplug code, rely on those routines to do the right
thing.

The existing flush doesn't work because it uses ppc64_caches.l1d.size
instead of ppc64_caches.l1d.line_size.

Fixes: 9d5171a8 ("powerpc/powernv: Enable removal of memory for in memory tracing")
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
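
The effect of stepping by the cache size instead of the line size can be seen by counting how many addresses a flush loop would touch. The cache geometry below (32KB L1D, 128-byte lines) and the 1MB region are assumed example values, and the helper is hypothetical.

```python
# Sketch only: count the addresses a flush loop visits for a given stride.
def flush_addresses(start: int, length: int, step: int) -> range:
    return range(start, start + length, step)


L1D_SIZE = 32 * 1024       # assumed example values
L1D_LINE_SIZE = 128
REGION = 1024 * 1024       # 1MB region to flush

buggy = len(flush_addresses(0, REGION, L1D_SIZE))        # stride = cache size
correct = len(flush_addresses(0, REGION, L1D_LINE_SIZE)) # stride = line size
```

With these numbers the buggy stride visits 32 addresses where 8192 cache lines need flushing, so almost all of the region is skipped.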

powerpc/mm: Flush cache on memory hot(un)plug · fb5924fd

Authored by Balbir Singh on April 06, 2018

This patch adds support for flushing potentially dirty cache lines
when memory is hot-plugged/hot-un-plugged. The support is currently
limited to 64 bit systems.

The bug was exposed when mappings for a device were actually
hot-unplugged and plugged in back later. A similar issue was observed
during the development of memtrace, but memtrace does it's own
flushing of region via a custom routine.

These patches do a flush on both hotplug and unplug to clear any stale
data in the cache w.r.t. mappings. There is a small race window where a
clean cache line may be created again just prior to tearing down the
mapping.

The patches were tested by disabling the flush routines in memtrace
and doing I/O on the trace file. The system immediately
checkstops (quite reliably if, prior to the hot-unplug of the memtrace
region, we memset the regions we are about to hot-unplug). After these
patches no custom flushing is needed in the memtrace code.

Fixes: 9d5171a8 ("powerpc/powernv: Enable removal of memory for in memory tracing")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Reza Arbab <arbab@linux.ibm.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

fb5924fd

21 Apr, 2018 1 commit

proc: fix /proc/loadavg regression · 9a1015b3

Committed by Alexey Dobriyan on Apr 20, 2018

Commit 95846ecf ("pid: replace pid bitmap implementation with IDR
API") changed the last field of /proc/loadavg (last pid allocated) to
be off by one:

	# unshare -p -f --mount-proc cat /proc/loadavg
	0.00 0.00 0.00 1/60 2	<===

It should be 1 after the first fork into a pid namespace.
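
For readers who want to check the field programmatically, the following
is a small sketch of extracting it from one /proc/loadavg line.
loadavg_last_pid() is a hypothetical helper, not part of the kernel or
of the /proc testsuite, and it assumes the usual five-field layout
"load1 load5 load15 running/total lastpid".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse one /proc/loadavg line and return its last field (the most
 * recently allocated PID), or -1 if the line does not match the
 * assumed "a b c running/total lastpid" layout.
 */
static int loadavg_last_pid(const char *line)
{
	double l1, l5, l15;
	int running, total, last_pid;

	assert(line != NULL);
	if (sscanf(line, "%lf %lf %lf %d/%d %d",
		   &l1, &l5, &l15, &running, &total, &last_pid) != 6)
		return -1;
	return last_pid;
}
```

Fed the line shown above, the helper returns 2, which is the off-by-one
value this commit corrects.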

This is formally a regression, but given how useless this field is, I
don't think anyone is affected.

The bug was found by the /proc testsuite!

Link: http://lkml.kernel.org/r/20180413175408.GA27246@avx2
Fixes: 95846ecf ("pid: replace pid bitmap implementation with IDR API")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gargi Sharma <gs051095@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9a1015b3

19 Apr, 2018 1 commit

powerpc/kvm: Fix lockups when running KVM guests on Power8 · 56376c58

Committed by Michael Ellerman on Apr 19, 2018

When running KVM guests on Power8 we can see a lockup where one CPU
stops responding. This often leads to a message such as:

  watchdog: CPU 136 detected hard LOCKUP on other CPUs 72
  Task dump for CPU 72:
  qemu-system-ppc R  running task    10560 20917  20908 0x00040004

And then backtraces on other CPUs, such as:

  Task dump for CPU 48:
  ksmd            R  running task    10032  1519      2 0x00000804
  Call Trace:
    ...
    --- interrupt: 901 at smp_call_function_many+0x3c8/0x460
        LR = smp_call_function_many+0x37c/0x460
    pmdp_invalidate+0x100/0x1b0
    __split_huge_pmd+0x52c/0xdb0
    try_to_unmap_one+0x764/0x8b0
    rmap_walk_anon+0x15c/0x370
    try_to_unmap+0xb4/0x170
    split_huge_page_to_list+0x148/0xa30
    try_to_merge_one_page+0xc8/0x990
    try_to_merge_with_ksm_page+0x74/0xf0
    ksm_scan_thread+0x10ec/0x1ac0
    kthread+0x160/0x1a0
    ret_from_kernel_thread+0x5c/0x78

This is caused by commit 8c1c7fb0 ("powerpc/64s/idle: avoid sync
for KVM state when waking from idle"), which added a check in
pnv_powersave_wakeup() to see if the kvm_hstate.hwthread_state is
already set to KVM_HWTHREAD_IN_KERNEL, and if so to skip the store and
test of kvm_hstate.hwthread_req.

The problem is that the primary does not set KVM_HWTHREAD_IN_KVM when
entering the guest, so it can then come out to cede with
KVM_HWTHREAD_IN_KERNEL set. It can then go idle in kvm_do_nap after
setting hwthread_req to 1, but because hwthread_state is still
KVM_HWTHREAD_IN_KERNEL we will skip the test of hwthread_req when we
wake up from idle and won't go to kvm_start_guest. From there the
thread will return to a garbage address and crash.

Fix it by skipping the store of hwthread_state, but not the test of
hwthread_req, when coming out of idle. It's OK to skip the sync in
that case because hwthread_req will have been set on the same thread,
so there is no synchronisation required.
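
The difference between the two behaviours can be modelled in plain C.
This is only a sketch of the logic described above: the real code is
assembly in the idle wakeup path, memory barriers are omitted, and
struct hstate_model, the enum values and both function names are
hypothetical. A return value of true stands for "branch to
kvm_start_guest".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the KVM_HWTHREAD_* states. */
enum { HWTHREAD_IN_KERNEL, HWTHREAD_IN_KVM };

/* Hypothetical model of the two kvm_hstate fields involved. */
struct hstate_model {
	int hwthread_state;
	int hwthread_req;
};

/* Before the fix: an IN_KERNEL state skips the store AND the test. */
static bool wakeup_enters_guest_buggy(struct hstate_model *hs)
{
	assert(hs != NULL);
	if (hs->hwthread_state == HWTHREAD_IN_KERNEL)
		return false;	/* hwthread_req is never looked at */
	hs->hwthread_state = HWTHREAD_IN_KERNEL;
	return hs->hwthread_req != 0;
}

/* After the fix: only the store is skipped; the test always runs. */
static bool wakeup_enters_guest_fixed(struct hstate_model *hs)
{
	assert(hs != NULL);
	if (hs->hwthread_state != HWTHREAD_IN_KERNEL)
		hs->hwthread_state = HWTHREAD_IN_KERNEL;
	return hs->hwthread_req != 0;
}
```

With hwthread_state already IN_KERNEL and hwthread_req set to 1, the
first variant returns false and the pending request is missed, while
the second returns true.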

Fixes: 8c1c7fb0 ("powerpc/64s/idle: avoid sync for KVM state when waking from idle")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

56376c58
