提交 · ebaac1736245e78109cd47d453a86a18dcfc94b8 · openeuler / Kernel

24 5月, 2016 1 次提交

vdso: make arch_setup_additional_pages wait for mmap_sem for write killable · 69048176

由 Michal Hocko 提交于 5月 23, 2016

most architectures are relying on mmap_sem for write in their
arch_setup_additional_pages.  If the waiting task gets killed by the oom
killer it would block oom_reaper from asynchronous address space reclaim
and reduce the chances of timely OOM resolving.  Wait for the lock in
the killable mode and return with EINTR if the task got killed while
waiting.
Signed-off-by: NMichal Hocko <mhocko@suse.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>	[x86 vdso]
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

69048176

21 5月, 2016 2 次提交

printk/nmi: generic solution for safe printk in NMI · 42a0bb3f

由 Petr Mladek 提交于 5月 20, 2016

printk() takes some locks and could not be used a safe way in NMI
context.

The chance of a deadlock is real especially when printing stacks from
all CPUs.  This particular problem has been addressed on x86 by the
commit a9edc880 ("x86/nmi: Perform a safe NMI stack trace on all
CPUs").

The patchset brings two big advantages.  First, it makes the NMI
backtraces safe on all architectures for free.  Second, it makes all NMI
messages almost safe on all architectures (the temporary buffer is
limited.  We still should keep the number of messages in NMI context at
minimum).

Note that there already are several messages printed in NMI context:
WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
handlers.  These are not easy to avoid.

This patch reuses most of the code and makes it generic.  It is useful
for all messages and architectures that support NMI.

The alternative printk_func is set when entering and is reseted when
leaving NMI context.  It queues IRQ work to copy the messages into the
main ring buffer in a safe context.

__printk_nmi_flush() copies all available messages and reset the buffer.
Then we could use a simple cmpxchg operations to get synchronized with
writers.  There is also used a spinlock to get synchronized with other
flushers.

We do not longer use seq_buf because it depends on external lock.  It
would be hard to make all supported operations safe for a lockless use.
It would be confusing and error prone to make only some operations safe.

The code is put into separate printk/nmi.c as suggested by Steven
Rostedt.  It needs a per-CPU buffer and is compiled only on
architectures that call nmi_enter().  This is achieved by the new
HAVE_NMI Kconfig flag.

The are MN10300 and Xtensa architectures.  We need to clean up NMI
handling there first.  Let's do it separately.

The patch is heavily based on the draft from Peter Zijlstra, see

  https://lkml.org/lkml/2015/6/10/327

[arnd@arndb.de: printk-nmi: use %zu format string for size_t]
[akpm@linux-foundation.org: min_t->min - all types are size_t here]
Signed-off-by: NPetr Mladek <pmladek@suse.com>
Suggested-by: NPeter Zijlstra <peterz@infradead.org>
Suggested-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>	[arm part]
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jiri Kosina <jkosina@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

42a0bb3f

exit_thread: remove empty bodies · 5f56a5df

由 Jiri Slaby 提交于 5月 20, 2016

Define HAVE_EXIT_THREAD for archs which want to do something in
exit_thread. For others, let's define exit_thread as an empty inline.

This is a cleanup before we change the prototype of exit_thread to
accept a task parameter.

[akpm@linux-foundation.org: fix mips]
Signed-off-by: NJiri Slaby <jslaby@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5f56a5df

20 5月, 2016 2 次提交

arch: fix has_transparent_hugepage() · fd8cfd30

由 Hugh Dickins 提交于 5月 19, 2016

I've just discovered that the useful-sounding has_transparent_hugepage()
is actually an architecture-dependent minefield: on some arches it only
builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
not, but on some of those (arm and arm64) it then gives the wrong
answer; and on mips alone it's marked __init, which would crash if
called later (but so far it has not been called later).

Straighten this out: make it available to all configs, with a sensible
default in asm-generic/pgtable.h, removing its definitions from those
arches (arc, arm, arm64, sparc, tile) which are served by the default,
adding #define has_transparent_hugepage has_transparent_hugepage to
those (mips, powerpc, s390, x86) which need to override the default at
runtime, and removing the __init from mips (but maybe that kind of code
should be avoided after init: set a static variable the first time it's
called).
Signed-off-by: NHugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com>		[arch/arc]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[arch/s390]
Acked-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fd8cfd30

powerpc: mm: use hugetlb_bad_size() · 71bf79cc

由 Vaishali Thakkar 提交于 5月 19, 2016

Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
hugepage size is found.
Signed-off-by: NVaishali Thakkar <vaishali.thakkar@oracle.com>
Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

71bf79cc

19 5月, 2016 1 次提交

dax: enable dax in the presence of known media errors (badblocks) · 0a70bd43

由 Dan Williams 提交于 2月 24, 2016

1/ If a mapping overlaps a bad sector fail the request.

2/ Do not opportunistically report more dax-capable capacity than is
   requested when errors present.
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>

0a70bd43

17 5月, 2016 9 次提交

perf core: Add perf_callchain_store_context() helper · 3e4de4ec

由 Arnaldo Carvalho de Melo 提交于 5月 12, 2016

We need have different helpers to account how many contexts we have in
the sample and for real addresses, so do it now as a prep patch, to
ease review.

Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-q964tnyuqrxw5gld18vizs3c@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

3e4de4ec

perf core: Add a 'nr' field to perf_event_callchain_context · 3b1fff08

由 Arnaldo Carvalho de Melo 提交于 5月 10, 2016

We will use it to count how many addresses are in the entry->ip[] array,
excluding PERF_CONTEXT_{KERNEL,USER,etc} entries, so that we can really
return the number of entries specified by the user via the relevant
sysctl, kernel.perf_event_max_contexts, or via the per event
perf_event_attr.sample_max_stack knob.

This way we keep the perf_sample->ip_callchain->nr meaning, that is the
number of entries, be it real addresses or PERF_CONTEXT_ entries, while
honouring the max_stack knobs, i.e. the end result will be max_stack
entries if we have at least that many entries in a given stack trace.

Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-s8teto51tdqvlfhefndtat9r@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

3b1fff08

perf core: Pass max stack as a perf_callchain_entry context · cfbcf468

由 Arnaldo Carvalho de Melo 提交于 4月 28, 2016

This makes perf_callchain_{user,kernel}() receive the max stack
as context for the perf_callchain_entry, instead of accessing
the global sysctl_perf_event_max_stack.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

cfbcf468

powerpc/86xx: Fix PCI interrupt map definition · 1eef33be

由 Alessio Igor Bogani 提交于 5月 03, 2016

Fix PCI interrupt map definition from 2 to 4 cells. Move
interrupt-map and interrupt-map-mask and clone interrupts
into the pcie child nodes.
Signed-off-by: NAlessio Igor Bogani <alessio.bogani@elettra.eu>
Signed-off-by: NScott Wood <oss@buserror.net>

1eef33be

A
powerpc/86xx: Move pci1 definition to the include file · a66639d4
由 Alessio Igor Bogani 提交于 4月 20, 2016
```
Signed-off-by: NAlessio Igor Bogani <alessio.bogani@elettra.eu>
Signed-off-by: NScott Wood <oss@buserror.net>
```
a66639d4

powerpc/fsl: Fix build of the dtb embedded kernel images · 4680c9d5

由 Alessio Igor Bogani 提交于 4月 18, 2016

Commit dc37374b ("powerpc/fsl: Move Freescale device tree files
into fsl folder") moved a lot of device tree files into fsl directory,
fixing Makefile for cuImage target only.  Unfortunately there are other
targets which require embedding a device tree into the kernel image
(e.g. dtbImage.%).  So use a more generic approach.
Signed-off-by: NAlessio Igor Bogani <alessio.bogani@elettra.eu>
[scottwood: cleaned up commit message]
Signed-off-by: NScott Wood <oss@buserror.net>

4680c9d5

powerpc/fsl: Fix rcpm compatible string · d2d79dcc

由 Chenhui Zhao 提交于 4月 15, 2016

For T1040, T1042, T1023, and T1024, they should use the compatible
string "fsl,qoriq-rcpm-2.1".
Signed-off-by: NChenhui Zhao <chenhui.zhao@nxp.com>
Signed-off-by: NScott Wood <oss@buserror.net>

d2d79dcc

powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC · ed8fd100

由 Scott Wood 提交于 3月 08, 2016

This dependency led to kconfig errors when MTD_NAND_FSL_ELBC was
enabled, which selects FSL_LBC, in the absence of FSL_SOC, as reported
in http://patchwork.ozlabs.org/patch/564405/

It was originally suggested to add an FSL_SOC dependency to
MTD_NAND_FSL_ELBC, but the FSL_SOC symbol has been a growing problem
due to hardware being shared between PPC and ARM SoCs.  Even though
eLBC isn't found on ARM SoCs (the newer IFC is used instead), I don't
want to expand the use of FSL_SOC for things other than functions
exported by fsl_soc.c.  In particular, it would be odd to add it to
MTD_NAND_FSL_ELBC and then remove it from MTD_NAND_FSL_IFC.

Removing artificial dependencies also helps get compile-test exposure
via randconfig, allyesconfig, etc.
Reported-by: NBrian Norris <computersforpeace@gmail.com>
Cc: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: NScott Wood <oss@buserror.net>

ed8fd100

bpf: split HAVE_BPF_JIT into cBPF and eBPF variant · 6077776b

由 Daniel Borkmann 提交于 5月 13, 2016

Split the HAVE_BPF_JIT into two for distinguishing cBPF and eBPF JITs.

Current cBPF ones:

  # git grep -n HAVE_CBPF_JIT arch/
  arch/arm/Kconfig:44:    select HAVE_CBPF_JIT
  arch/mips/Kconfig:18:   select HAVE_CBPF_JIT if !CPU_MICROMIPS
  arch/powerpc/Kconfig:129:       select HAVE_CBPF_JIT
  arch/sparc/Kconfig:35:  select HAVE_CBPF_JIT

Current eBPF ones:

  # git grep -n HAVE_EBPF_JIT arch/
  arch/arm64/Kconfig:61:  select HAVE_EBPF_JIT
  arch/s390/Kconfig:126:  select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
  arch/x86/Kconfig:94:    select HAVE_EBPF_JIT                    if X86_64

Later code also needs this facility to check for eBPF JITs.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6077776b

16 5月, 2016 2 次提交

powerpc/fsl-pci: Add a workaround for PCI 5 errata · a8165d42

由 chenhui zhao 提交于 1月 15, 2016

Issue:
As a master, the PCI IP block can combine a memory write to the last PCI
double word (4 bytes) of a cacheline with a 4 byte memory write to the
first PCI double word of the subsequent cacheline. This affects 32-bit
PCI target devices that blindly assert STOP on memory-write transactions,
without detecting that the data beat being transferred is the last data
beat of the transaction. It can cause a hang. PCI-X operation is not
affected by this erratum.

Workaround:
Setting the bit MDS in the PCI Bus Function Register will disable the
combining of crossing cacheline boundary requests into one burst
transaction. Therefore, it can prevent the errata scenario from
occurring.

This errata exists in MPC8543, MPC8543E, MPC8545, MPC8545E, MPC8547,
MPC8547E, MPC8548 and MPC8548E. Refer to PCI 5 in MPC8548 errata
document.
Signed-off-by: NZhao Chenhui <chenhui.zhao@freescale.com>
Signed-off-by: NZhiqiang Hou <Zhiqiang.Hou@freescale.com>
[scottwood: whitespace fix]
Signed-off-by: NScott Wood <oss@buserror.net>

a8165d42

powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb · 6a369fa2

由 Hou Zhiqiang 提交于 1月 13, 2016

On the t208xrdb and t1040rdb, the SPI device is n25q512ax3
instead of n25q512a.
Signed-off-by: NHou Zhiqiang <Zhiqiang.Hou@freescale.com>
Signed-off-by: NScott Wood <oss@buserror.net>

6a369fa2

13 5月, 2016 2 次提交

KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2

由 Christian Borntraeger 提交于 5月 13, 2016

Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.

For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As  KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.

This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.

This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3491caf2

coredump: get rid of coredump_params->written · a0083939

由 Omar Sandoval 提交于 5月 11, 2016

cprm->written is redundant with cprm->file->f_pos, so use that instead.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

a0083939

12 5月, 2016 12 次提交

powerpc/4xx: Device tree update for the 460ex DWC SATA · 951b8e4e

由 Andy Shevchenko 提交于 4月 26, 2016

Device tree update for the Applied micro processor 460ex on-chip SATA to use
"dmas" property.
Acked-by: NRob Herring <robh@kernel.org>
Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

951b8e4e

powerpc/powernv/npu: Add PE to PHB's list · 1d4e89cf

由 Alexey Kardashevskiy 提交于 5月 12, 2016

Before commit 3e68dc57 "powerpc/powernv: Remove DMA32 PE list", NPU PEs
were linked to the NPU PHB via phb->ioda.pe_dma_list; after that fix,
the phb->ioda.pe_list is used.

During the pe_dma_list removal, list_add_tail(&phb->ioda.pe_dma_list)
was removed, however no list_add() was added so does this patch.

Fixes: 3e68dc57219a ("powerpc/powernv: Remove DMA32 PE list")
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

1d4e89cf

powerpc/powernv: Fix insufficient memory allocation · 92a86756

由 Alexey Kardashevskiy 提交于 5月 12, 2016

The pnv_pci_init_ioda_phb() helper allocates a blob to store auxilary
data such PE and M32/M64 segment allocation maps; this single blob has
few partitions, size of each is derived from the PE number -
phb->ioda.total_pe_num.

It was assumed that the minimum PE number is 8, however it is 4 for NPU
so the pe_alloc part was missing in the allocated blob. It was invisible
till recently as we were not tracking used M64 segments and NPUs do not
use M32 segments so the phb->ioda.m32_segmap (which was pointing to the
same address as phb->ioda.pe_alloc) has never been written to leaving
the pe_alloc memory intact.

After commit 401203ac2d "powerpc/powernv: Track M64 segment consumption"
the pe_alloc gets corrupted and PE allocation cannot work. This fixes
the issue by enforcing the minimum PE number to 8.

Fixes: 401203ac2d15 ("powerpc/powernv: Track M64 segment consumption")
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

92a86756

powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism · 8445a87f

由 Guilherme G. Piccoli 提交于 4月 11, 2016

Commit 39baadbf ("powerpc/eeh: Remove eeh information from pci_dn")
changed the pci_dn struct by removing its EEH-related members.
As part of this clean-up, DDW mechanism was modified to read the device
configuration address from eeh_dev struct.

As a consequence, now if we disable EEH mechanism on kernel command-line
for example, the DDW mechanism will fail, generating a kernel oops by
dereferencing a NULL pointer (which turns to be the eeh_dev pointer).

This patch just changes the configuration address calculation on DDW
functions to a manual calculation based on pci_dn members instead of
using eeh_dev-based address.

No functional changes were made. This was tested on pSeries, both
in PHyp and qemu guest.

Fixes: 39baadbf ("powerpc/eeh: Remove eeh information from pci_dn")
Cc: stable@vger.kernel.org # v3.4+
Reviewed-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: NGuilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

8445a87f

Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" · c2078d9e

由 Guilherme G. Piccoli 提交于 4月 11, 2016

This reverts commit 89a51df5.

The function eeh_add_device_early() is used to perform EEH
initialization in devices added later on the system, like in
hotplug/DLPAR scenarios. Since the commit 89a51df5 ("powerpc/eeh:
Fix crash in eeh_add_device_early() on Cell") a new check was introduced
in this function - Cell has no EEH capabilities which led to kernel oops
if hotplug was performed, so checking for eeh_enabled() was introduced
to avoid the issue.

However, in architectures that EEH is present like pSeries or PowerNV,
we might reach a case in which no PCI devices are present on boot time
and so EEH is not initialized. Then, if a device is added via DLPAR for
example, eeh_add_device_early() fails because eeh_enabled() is false,
and EEH end up not being enabled at all.

This reverts the aforementioned patch since a new verification was
introduced by the commit d91dafc0 ("powerpc/eeh: Delay probing EEH
device during hotplug") and so the original Cell issue does not happen
anymore.

Cc: stable@vger.kernel.org # v4.1+
Reviewed-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: NGuilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

c2078d9e

powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner() · d6d63d72

由 Gavin Shan 提交于 4月 27, 2016

The label "reset" in eeh_pe_change_owner() is used only for once.
No need to keep it and just drop it. No logical changes introduced.
Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Reviewed-by: NRussell Currey <ruscur@russell.cc>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

d6d63d72

powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover() · 2efc771f

由 Gavin Shan 提交于 4月 27, 2016

The function eeh_pe_reset_and_recover() is used to recover EEH
error when the passthrough device are transferred to guest and
backwards, meaning the device's driver is vfio-pci or none. In
both cases, the handlers triggered by eeh_report_reset() and
eeh_report_resume() shouldn't be called.

This ignores the error handlers from eeh_report_reset() and
eeh_report_resume().
Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: NRussell Currey <ruscur@russell.cc>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

2efc771f

powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover() · 5a0cdbfd

由 Gavin Shan 提交于 4月 27, 2016

The function eeh_pe_reset_and_recover() is used to recover EEH
error when the passthrou device are transferred to guest and
backwards. The content in the device's config space will be lost
on PE reset issued in the middle of the recovery. The function
saves/restores it before/after the reset. However, config access
to some adapters like Broadcom BCM5719 at this point will causes
fenced PHB. The config space is always blocked and we save 0xFF's
that are restored at late point. The memory BARs are totally
corrupted, causing another EEH error upon access to one of the
memory BARs.

This restores the config space on those adapters like BCM5719
from the content saved to the EEH device when it's populated,
to resolve above issue.

Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
Cc: stable@vger.kernel.org #v3.18+
Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: NRussell Currey <ruscur@russell.cc>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

5a0cdbfd

powerpc/eeh: Don't report error in eeh_pe_reset_and_recover() · affeb0f2

由 Gavin Shan 提交于 4月 27, 2016

The function eeh_pe_reset_and_recover() is used to recover EEH
error when the passthrough device are transferred to guest and
backwards, meaning the device's driver is vfio-pci or none.
When the driver is vfio-pci that provides error_detected() error
handler only, the handler simply stops the guest and it's not
expected behaviour. On the other hand, no error handlers will
be called if we don't have a bound driver.

This ignores the error handler in eeh_pe_reset_and_recover()
that reports the error to device driver to avoid the exceptional
behaviour.

Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
Cc: stable@vger.kernel.org #v3.18+
Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: NRussell Currey <ruscur@russell.cc>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

affeb0f2

Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()" · 848912e5

由 Michael Ellerman 提交于 5月 12, 2016

This reverts commit c8ceacc2.

Gavin says: I missed the fact that it affects the PCI passthrou path as
reported by Alexey: When passing GPU (0003:01:00.0) which seats behind
the root port, the reset request is routed to skiboot in original code.
In skiboot, the link bouncing events are masked during the reset. So we
don't see EEH (freeze all) error even link bouncing happens. With the
changes included, the reset is done by kernel and the link bouncing
events aren't masked by altering content of PHB3 (or P7IOC) specific
hardware registers which are invisible to kernel (skiboot hides the
hardware specific). It means the link bouncing is seen by the root port
and it causes a EEH (freeze all) error. The PCI passthrough on GPU
device cannot work.
Requested-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Requested-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

848912e5

KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts · b1a4286b

由 Paul Mackerras 提交于 5月 04, 2016

Commit c9a5ecca ("kvm/eventfd: add arch-specific set_irq",
2015-10-16) added the possibility for architecture-specific code
to handle the generation of virtual interrupts in atomic context
where possible, without having to schedule a work function.

Since we can easily generate virtual interrupts on XICS without
having to do anything worse than take a spinlock, we define a
kvm_arch_set_irq_inatomic() for XICS.  We also remove kvm_set_msi()
since it is not used any more.

The one slightly tricky thing is that with the new interface, we
don't get told whether the interrupt is an MSI (or other edge
sensitive interrupt) vs. level-sensitive.  The difference as far
as interrupt generation is concerned is that for LSIs we have to
set the asserted flag so it will continue to fire until it is
explicitly cleared.

In fact the XICS code gets told which interrupts are LSIs by userspace
when it configures the interrupt via the KVM_DEV_XICS_GRP_SOURCES
attribute group on the XICS device.  To store this information, we add
a new "lsi" field to struct ics_irq_state.  With that we can also do a
better job of returning accurate values when reading the attribute
group.
Signed-off-by: NPaul Mackerras <paulus@samba.org>

b1a4286b

kvm: introduce KVM_MAX_VCPU_ID · 0b1b1dfd

由 Greg Kurz 提交于 5月 09, 2016

The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and
also the upper limit for vCPU ids. This is okay for all archs except PowerPC
which can have higher ids, depending on the cpu/core/thread topology. In the
worst case (single threaded guest, host with 8 threads per core), it limits
the maximum number of vCPUS to KVM_MAX_VCPUS / 8.

This patch separates the vCPU numbering from the total number of vCPUs, with
the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids
plus one.

The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
before passing them to KVM_CREATE_VCPU.

This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
Other archs continue to return KVM_MAX_VCPUS instead.
Suggested-by: NRadim Krcmar <rkrcmar@redhat.com>
Signed-off-by: NGreg Kurz <gkurz@linux.vnet.ibm.com>
Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0b1b1dfd

11 5月, 2016 9 次提交

powerpc/powernv/npu: Enable NVLink pass through · b5cb9ab1

由 Alexey Kardashevskiy 提交于 4月 29, 2016

IBM POWER8 NVlink systems come with Tesla K40-ish GPUs each of which
also has a couple of fast speed links (NVLink). The interface to links
is exposed as an emulated PCI bridge which is included into the same
IOMMU group as the corresponding GPU.

In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE
which behave pretty much as the standard IODA2 PHB except NPU PHB has
just a single TVE in the hardware which means it can have either
32bit window or 64bit window or DMA bypass but never two of these.

In order to make these links work when GPU is passed to the guest,
these bridges need to be passed as well; otherwise performance will
degrade.

This implements and exports API to manage NPU state in regard to VFIO;
it replicates iommu_table_group_ops.

This defines a new pnv_pci_ioda2_npu_ops which is assigned to
the IODA2 bridge if there are NPUs for a GPU on the bridge.
The new callbacks call the default IODA2 callbacks plus new NPU API.
This adds a gpe_table_group_to_npe() helper to find NPU PE for the IODA2
table_group, it is not expected to fail as the helper is only called
from the pnv_pci_ioda2_npu_ops.

This does not define NPU-specific .release_ownership() so after
VFIO is finished, DMA on NPU is disabled which is ok as the nvidia
driver sets DMA mask when probing which enable 32 or 64bit DMA on NPU.

This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to
the GPU group if any found. The helper uses helpers to look for
the "ibm,gpu" property in the device tree which is a phandle of
the corresponding GPU.

This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
loop skips NPU PEs as they do not have 32bit DMA segments.

As pnv_npu_set_window() and pnv_npu_unset_window() are started being used
by the new IODA2-NPU IOMMU group, this makes the helpers public and
adds the DMA window number parameter.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-By: NAlistair Popple <alistair@popple.id.au>
[mpe: Add pnv_pci_ioda_setup_iommu_api() to fix build with IOMMU_API=n]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

b5cb9ab1

powerpc/powernv/npu: Rework TCE Kill handling · 85674868

由 Alexey Kardashevskiy 提交于 4月 29, 2016

The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
used to link GPU and NPU for 2 purposes:

1. Access NPU quickly when configuring DMA for GPU - this was addressed
in the previos patch by removing use of it as DMA setup is not what
the kernel would constantly do.

2. Invalidate TCE cache for NPU when it is invalidated for GPU.
GPU and NPU are in different PE. There is already a mechanism to
attach multiple iommu_table_group to the same iommu_table (used for VFIO),
we can reuse it here so does this patch.

This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are
not needed anymore.

While we are here, add TCE cache invalidation after enabling bypass.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-By: NAlistair Popple <alistair@popple.id.au>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

85674868

powerpc/powernv/npu: Add set/unset window helpers · b575c731

由 Alexey Kardashevskiy 提交于 4月 29, 2016

The upcoming NVLink passthrough support will require NPU code to cope
with two DMA windows.

This adds a pnv_npu_set_window() helper which programs 32bit window to
the hardware. This also adds multilevel TCE support.

This adds a pnv_npu_unset_window() helper which removes the DMA window
from the hardware. This does not make difference now as the caller -
pnv_npu_dma_set_bypass() - enables bypass in the hardware but the next
patch will use it to manage TCE table lists for TCE Kill handling.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-By: NAlistair Popple <alistair@popple.id.au>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

b575c731

powerpc/powernv/ioda2: Export debug helper pe_level_printk() · 7d623e42

由 Alexey Kardashevskiy 提交于 4月 29, 2016

This exports debugging helper pe_level_printk() and corresponding macroses
so they can be used in npu-dma.c.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-By: NAlistair Popple <alistair@popple.id.au>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

7d623e42

powerpc/powernv/npu: Simplify DMA setup · f9f83456

由 Alexey Kardashevskiy 提交于 4月 29, 2016

NPU devices are emulated in firmware and mainly used for NPU NVLink
training; one NPU device is per a hardware link. Their DMA/TCE setup
must match the GPU which is connected via PCIe and NVLink so any changes
to the DMA/TCE setup on the GPU PCIe device need to be propagated to
the NVLink device as this is what device drivers expect and it doesn't
make much sense to do anything else.

This makes NPU DMA setup explicit.
pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda,
made static and prints warning as dma_set_mask() should never be called
on this function as in any case it will not configure GPU; so we make
this explicit.

Instead of using PNV_IODA_PE_PEER and peers[] (which the next patch will
remove), we test every PCI device if there are corresponding NVLink
devices. If there are any, we propagate bypass mode to just found NPU
devices by calling the setup helper directly (which takes @bypass) and
avoid guessing (i.e. calculating from DMA mask) whether we need bypass
or not on NPU devices. Since DMA setup happens in very rare occasion,
this will not slow down booting or VFIO start/stop much.

This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it
more clear what the function really does which is programming 32bit
table address to the TVT ("disabling bypass" means writing zeroes to
the TVT).

This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as
the DMA configuration on NPU does not matter until dma_set_mask() is
called on GPU and that will do the NPU DMA configuration.

This removes phb->dma_dev_setup initialization for NPU as
pnv_pci_ioda_dma_dev_setup is no-op for it anyway.

This stops using npe->tce_bypass_base as it never changes and values
other than zero are not supported.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Reviewed-by: NAlistair Popple <alistair@popple.id.au>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

f9f83456

powerpc/powernv/npu: Use the correct IOMMU page size · 6969af73

由 Alexey Kardashevskiy 提交于 4月 29, 2016

This uses the page size from iommu_table instead of hard-coded 4K.
This should cause no change in behavior.

While we are here, move bits around to prepare for further rework
which will define and use iommu_table_group_ops.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Reviewed-by: NAlistair Popple <alistair@popple.id.au>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

6969af73

powerpc/powernv/npu: TCE Kill helpers cleanup · 0bbcdb43

由 Alexey Kardashevskiy 提交于 4月 29, 2016

NPU PHB TCE Kill register is exactly the same as in the rest of POWER8
so let's reuse the existing code for NPU. The only bit missing is
a helper to reset the entire TCE cache so this moves such a helper
from NPU code and renames it.

Since pnv_npu_tce_invalidate() does really invalidate the entire cache,
this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU.
This adds an explicit comment for workaround for invalidating NPU TCE
cache.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Reviewed-by: NAlistair Popple <alistair@popple.id.au>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

0bbcdb43

powerpc/powernv: Define TCE Kill flags · bef9253f

由 Alexey Kardashevskiy 提交于 4月 29, 2016

This replaces magic constants for TCE Kill IODA2 register with macros.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

bef9253f

powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire · a7cf13ca

由 Alexey Kardashevskiy 提交于 4月 29, 2016

As in fact pnv_pci_ioda2_tce_invalidate_entire() invalidates TCEs for
the specific PE rather than the entire cache, rename it to
pnv_pci_ioda2_tce_invalidate_pe(). In later patches we will add
a proper pnv_pci_ioda2_tce_invalidate_entire().
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

a7cf13ca

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功