  1. 28 Jul, 2016 (2 commits)
    •
      KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE · 93d17397
      Paul Mackerras committed
      It turns out that if the guest does a H_CEDE while the CPU is in
      a transactional state, and the H_CEDE does a nap, and the nap
      loses the architected state of the CPU (which it is allowed to do),
      then we lose the checkpointed state of the virtual CPU.  In addition,
      the transactional-memory state recorded in the MSR gets reset back
      to non-transactional, and when we try to return to the guest, we take
      a TM bad thing type of program interrupt because we are trying to
      transition from non-transactional to transactional with a hrfid
      instruction, which is not permitted.
      
      The result of the program interrupt occurring at that point is that
      the host CPU will hang in an infinite loop with interrupts disabled.
      Thus this is a denial of service vulnerability in the host which can
      be triggered by any guest (and depending on the guest kernel, it can
      potentially be triggered by unprivileged userspace in the guest).
      
      This vulnerability has been assigned the ID CVE-2016-5412.
      
      To fix this, we save the TM state before napping and restore it
      on exit from the nap, when handling a H_CEDE in real mode.  The
      case where H_CEDE exits to host virtual mode is already OK (as are
      other hcalls which exit to host virtual mode) because the exit
      path saves the TM state.
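
      The shape of the fix, sketched in C for clarity (the real code is
      powerpc assembly; kvmppc_save_tm/kvmppc_restore_tm are the helpers
      factored out by the companion patch below, the rest is illustrative):

          struct kvm_vcpu;
          void kvmppc_save_tm(struct kvm_vcpu *vcpu);     /* save checkpointed state */
          void kvmppc_restore_tm(struct kvm_vcpu *vcpu);  /* rebuild TM state */
          void nap(void);                                 /* may lose architected state */

          static void h_cede_realmode_idle(struct kvm_vcpu *vcpu)
          {
                  kvmppc_save_tm(vcpu);     /* before the nap can lose it */
                  nap();
                  kvmppc_restore_tm(vcpu);  /* before hrfid back into the guest */
          }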
      
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    •
      KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures · f024ee09
      Paul Mackerras committed
      This moves the transactional memory state save and restore sequences
      out of the guest entry/exit paths into separate procedures.  This is
      so that these sequences can be used in going into and out of nap
      in a subsequent patch.
      
      The only code changes here are (a) saving and restoring LR on the
      stack, since these new procedures get called with a bl instruction,
      (b) explicitly saving r1 into the PACA instead of assuming that
      HSTATE_HOST_R1(r13) is already set, and (c) removing an unnecessary
      and redundant setting of MSR[TM] that should have been removed by
      commit 9d4d0bdd9e0a ("KVM: PPC: Book3S HV: Add transactional memory
      support", 2013-09-24) but wasn't.
      
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  2. 14 Jul, 2016 (1 commit)
  3. 01 Jul, 2016 (1 commit)
  4. 20 Jun, 2016 (4 commits)
    •
      KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt · fd7bacbc
      Mahesh Salgaonkar committed
      When a guest is assigned to a core, the host Timebase (TB) is converted
      into the guest TB by adding the guest timebase offset before entering
      the guest. During guest exit, the guest TB is converted back to the
      host TB. This means that under certain conditions (e.g. guest
      migration) the host TB and guest TB can differ.
      
      When we get an HMI for TB-related issues, the opal HMI handler tries to
      fix the errors and restore the correct host TB value. With no guest
      running there are no issues, but with a guest running on the core we
      run into TB corruption.
      
      If we get an HMI while in the guest, the current HMI handler invokes
      the opal hmi handler before forcing the guest to exit. The guest exit
      path then subtracts the guest TB offset from the current TB value,
      which may already have been restored to the host value by the opal hmi
      handler. This leads to incorrect host and guest TB values.
      
      With split-core, things become more complex: the TB also gets split and
      each subcore gets its own TB register. When an hmi handler fixes a TB
      error and restores the TB value, it affects the TB values of all
      sibling subcores on the same core. On TB errors, every thread in the
      core gets an HMI. With the existing code, the individual threads call
      the opal hmi handler independently, which can easily throw the TB out
      of sync if guests are running on the subcores. Hence we need to
      coordinate all the threads before making the opal hmi handler call,
      followed by a TB resync.
      
      This patch introduces a sibling subcore state structure (shared by all
      threads in the core) in the paca, which holds information about whether
      the sibling subcores are in guest mode or host mode. An array
      in_guest[] of size MAX_SUBCORE_PER_CORE=4 maintains the state of each
      subcore; the subcore id is used as the index into the in_guest[] array.
      Only the primary thread entering/exiting the guest is responsible for
      setting and clearing its designated array element.
      
      On a TB error, we get an HMI interrupt on every thread on the core.
      Upon HMI, this patch now forces the guest to vacate the core/subcore.
      The primary thread of each subcore then turns off its respective bit
      in the above bitmap during the guest exit path, just after the
      guest->host partition switch is complete.
      
      All other threads that have just exited the guest, or were already in
      the host, wait until all the subcores have cleared their respective
      bits. Once all the subcores turn off their bits, all threads make the
      call to the opal hmi handler.
      
      The opal hmi handler does not necessarily resync the TB value on every
      HMI interrupt; it does so only for HMIs caused by TB errors and does
      not touch the TB otherwise. Hence, to keep things simple, the primary
      thread calls TB resync explicitly once per core immediately after the
      opal hmi handler, instead of subtracting the guest offset from the TB.
      The TB resync call restores the TB to the host value, so we can be
      sure about the TB state.
      
      One of the primary threads exiting the guest takes responsibility for
      calling TB resync, using one of the top bits (bit 63) of the subcore
      state flags bitmap to make the decision. The first primary thread
      (among the subcores) that manages to set the bit calls the TB resync;
      all other threads wait until the TB resync is complete and then
      proceed.
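
      A userspace C model of this coordination may make the sequence clearer
      (the field names follow the text above; the synchronization details are
      a sketch, not the kernel code):

          #include <stdatomic.h>

          #define MAX_SUBCORE_PER_CORE 4
          #define SUBCORE_TB_RESYNC    (1ULL << 63)  /* bit 63 of the flags bitmap */

          struct sibling_subcore_state {
                  atomic_uchar  in_guest[MAX_SUBCORE_PER_CORE];
                  atomic_ullong flags;
          };

          static void hmi_exit_path(struct sibling_subcore_state *s, int subcore_id)
          {
                  /* Primary thread: clear our slot after the guest->host switch. */
                  atomic_store(&s->in_guest[subcore_id], 0);

                  /* All threads: wait until no subcore is still in guest mode. */
                  for (int i = 0; i < MAX_SUBCORE_PER_CORE; i++)
                          while (atomic_load(&s->in_guest[i]))
                                  ;

                  /* Call the opal hmi handler, then let the first primary
                   * thread that wins bit 63 do the explicit TB resync. */
                  if (!(atomic_fetch_or(&s->flags, SUBCORE_TB_RESYNC)
                        & SUBCORE_TB_RESYNC)) {
                          /* resync TB: restores the host TB value */
                  }
          }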
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    •
      powerpc/powernv: Remove the usage of PACAR1 from opal wrappers · 6dd06d15
      Mahesh Salgaonkar committed
      The OPAL_CALL wrapper code sticks r1 (the stack pointer) into PACAR1
      purely for debugging purposes. The power7_wakeup* functions rely on the
      stack pointer saved in PACAR1, so any opal call made via the opal
      wrapper (directly or indirectly) before we fall through power7_wakeup*
      ends up replacing r1 in PACAR1(r13), leading to a kernel panic. So far
      we have not seen any issues because we have never made an opal call via
      the OPAL wrapper before power7_wakeup*, but the subsequent HMI patch
      needs to invoke C code during the cpu wakeup/idle path that indirectly
      makes an opal call via the opal wrapper. This patch facilitates the
      subsequent HMI patch by removing the usage of PACAR1 from the opal call
      wrapper.
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    •
      KVM: PPC: Book3S PR: Fix contents of SRR1 when injecting a program exception · b69890d1
      Thomas Huth committed
      vcpu->arch.shadow_srr1 only contains usable values for injecting
      a program exception into the guest if we entered the function
      kvmppc_handle_exit_pr() with exit_nr == BOOK3S_INTERRUPT_PROGRAM.
      In other cases, the shadow_srr1 bits are zero. Since we want to
      pass an illegal-instruction program check to the guest, set
      "flags" to SRR1_PROGILL for these other cases.
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    •
      KVM: PPC: Book3S PR: Fix illegal opcode emulation · 708e75a3
      Thomas Huth committed
      If kvmppc_handle_exit_pr() calls kvmppc_emulate_instruction() to emulate
      one instruction (in the BOOK3S_INTERRUPT_H_EMUL_ASSIST case), it calls
      kvmppc_core_queue_program() afterwards if kvmppc_emulate_instruction()
      returned EMULATE_FAIL, so the guest gets a program interrupt for the
      illegal opcode.
      However, kvmppc_emulate_instruction() already tried to inject a
      program exception for this itself, so the program interrupt gets
      injected twice and the return address in srr0 gets destroyed.
      All other callers of kvmppc_emulate_instruction() are also injecting
      a program interrupt, and since the callers have the right knowledge
      about the srr1 flags that should be used, it is the function
      kvmppc_emulate_instruction() that should _not_ inject program
      interrupts, so remove the kvmppc_core_queue_program() here.
      
      This fixes the issue discovered by Laurent Vivier with kvm-unit-tests
      where the logs are filled with these messages when the test tries
      to execute an illegal instruction:
      
           Couldn't emulate instruction 0x00000000 (op 0 xop 0)
           kvmppc_handle_exit_pr: emulation at 700 failed (00000000)
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Reviewed-by: Alexander Graf <agraf@suse.de>
      Tested-by: Laurent Vivier <lvivier@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  5. 24 May, 2016 (1 commit)
  6. 21 May, 2016 (2 commits)
    •
      printk/nmi: generic solution for safe printk in NMI · 42a0bb3f
      Petr Mladek committed
      printk() takes some locks and so cannot be used safely in NMI
      context.
      
      The chance of a deadlock is real especially when printing stacks from
      all CPUs.  This particular problem has been addressed on x86 by the
      commit a9edc880 ("x86/nmi: Perform a safe NMI stack trace on all
      CPUs").
      
      The patchset brings two big advantages.  First, it makes the NMI
      backtraces safe on all architectures for free.  Second, it makes all NMI
      messages almost safe on all architectures (the temporary buffer is
      limited, so we should still keep the number of messages in NMI context
      to a minimum).
      
      Note that there already are several messages printed in NMI context:
      WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
      handlers.  These are not easy to avoid.
      
      This patch reuses most of the code and makes it generic.  It is useful
      for all messages and architectures that support NMI.
      
      The alternative printk_func is set when entering NMI context and is
      reset when leaving it.  It queues IRQ work to copy the messages into
      the main ring buffer in a safe context.
      
      __printk_nmi_flush() copies all available messages and resets the
      buffer.  This lets us use a simple cmpxchg operation to synchronize
      with writers; a spinlock is also used to synchronize with other
      flushers.
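
      A userspace C model of the writer side (the kernel keeps this in a
      per-CPU buffer; the struct, names and size here are illustrative):

          #include <stdatomic.h>
          #include <string.h>

          #define NMI_BUF_SIZE 4096

          struct nmi_seq_buf {
                  atomic_uint len;
                  char        buffer[NMI_BUF_SIZE];
          };

          /* NMI-side write: reserve space with a cmpxchg loop, then copy.
           * The IRQ-work flusher later copies [0, len) out and resets len. */
          static unsigned nmi_buf_add(struct nmi_seq_buf *s,
                                      const char *msg, unsigned n)
          {
                  unsigned old = atomic_load(&s->len);
                  do {
                          if (old + n > NMI_BUF_SIZE)
                                  return 0;      /* buffer full: message dropped */
                  } while (!atomic_compare_exchange_weak(&s->len, &old, old + n));
                  memcpy(s->buffer + old, msg, n);
                  return n;
          }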
      
      We no longer use seq_buf because it depends on an external lock.  It
      would be hard to make all of its supported operations safe for lockless
      use, and confusing and error-prone to make only some of them safe.
      
      The code is put into a separate printk/nmi.c as suggested by Steven
      Rostedt.  It needs a per-CPU buffer and is compiled only on
      architectures that call nmi_enter().  This is achieved by the new
      HAVE_NMI Kconfig flag.
      
      The exceptions are the MN10300 and Xtensa architectures.  We need to
      clean up NMI handling there first.  Let's do it separately.
      
      The patch is heavily based on the draft from Peter Zijlstra, see
      
        https://lkml.org/lkml/2015/6/10/327
      
      [arnd@arndb.de: printk-nmi: use %zu format string for size_t]
      [akpm@linux-foundation.org: min_t->min - all types are size_t here]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>	[arm part]
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      exit_thread: remove empty bodies · 5f56a5df
      Jiri Slaby committed
      Define HAVE_EXIT_THREAD for archs which want to do something in
      exit_thread. For others, let's define exit_thread as an empty inline.
      
      This is a cleanup before we change the prototype of exit_thread to
      accept a task parameter.
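
      The pattern, as a minimal sketch (the real change spans Kconfig and the
      scheduler headers; the config symbol name follows the patch):

          /* Arches that select HAVE_EXIT_THREAD provide a real definition;
           * everyone else gets a no-op inline, so callers need no #ifdefs. */
          #ifdef CONFIG_HAVE_EXIT_THREAD
          void exit_thread(void);
          #else
          static inline void exit_thread(void)
          {
          }
          #endif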
      
      [akpm@linux-foundation.org: fix mips]
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 20 May, 2016 (2 commits)
    •
      arch: fix has_transparent_hugepage() · fd8cfd30
      Hugh Dickins committed
      I've just discovered that the useful-sounding has_transparent_hugepage()
      is actually an architecture-dependent minefield: on some arches it only
      builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
      not, but on some of those (arm and arm64) it then gives the wrong
      answer; and on mips alone it's marked __init, which would crash if
      called later (but so far it has not been called later).
      
      Straighten this out: make it available to all configs, with a sensible
      default in asm-generic/pgtable.h, removing its definitions from those
      arches (arc, arm, arm64, sparc, tile) which are served by the default,
      adding #define has_transparent_hugepage has_transparent_hugepage to
      those (mips, powerpc, s390, x86) which need to override the default at
      runtime, and removing the __init from mips (but maybe that kind of code
      should be avoided after init: set a static variable the first time it's
      called).
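
      The resulting shape, roughly (a sketch of the generic default; arches
      that must decide at runtime override the macro with a real function):

          /* asm-generic/pgtable.h style default */
          #ifndef has_transparent_hugepage
          #ifdef CONFIG_TRANSPARENT_HUGEPAGE
          #define has_transparent_hugepage() 1
          #else
          #define has_transparent_hugepage() 0
          #endif
          #endif

          /* A runtime arch (mips, powerpc, s390, x86) instead does
           *     #define has_transparent_hugepage has_transparent_hugepage
           * next to its own function of that name. */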
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>		[arch/arc]
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[arch/s390]
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      powerpc: mm: use hugetlb_bad_size() · 71bf79cc
      Vaishali Thakkar committed
      Update setup_hugepagesz() to call hugetlb_bad_size() when an
      unsupported hugepage size is found.
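
      Approximately, the powerpc path becomes (a sketch using the kernel
      helpers named in the commit; the message text is illustrative):

          static int __init setup_hugepagesz(char *str)
          {
                  unsigned long long size = memparse(str, &str);

                  if (add_huge_page_size(size) != 0) {
                          hugetlb_bad_size();  /* common helper notes the bad size */
                          pr_err("Invalid huge page size specified(%llu)\n", size);
                  }
                  return 1;
          }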
      Signed-off-by: Vaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 19 May, 2016 (1 commit)
  9. 17 May, 2016 (9 commits)
  10. 16 May, 2016 (2 commits)
  11. 13 May, 2016 (2 commits)
    •
      KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Christian Borntraeger committed
      Some wakeups should not be considered a successful poll. For example,
      on s390 I/O interrupts are usually floating, which means that _ALL_
      CPUs would be considered runnable, letting all vCPUs poll all the time
      for transactional workloads even if one vCPU would be enough. This can
      result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups as good
      or bad with regard to polling.
      
      For s390, the implementation fences off halt polling for anything but
      known-good, single-vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be
      delivered by whatever CPU first checks for a pending interrupt. We
      prefer the woken-up CPU by marking its poll as a "good" poll. This
      code also marks several other wakeup reasons, like IPIs or expired
      timers, as "good", and will of course mark some events as not
      successful. As KVM on z always runs as a second-level hypervisor, we
      prefer not to poll unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like the uperf
      1-byte transactional ping-pong workload or wakeup-heavy workloads like
      OLTP, while still providing a proper speedup.
      
      This also introduces a new vcpu stat, "halt_poll_no_tuning", that
      counts wakeups considered not good for polling.
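
      The idea, in a hedged sketch (the stat name is from this commit; the
      field and helper names here are assumptions):

          #include <stdbool.h>

          struct vcpu {
                  bool valid_wakeup;                 /* set by arch code */
                  unsigned long halt_poll_no_tuning; /* stat from this patch */
          };

          /* Only tune the poll window on wakeups the arch calls "good". */
          static void record_wakeup(struct vcpu *v, bool woken_while_polling)
          {
                  if (woken_while_polling && !v->valid_wakeup) {
                          v->halt_poll_no_tuning++;  /* don't grow/shrink on this */
                          return;
                  }
                  /* ...normal grow/shrink of the halt-polling interval... */
          }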
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    •
      coredump: get rid of coredump_params->written · a0083939
      Omar Sandoval committed
      cprm->written is redundant with cprm->file->f_pos, so use that instead.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  12. 12 May, 2016 (12 commits)
    •
      powerpc/4xx: Device tree update for the 460ex DWC SATA · 951b8e4e
      Andy Shevchenko committed
      Device tree update for the Applied Micro 460ex processor's on-chip SATA
      to use the "dmas" property.
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    •
      powerpc/powernv/npu: Add PE to PHB's list · 1d4e89cf
      Alexey Kardashevskiy committed
      Before commit 3e68dc57 "powerpc/powernv: Remove DMA32 PE list", NPU PEs
      were linked to the NPU PHB via phb->ioda.pe_dma_list; after that fix,
      the phb->ioda.pe_list is used.
      
      During the pe_dma_list removal, the list_add_tail(&phb->ioda.pe_dma_list)
      call was removed, but no list_add() to the new list was added in its
      place; this patch adds it.
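
      The missing piece is approximately one line (list names as described
      above; shown as a sketch):

          /* link the freshly set-up NPU PE onto the PHB's PE list */
          list_add_tail(&pe->list, &phb->ioda.pe_list);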
      
      Fixes: 3e68dc57219a ("powerpc/powernv: Remove DMA32 PE list")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    •
      powerpc/powernv: Fix insufficient memory allocation · 92a86756
      Alexey Kardashevskiy committed
      The pnv_pci_init_ioda_phb() helper allocates a blob to store auxiliary
      data such as the PE and M32/M64 segment allocation maps; this single
      blob has a few partitions, the size of each being derived from the PE
      count, phb->ioda.total_pe_num.
      
      It was assumed that the minimum PE count is 8; however, it is 4 for
      NPU, so the pe_alloc part was missing from the allocated blob. This
      was invisible until recently, as we were not tracking used M64
      segments and NPUs do not use M32 segments, so phb->ioda.m32_segmap
      (which pointed to the same address as phb->ioda.pe_alloc) was never
      written to, leaving the pe_alloc memory intact.
      
      After commit 401203ac2d "powerpc/powernv: Track M64 segment consumption"
      the pe_alloc part gets corrupted and PE allocation cannot work. This
      fixes the issue by enforcing a minimum PE count of 8.
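
      The fix amounts to clamping the PE count used for sizing, along these
      lines (a sketch):

          /* Size the auxiliary blob for at least 8 PEs, so the pe_alloc
           * partition exists even when an NPU PHB reports only 4. */
          unsigned int size_pes = phb->ioda.total_pe_num < 8 ?
                                  8 : phb->ioda.total_pe_num;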
      
      Fixes: 401203ac2d15 ("powerpc/powernv: Track M64 segment consumption")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    •
      powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism · 8445a87f
      Guilherme G. Piccoli committed
      Commit 39baadbf ("powerpc/eeh: Remove eeh information from pci_dn")
      changed the pci_dn struct by removing its EEH-related members.
      As part of this clean-up, DDW mechanism was modified to read the device
      configuration address from eeh_dev struct.
      
      As a consequence, if we now disable the EEH mechanism on the kernel
      command line, for example, the DDW mechanism fails, generating a
      kernel oops by dereferencing a NULL pointer (which turns out to be the
      eeh_dev pointer).
      
      This patch just changes the configuration address calculation in the
      DDW functions to a manual calculation based on pci_dn members, instead
      of using the eeh_dev-based address.
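
      A sketch of the manual calculation (the shift encoding shown is the
      usual RTAS config-address form, stated here as an assumption):

          struct pci_dn { unsigned busno, devfn; };  /* relevant fields only */

          /* derive the config address from pci_dn instead of eeh_dev */
          static unsigned ddw_config_addr(struct pci_dn *pdn)
          {
                  return (pdn->busno << 16) | (pdn->devfn << 8);
          }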
      
      No functional changes were made. This was tested on pSeries, both
      under PHyp and as a qemu guest.
      
      Fixes: 39baadbf ("powerpc/eeh: Remove eeh information from pci_dn")
      Cc: stable@vger.kernel.org # v3.4+
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    •
      Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" · c2078d9e
      Guilherme G. Piccoli committed
      This reverts commit 89a51df5.
      
      The function eeh_add_device_early() is used to perform EEH
      initialization for devices added later to the system, as in
      hotplug/DLPAR scenarios. Commit 89a51df5 ("powerpc/eeh: Fix crash in
      eeh_add_device_early() on Cell") introduced a new check in this
      function: Cell has no EEH capabilities, which led to a kernel oops if
      hotplug was performed, so a check for eeh_enabled() was added to avoid
      the issue.
      
      However, on architectures where EEH is present, like pSeries or
      PowerNV, we might hit a case in which no PCI devices are present at
      boot time and so EEH is not initialized. Then, if a device is added
      via DLPAR for example, eeh_add_device_early() fails because
      eeh_enabled() is false, and EEH ends up not being enabled at all.
      
      This reverts the aforementioned patch, since a new verification was
      introduced by commit d91dafc0 ("powerpc/eeh: Delay probing EEH device
      during hotplug") and so the original Cell issue no longer happens.
      
      Cc: stable@vger.kernel.org # v4.1+
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    •
      powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner() · d6d63d72
      Gavin Shan committed
      The label "reset" in eeh_pe_change_owner() is used only once.
      There is no need to keep it, so just drop it. No logical changes
      introduced.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    •
      powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover() · 2efc771f
      Gavin Shan committed
      The function eeh_pe_reset_and_recover() is used to recover from an EEH
      error when a passthrough device is transferred to a guest and back,
      meaning the device's driver is vfio-pci or none. In both cases, the
      handlers triggered by eeh_report_reset() and eeh_report_resume()
      shouldn't be called.
      
      This ignores the error handlers from eeh_report_reset() and
      eeh_report_resume().
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    •
      powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover() · 5a0cdbfd
      Gavin Shan committed
      The function eeh_pe_reset_and_recover() is used to recover from an EEH
      error when a passthrough device is transferred to a guest and back.
      The content of the device's config space is lost on the PE reset
      issued in the middle of the recovery, so the function saves/restores
      it before/after the reset. However, config access to some adapters,
      like the Broadcom BCM5719, at this point causes a fenced PHB: the
      config space is blocked, so we save 0xFF's, which are then restored at
      a later point. The memory BARs end up totally corrupted, causing
      another EEH error upon access to one of them.
      
      This resolves the issue by restoring the config space on adapters like
      the BCM5719 from the content saved to the EEH device when it was
      populated.
      
      Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
      Cc: stable@vger.kernel.org #v3.18+
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    •
      powerpc/eeh: Don't report error in eeh_pe_reset_and_recover() · affeb0f2
      Gavin Shan committed
      The function eeh_pe_reset_and_recover() is used to recover from an EEH
      error when a passthrough device is transferred to a guest and back,
      meaning the device's driver is vfio-pci or none.
      When the driver is vfio-pci, which provides only an error_detected()
      handler, that handler simply stops the guest, which is not the
      expected behaviour. On the other hand, no error handlers are called
      if we don't have a bound driver.
      
      This ignores the error handler in eeh_pe_reset_and_recover() that
      reports the error to the device driver, avoiding the exceptional
      behaviour.
      
      Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
      Cc: stable@vger.kernel.org #v3.18+
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    •
      Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()" · 848912e5
      Michael Ellerman committed
      This reverts commit c8ceacc2.
      
      Gavin says: I missed the fact that it affects the PCI passthrough
      path, as reported by Alexey: when passing through a GPU (0003:01:00.0)
      which sits behind the root port, the reset request is routed to
      skiboot in the original code. In skiboot, the link bouncing events are
      masked during the reset, so we don't see an EEH (freeze all) error
      even if link bouncing happens. With the changes included, the reset is
      done by the kernel, and the link bouncing events aren't masked by
      altering the content of the PHB3 (or P7IOC) specific hardware
      registers, which are invisible to the kernel (skiboot hides the
      hardware specifics). This means the link bouncing is seen by the root
      port and causes an EEH (freeze all) error, so PCI passthrough of the
      GPU device cannot work.
      Requested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Requested-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    •
      KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts · b1a4286b
      Paul Mackerras committed
      Commit c9a5ecca ("kvm/eventfd: add arch-specific set_irq",
      2015-10-16) added the possibility for architecture-specific code
      to handle the generation of virtual interrupts in atomic context
      where possible, without having to schedule a work function.
      
      Since we can easily generate virtual interrupts on XICS without
      having to do anything worse than take a spinlock, we define a
      kvm_arch_set_irq_inatomic() for XICS.  We also remove kvm_set_msi()
      since it is not used any more.
      
      The one slightly tricky thing is that with the new interface, we
      don't get told whether the interrupt is an MSI (or other edge
      sensitive interrupt) vs. level-sensitive.  The difference as far
      as interrupt generation is concerned is that for LSIs we have to
      set the asserted flag so it will continue to fire until it is
      explicitly cleared.
      
      In fact the XICS code gets told which interrupts are LSIs by userspace
      when it configures the interrupt via the KVM_DEV_XICS_GRP_SOURCES
      attribute group on the XICS device.  To store this information, we add
      a new "lsi" field to struct ics_irq_state.  With that we can also do a
      better job of returning accurate values when reading the attribute
      group.
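
      In sketch form (the struct and field names follow the text; the logic
      shown is illustrative):

          struct ics_irq_state {
                  unsigned char lsi;       /* level-sensitive source? */
                  unsigned char asserted;  /* LSIs keep firing until cleared */
          };

          static void ics_set_irq(struct ics_irq_state *st, int level)
          {
                  if (st->lsi)
                          st->asserted = level;  /* remember the line state */
                  /* ...then present the interrupt to the ICP as usual... */
          }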
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    •
      kvm: introduce KVM_MAX_VCPU_ID · 0b1b1dfd
      Greg Kurz committed
      The KVM_MAX_VCPUS define provides the maximum number of vCPUs per
      guest, and also the upper limit for vCPU ids. This is okay for all
      archs except PowerPC, which can have higher ids depending on the
      cpu/core/thread topology. In the worst case (single-threaded guest,
      host with 8 threads per core), it limits the maximum number of vCPUs
      to KVM_MAX_VCPUS / 8.
      
      This patch separates the vCPU numbering from the total number of vCPUs
      by introducing KVM_MAX_VCPU_ID as the maximal valid vCPU id plus one.
      
      The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
      before passing them to KVM_CREATE_VCPU.
      
      This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
      Other archs continue to return KVM_MAX_VCPUS instead.
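
      The resulting check at vCPU creation looks roughly like this (the
      values are illustrative; PowerPC derives KVM_MAX_VCPU_ID from its
      thread topology):

          #include <errno.h>

          #define KVM_MAX_VCPUS   256                 /* illustrative count */
          #define KVM_MAX_VCPU_ID (8 * KVM_MAX_VCPUS) /* e.g. 8 threads/core */

          /* ids are validated against the id limit, not the vCPU count */
          static int validate_vcpu_id(unsigned int id)
          {
                  return id < KVM_MAX_VCPU_ID ? 0 : -EINVAL;
          }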
      Suggested-by: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. 11 May, 2016 (1 commit)
    •
      powerpc/powernv/npu: Enable NVLink pass through · b5cb9ab1
      Alexey Kardashevskiy committed
      IBM POWER8 NVLink systems come with Tesla K40-ish GPUs, each of which
      also has a couple of high-speed links (NVLink). The interface to the
      links is exposed as an emulated PCI bridge which is included in the
      same IOMMU group as the corresponding GPU.
      
      In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a
      PE, which behave pretty much like the standard IODA2 PHB, except that
      the NPU PHB has just a single TVE in hardware, which means it can have
      either a 32bit window, a 64bit window, or DMA bypass, but never two of
      these at once.
      
      In order to make these links work when the GPU is passed through to a
      guest, these bridges need to be passed through as well; otherwise
      performance will degrade.
      
      This implements and exports an API to manage NPU state with regard to
      VFIO; it replicates iommu_table_group_ops.
      
      This defines a new pnv_pci_ioda2_npu_ops which is assigned to the
      IODA2 bridge if there are NPUs for a GPU on the bridge. The new
      callbacks call the default IODA2 callbacks plus the new NPU API. This
      adds a gpe_table_group_to_npe() helper to find the NPU PE for the
      IODA2 table_group; it is not expected to fail, as the helper is only
      called from pnv_pci_ioda2_npu_ops.
      
      This does not define an NPU-specific .release_ownership(), so after
      VFIO is finished, DMA on the NPU is disabled; this is ok, as the
      nvidia driver sets the DMA mask when probing, which enables 32 or
      64bit DMA on the NPU.
      
      This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to the
      GPU group if any are found. The helper looks for the "ibm,gpu"
      property in the device tree, which is a phandle of the corresponding
      GPU.
      
      This adds an additional loop over PEs in pnv_ioda_setup_dma(), since
      the main loop skips NPU PEs as they do not have 32bit DMA segments.
      
      As pnv_npu_set_window() and pnv_npu_unset_window() are now used by the
      new IODA2-NPU IOMMU group, this makes those helpers public and adds a
      DMA window number parameter.
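
      The delegation pattern, as a sketch (function names follow the commit;
      the signatures are simplified):

          struct iommu_table_group;
          struct iommu_table;
          struct pnv_ioda_pe;

          long pnv_pci_ioda2_set_window(struct iommu_table_group *grp,
                                        int num, struct iommu_table *tbl);
          long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
                                  struct iommu_table *tbl);
          struct pnv_ioda_pe *gpe_table_group_to_npe(struct iommu_table_group *grp);

          /* IODA2-NPU callback: do the default IODA2 work, then mirror the
           * window change onto the NPU PE. */
          static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *grp,
                                                   int num, struct iommu_table *tbl)
          {
                  long ret = pnv_pci_ioda2_set_window(grp, num, tbl);

                  if (!ret)
                          ret = pnv_npu_set_window(gpe_table_group_to_npe(grp),
                                                   num, tbl);
                  return ret;
          }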
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      [mpe: Add pnv_pci_ioda_setup_iommu_api() to fix build with IOMMU_API=n]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>