1. 24 May, 2016 1 commit
  2. 21 May, 2016 2 commits
    • printk/nmi: generic solution for safe printk in NMI · 42a0bb3f
      Committed by Petr Mladek
      printk() takes some locks and cannot be used in a safe way in NMI
      context.
      
      The chance of a deadlock is real especially when printing stacks from
      all CPUs.  This particular problem has been addressed on x86 by the
      commit a9edc880 ("x86/nmi: Perform a safe NMI stack trace on all
      CPUs").
      
      The patchset brings two big advantages.  First, it makes the NMI
      backtraces safe on all architectures for free.  Second, it makes all NMI
      messages almost safe on all architectures.  The temporary buffer is
      limited, so we should still keep the number of messages in NMI context
      to a minimum.
      
      Note that there already are several messages printed in NMI context:
      WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
      handlers.  These are not easy to avoid.
      
      This patch reuses most of the code and makes it generic.  It is useful
      for all messages and architectures that support NMI.
      
      The alternative printk_func is set when entering and is reset when
      leaving NMI context.  It queues IRQ work to copy the messages into the
      main ring buffer in a safe context.
      
      __printk_nmi_flush() copies all available messages and resets the buffer.
      This allows a simple cmpxchg operation to be used to synchronize with
      writers.  A spinlock is also used to synchronize with other
      flushers.
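
The writer/flusher handshake described above can be sketched in user-space C. This is only an illustration under stated assumptions, not the code in printk/nmi.c: the names nmi_log, nmi_log_write, nmi_log_flush and NMI_LOG_SIZE are hypothetical, C11 atomics stand in for the kernel's cmpxchg, and the spinlock that serializes flushers is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define NMI_LOG_SIZE 64          /* hypothetical, deliberately small */

struct nmi_log {
    atomic_size_t len;           /* number of committed bytes */
    char buf[NMI_LOG_SIZE];
};

/* Writer: copy the message behind the committed data, then publish it
 * with a compare-and-swap. If a flusher reset len in between, retry.
 * Returns the number of bytes stored (0 when the buffer is full). */
static size_t nmi_log_write(struct nmi_log *log, const char *msg)
{
    size_t want = strlen(msg);

    assert(log != NULL);
    for (;;) {
        size_t old = atomic_load(&log->len);
        size_t room = NMI_LOG_SIZE - old;
        size_t take = want < room ? want : room;

        if (take == 0)
            return 0;
        memcpy(log->buf + old, msg, take);
        if (atomic_compare_exchange_strong(&log->len, &old, old + take))
            return take;
    }
}

/* Flusher: copy everything out, then reset len to 0 only if no writer
 * appended in the meantime; otherwise copy again. */
static size_t nmi_log_flush(struct nmi_log *log, char *out, size_t out_size)
{
    assert(log != NULL && out_size >= NMI_LOG_SIZE);
    for (;;) {
        size_t len = atomic_load(&log->len);

        memcpy(out, log->buf, len);
        if (atomic_compare_exchange_strong(&log->len, &len, 0))
            return len;
    }
}
```

Because the flusher always takes the whole buffer and resets the length in one step, a single length counter is the only shared state the two sides need to agree on.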
      
      We no longer use seq_buf because it depends on an external lock.  It
      would be hard to make all supported operations safe for lockless use.
      It would be confusing and error-prone to make only some operations safe.
      
      The code is put into separate printk/nmi.c as suggested by Steven
      Rostedt.  It needs a per-CPU buffer and is compiled only on
      architectures that call nmi_enter().  This is achieved by the new
      HAVE_NMI Kconfig flag.
      
      The exceptions are the MN10300 and Xtensa architectures.  We need to
      clean up NMI handling there first.  Let's do it separately.
      
      The patch is heavily based on the draft from Peter Zijlstra, see
      
        https://lkml.org/lkml/2015/6/10/327
      
      [arnd@arndb.de: printk-nmi: use %zu format string for size_t]
      [akpm@linux-foundation.org: min_t->min - all types are size_t here]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>	[arm part]
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit_thread: remove empty bodies · 5f56a5df
      Committed by Jiri Slaby
      Define HAVE_EXIT_THREAD for archs which want to do something in
      exit_thread. For others, let's define exit_thread as an empty inline.
      
      This is a cleanup before we change the prototype of exit_thread to
      accept a task parameter.
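
A minimal sketch of the pattern described above, assuming HAVE_EXIT_THREAD is visible to the code as a plain preprocessor symbol (how the kernel actually exposes it is not stated here). The counter arch_cleanup_calls and the helper run_exit_path() are hypothetical and exist only to make the effect observable.

```c
#include <assert.h>

/* Hypothetical stand-in for work an architecture does on thread exit. */
static int arch_cleanup_calls;

/* An architecture that needs the hook would define HAVE_EXIT_THREAD
 * and supply a real exit_thread(); everyone else gets an empty inline. */
#ifdef HAVE_EXIT_THREAD
static void exit_thread(void)
{
    arch_cleanup_calls++;
}
#else
static inline void exit_thread(void)
{
}
#endif

/* Generic code calls the hook unconditionally in both configurations. */
static int run_exit_path(void)
{
    assert(arch_cleanup_calls >= 0);
    exit_thread();
    return arch_cleanup_calls;
}
```

With the symbol undefined, the call compiles away and no per-architecture empty function bodies are needed.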
      
      [akpm@linux-foundation.org: fix mips]
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 20 May, 2016 2 commits
    • arch: fix has_transparent_hugepage() · fd8cfd30
      Committed by Hugh Dickins
      I've just discovered that the useful-sounding has_transparent_hugepage()
      is actually an architecture-dependent minefield: on some arches it only
      builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
      not, but on some of those (arm and arm64) it then gives the wrong
      answer; and on mips alone it's marked __init, which would crash if
      called later (but so far it has not been called later).
      
      Straighten this out: make it available to all configs, with a sensible
      default in asm-generic/pgtable.h, removing its definitions from those
      arches (arc, arm, arm64, sparc, tile) which are served by the default,
      adding #define has_transparent_hugepage has_transparent_hugepage to
      those (mips, powerpc, s390, x86) which need to override the default at
      runtime, and removing the __init from mips (but maybe that kind of code
      should be avoided after init: set a static variable the first time it's
      called).
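
The override scheme described above (an architecture defines the function and then defines a macro of the same name, while the generic header supplies a default only if that macro is absent) might look roughly like this. It is a sketch, not the contents of asm-generic/pgtable.h: ARCH_HAS_RUNTIME_THP_CHECK and cpu_supports_huge_pages are hypothetical names used to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>

/* "Arch header": an architecture that must decide at runtime provides
 * its own function and then defines a macro with the same name. */
#ifdef ARCH_HAS_RUNTIME_THP_CHECK            /* hypothetical demo switch */
static bool cpu_supports_huge_pages = true;  /* stand-in for a CPU feature test */
static inline int has_transparent_hugepage(void)
{
    return cpu_supports_huge_pages;
}
#define has_transparent_hugepage has_transparent_hugepage
#endif

/* "Generic header": the default applies only when no arch override exists,
 * and it is available whether or not CONFIG_TRANSPARENT_HUGEPAGE is set. */
#ifndef has_transparent_hugepage
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define has_transparent_hugepage() 1
#else
#define has_transparent_hugepage() 0
#endif
#endif

/* Callers can use the name in every configuration. */
static int thp_usable(void)
{
    int usable = has_transparent_hugepage();

    assert(usable == 0 || usable == 1);
    return usable;
}
```

Built without either symbol defined, thp_usable() returns 0; defining CONFIG_TRANSPARENT_HUGEPAGE or the hypothetical arch switch changes the answer without touching the caller.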
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>		[arch/arc]
      Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[arch/s390]
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • powerpc: mm: use hugetlb_bad_size() · 71bf79cc
      Committed by Vaishali Thakkar
      Update setup_hugepagesz() to call hugetlb_bad_size() when an unsupported
      hugepage size is found.
      Signed-off-by: Vaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 19 May, 2016 1 commit
  5. 17 May, 2016 9 commits
  6. 16 May, 2016 2 commits
  7. 13 May, 2016 2 commits
    • KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Committed by Christian Borntraeger
      Some wakeups should not be considered a successful poll. For example, on
      s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable, letting all vCPUs poll all the time for
      transaction-like workloads, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups as
      good or bad with regard to polling.
      
      For s390 the implementation will fence off halt polling for anything but
      known good, single vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We prefer the
      woken up CPU by marking the poll of this CPU as a "good" poll.
      This code will also mark several other wakeup reasons like IPI or
      expired timers as "good". This will of course also mark some events as
      not successful. As KVM on z always runs as a 2nd level hypervisor,
      we prefer not to poll unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like a uperf 1-byte
      transactional ping-pong workload or wakeup-heavy workloads like OLTP
      while still providing a proper speedup.
      
      This also introduces a new vcpu stat "halt_poll_no_tuning" that marks
      wakeups that are considered not good for polling.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • coredump: get rid of coredump_params->written · a0083939
      Committed by Omar Sandoval
      cprm->written is redundant with cprm->file->f_pos, so use that instead.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  8. 12 May, 2016 12 commits
    • powerpc/4xx: Device tree update for the 460ex DWC SATA · 951b8e4e
      Committed by Andy Shevchenko
      Device tree update for the Applied Micro 460ex processor's on-chip SATA
      to use the "dmas" property.
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • powerpc/powernv/npu: Add PE to PHB's list · 1d4e89cf
      Committed by Alexey Kardashevskiy
      Before commit 3e68dc57 "powerpc/powernv: Remove DMA32 PE list", NPU PEs
      were linked to the NPU PHB via phb->ioda.pe_dma_list; after that commit,
      phb->ioda.pe_list is used.
      
      During the pe_dma_list removal, list_add_tail(&phb->ioda.pe_dma_list)
      was removed, but no list_add() was added in its place; this patch adds it.
      
      Fixes: 3e68dc57219a ("powerpc/powernv: Remove DMA32 PE list")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Fix insufficient memory allocation · 92a86756
      Committed by Alexey Kardashevskiy
      The pnv_pci_init_ioda_phb() helper allocates a blob to store auxiliary
      data such as PE and M32/M64 segment allocation maps; this single blob has
      a few partitions, and the size of each is derived from the PE number,
      phb->ioda.total_pe_num.
      
      It was assumed that the minimum PE number is 8, however it is 4 for NPU
      so the pe_alloc part was missing in the allocated blob. It was invisible
      till recently as we were not tracking used M64 segments and NPUs do not
      use M32 segments so the phb->ioda.m32_segmap (which was pointing to the
      same address as phb->ioda.pe_alloc) has never been written to leaving
      the pe_alloc memory intact.
      
      After commit 401203ac2d "powerpc/powernv: Track M64 segment consumption"
      the pe_alloc gets corrupted and PE allocation cannot work. This fixes
      the issue by enforcing the minimum PE number to 8.
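
The arithmetic behind this bug can be shown with a small sketch. It assumes the PE allocation map is a one-bit-per-PE bitmap whose byte size is the PE count divided by 8 and rounded up to a whole unsigned long; that assumption, along with the names pe_alloc_bytes() and ALIGN_UP(), is illustrative and not taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of a. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

/* Bytes reserved for a one-bit-per-PE allocation bitmap.
 * min_pe_num is the floor applied before dividing by 8. */
static size_t pe_alloc_bytes(unsigned int total_pe_num, unsigned int min_pe_num)
{
    unsigned int n = total_pe_num > min_pe_num ? total_pe_num : min_pe_num;

    assert(total_pe_num > 0);
    return ALIGN_UP((size_t)(n / 8), sizeof(unsigned long));
}
```

With 4 PEs and no floor, the integer division 4 / 8 yields 0, so the bitmap partition is empty and the next partition starts at the same address. Enforcing a floor of 8 gives the bitmap at least one unsigned long.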
      
      Fixes: 401203ac2d15 ("powerpc/powernv: Track M64 segment consumption")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism · 8445a87f
      Committed by Guilherme G. Piccoli
      Commit 39baadbf ("powerpc/eeh: Remove eeh information from pci_dn")
      changed the pci_dn struct by removing its EEH-related members.
      As part of this clean-up, DDW mechanism was modified to read the device
      configuration address from eeh_dev struct.
      
      As a consequence, if we now disable the EEH mechanism on the kernel
      command line, for example, the DDW mechanism will fail, generating a
      kernel oops by dereferencing a NULL pointer (which turns out to be the
      eeh_dev pointer).
      
      This patch just changes the configuration address calculation on DDW
      functions to a manual calculation based on pci_dn members instead of
      using eeh_dev-based address.
      
      No functional changes were made. This was tested on pSeries, both
      in PHyp and qemu guest.
      
      Fixes: 39baadbf ("powerpc/eeh: Remove eeh information from pci_dn")
      Cc: stable@vger.kernel.org # v3.4+
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" · c2078d9e
      Committed by Guilherme G. Piccoli
      This reverts commit 89a51df5.
      
      The function eeh_add_device_early() is used to perform EEH
      initialization in devices added later on the system, like in
      hotplug/DLPAR scenarios. Since the commit 89a51df5 ("powerpc/eeh:
      Fix crash in eeh_add_device_early() on Cell") a new check was introduced
      in this function - Cell has no EEH capabilities which led to kernel oops
      if hotplug was performed, so checking for eeh_enabled() was introduced
      to avoid the issue.
      
      However, on architectures where EEH is present, like pSeries or PowerNV,
      we might reach a case in which no PCI devices are present at boot time
      and so EEH is not initialized. Then, if a device is added via DLPAR for
      example, eeh_add_device_early() fails because eeh_enabled() is false,
      and EEH ends up not being enabled at all.
      
      This reverts the aforementioned patch since a new verification was
      introduced by the commit d91dafc0 ("powerpc/eeh: Delay probing EEH
      device during hotplug") and so the original Cell issue does not happen
      anymore.
      
      Cc: stable@vger.kernel.org # v4.1+
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner() · d6d63d72
      Committed by Gavin Shan
      The label "reset" in eeh_pe_change_owner() is used only once.
      There is no need to keep it, so drop it. No logical changes introduced.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover() · 2efc771f
      Committed by Gavin Shan
      The function eeh_pe_reset_and_recover() is used to recover from an EEH
      error when a passthrough device is transferred to the guest and
      back, meaning the device's driver is vfio-pci or none. In
      both cases, the handlers triggered by eeh_report_reset() and
      eeh_report_resume() shouldn't be called.
      
      This ignores the error handlers from eeh_report_reset() and
      eeh_report_resume().
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover() · 5a0cdbfd
      Committed by Gavin Shan
      The function eeh_pe_reset_and_recover() is used to recover from an EEH
      error when a passthrough device is transferred to the guest and
      back. The content of the device's config space will be lost
      on the PE reset issued in the middle of the recovery. The function
      saves/restores it before/after the reset. However, config access
      to some adapters like the Broadcom BCM5719 at this point will cause
      a fenced PHB. The config space is always blocked and we save 0xFF's
      that are restored at a later point. The memory BARs are totally
      corrupted, causing another EEH error upon access to one of the
      memory BARs.
      
      This restores the config space on adapters like the BCM5719
      from the content saved to the EEH device when it is populated,
      to resolve the above issue.
      
      Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
      Cc: stable@vger.kernel.org #v3.18+
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/eeh: Don't report error in eeh_pe_reset_and_recover() · affeb0f2
      Committed by Gavin Shan
      The function eeh_pe_reset_and_recover() is used to recover from an EEH
      error when a passthrough device is transferred to the guest and
      back, meaning the device's driver is vfio-pci or none.
      When the driver is vfio-pci, which provides only the error_detected()
      error handler, the handler simply stops the guest, which is not the
      expected behaviour. On the other hand, no error handlers will
      be called if we don't have a bound driver.
      
      This ignores the error handler in eeh_pe_reset_and_recover()
      that reports the error to device driver to avoid the exceptional
      behaviour.
      
      Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
      Cc: stable@vger.kernel.org #v3.18+
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()" · 848912e5
      Committed by Michael Ellerman
      This reverts commit c8ceacc2.
      
      Gavin says: I missed the fact that it affects the PCI passthrough path as
      reported by Alexey: When passing a GPU (0003:01:00.0) which sits behind
      the root port, the reset request is routed to skiboot in the original code.
      In skiboot, the link bouncing events are masked during the reset. So we
      don't see an EEH (freeze all) error even if link bouncing happens. With the
      changes included, the reset is done by the kernel and the link bouncing
      events aren't masked by altering the content of PHB3 (or P7IOC) specific
      hardware registers, which are invisible to the kernel (skiboot hides the
      hardware specifics). It means the link bouncing is seen by the root port
      and it causes an EEH (freeze all) error. PCI passthrough of the GPU
      device cannot work.
      Requested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Requested-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts · b1a4286b
      Committed by Paul Mackerras
      Commit c9a5ecca ("kvm/eventfd: add arch-specific set_irq",
      2015-10-16) added the possibility for architecture-specific code
      to handle the generation of virtual interrupts in atomic context
      where possible, without having to schedule a work function.
      
      Since we can easily generate virtual interrupts on XICS without
      having to do anything worse than take a spinlock, we define a
      kvm_arch_set_irq_inatomic() for XICS.  We also remove kvm_set_msi()
      since it is not used any more.
      
      The one slightly tricky thing is that with the new interface, we
      don't get told whether the interrupt is an MSI (or other edge
      sensitive interrupt) vs. level-sensitive.  The difference as far
      as interrupt generation is concerned is that for LSIs we have to
      set the asserted flag so it will continue to fire until it is
      explicitly cleared.
      
      In fact the XICS code gets told which interrupts are LSIs by userspace
      when it configures the interrupt via the KVM_DEV_XICS_GRP_SOURCES
      attribute group on the XICS device.  To store this information, we add
      a new "lsi" field to struct ics_irq_state.  With that we can also do a
      better job of returning accurate values when reading the attribute
      group.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • kvm: introduce KVM_MAX_VCPU_ID · 0b1b1dfd
      Committed by Greg Kurz
      The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and
      also the upper limit for vCPU ids. This is okay for all archs except PowerPC
      which can have higher ids, depending on the cpu/core/thread topology. In the
      worst case (single threaded guest, host with 8 threads per core), it limits
      the maximum number of vCPUs to KVM_MAX_VCPUS / 8.
      
      This patch separates the vCPU numbering from the total number of vCPUs, with
      the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids
      plus one.
      
      The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
      before passing them to KVM_CREATE_VCPU.
      
      This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
      Other archs continue to return KVM_MAX_VCPUS instead.
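
The worst case mentioned above can be worked through with a sketch. It assumes vCPU ids are spaced by the host's threads per core, so that each guest core starts at a multiple of 8; that layout, the constants, and the helper names are assumptions for illustration, not values taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_VCPUS       1024u  /* hypothetical vCPU count limit */
#define HOST_THREADS_PER_CORE  8u     /* worst case from the text above */
#define SKETCH_MAX_VCPU_ID     (SKETCH_MAX_VCPUS * HOST_THREADS_PER_CORE)

/* Id of the index-th vCPU when every guest core starts at a multiple
 * of the host's threads per core (assumed layout). */
static unsigned int vcpu_id_for(unsigned int index,
                                unsigned int guest_threads_per_core)
{
    assert(guest_threads_per_core > 0 &&
           guest_threads_per_core <= HOST_THREADS_PER_CORE);
    return (index / guest_threads_per_core) * HOST_THREADS_PER_CORE +
           index % guest_threads_per_core;
}

/* Ids are valid below the limit, i.e. the limit is the maximum id plus one. */
static bool vcpu_id_valid(unsigned int id, unsigned int id_limit)
{
    return id < id_limit;
}
```

With a single-threaded guest, vCPU number 128 already gets id 1024, so checking ids against the vCPU count allows only 1024 / 8 = 128 vCPUs. Checking against a separate id limit lifts that restriction.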
      Suggested-by: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 11 May, 2016 9 commits
    • powerpc/powernv/npu: Enable NVLink pass through · b5cb9ab1
      Committed by Alexey Kardashevskiy
      IBM POWER8 NVlink systems come with Tesla K40-ish GPUs each of which
      also has a couple of fast speed links (NVLink). The interface to links
      is exposed as an emulated PCI bridge which is included into the same
      IOMMU group as the corresponding GPU.
      
      In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE
      which behave pretty much as the standard IODA2 PHB except NPU PHB has
      just a single TVE in the hardware which means it can have either
      32bit window or 64bit window or DMA bypass but never two of these.
      
      In order to make these links work when GPU is passed to the guest,
      these bridges need to be passed as well; otherwise performance will
      degrade.
      
      This implements and exports API to manage NPU state in regard to VFIO;
      it replicates iommu_table_group_ops.
      
      This defines a new pnv_pci_ioda2_npu_ops which is assigned to
      the IODA2 bridge if there are NPUs for a GPU on the bridge.
      The new callbacks call the default IODA2 callbacks plus new NPU API.
      This adds a gpe_table_group_to_npe() helper to find NPU PE for the IODA2
      table_group, it is not expected to fail as the helper is only called
      from the pnv_pci_ioda2_npu_ops.
      
      This does not define an NPU-specific .release_ownership(), so after
      VFIO is finished, DMA on the NPU is disabled. This is ok as the nvidia
      driver sets the DMA mask when probing, which enables 32 or 64bit DMA on
      the NPU.
      
      This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to
      the GPU group if any found. The helper uses helpers to look for
      the "ibm,gpu" property in the device tree which is a phandle of
      the corresponding GPU.
      
      This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main
      loop skips NPU PEs as they do not have 32bit DMA segments.
      
      As pnv_npu_set_window() and pnv_npu_unset_window() are now used
      by the new IODA2-NPU IOMMU group, this makes the helpers public and
      adds the DMA window number parameter.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-By: Alistair Popple <alistair@popple.id.au>
      [mpe: Add pnv_pci_ioda_setup_iommu_api() to fix build with IOMMU_API=n]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Rework TCE Kill handling · 85674868
      Committed by Alexey Kardashevskiy
      The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
      used to link GPU and NPU for 2 purposes:
      
      1. Access NPU quickly when configuring DMA for GPU - this was addressed
      in the previous patch by removing use of it as DMA setup is not what
      the kernel would constantly do.
      
      2. Invalidate TCE cache for NPU when it is invalidated for GPU.
      GPU and NPU are in different PEs. There is already a mechanism to
      attach multiple iommu_table_group to the same iommu_table (used for VFIO);
      this patch reuses it here.
      
      This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are
      not needed anymore.
      
      While we are here, add TCE cache invalidation after enabling bypass.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-By: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Add set/unset window helpers · b575c731
      Committed by Alexey Kardashevskiy
      The upcoming NVLink passthrough support will require NPU code to cope
      with two DMA windows.
      
      This adds a pnv_npu_set_window() helper which programs 32bit window to
      the hardware. This also adds multilevel TCE support.
      
      This adds a pnv_npu_unset_window() helper which removes the DMA window
      from the hardware. This makes no difference for now as the caller,
      pnv_npu_dma_set_bypass(), enables bypass in the hardware, but the next
      patch will use it to manage TCE table lists for TCE Kill handling.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-By: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/ioda2: Export debug helper pe_level_printk() · 7d623e42
      Committed by Alexey Kardashevskiy
      This exports the debugging helper pe_level_printk() and the corresponding
      macros so they can be used in npu-dma.c.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-By: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Simplify DMA setup · f9f83456
      Committed by Alexey Kardashevskiy
      NPU devices are emulated in firmware and mainly used for NPU NVLink
      training; there is one NPU device per hardware link. Their DMA/TCE setup
      must match the GPU which is connected via PCIe and NVLink so any changes
      to the DMA/TCE setup on the GPU PCIe device need to be propagated to
      the NVLink device as this is what device drivers expect and it doesn't
      make much sense to do anything else.
      
      This makes NPU DMA setup explicit.
      pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda,
      made static and prints warning as dma_set_mask() should never be called
      on this function as in any case it will not configure GPU; so we make
      this explicit.
      
      Instead of using PNV_IODA_PE_PEER and peers[] (which the next patch will
      remove), we test every PCI device if there are corresponding NVLink
      devices. If there are any, we propagate bypass mode to just found NPU
      devices by calling the setup helper directly (which takes @bypass) and
      avoid guessing (i.e. calculating from DMA mask) whether we need bypass
      or not on NPU devices. Since DMA setup happens only on rare occasions,
      this will not slow down booting or VFIO start/stop much.
      
      This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it
      clearer what the function really does, which is programming the 32bit
      table address to the TVT ("disabling bypass" means writing zeroes to
      the TVT).
      
      This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as
      the DMA configuration on NPU does not matter until dma_set_mask() is
      called on GPU and that will do the NPU DMA configuration.
      
      This removes phb->dma_dev_setup initialization for NPU as
      pnv_pci_ioda_dma_dev_setup is no-op for it anyway.
      
      This stops using npe->tce_bypass_base as it never changes and values
      other than zero are not supported.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Use the correct IOMMU page size · 6969af73
      Committed by Alexey Kardashevskiy
      This uses the page size from iommu_table instead of hard-coded 4K.
      This should cause no change in behavior.
      
      While we are here, move bits around to prepare for further rework
      which will define and use iommu_table_group_ops.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: TCE Kill helpers cleanup · 0bbcdb43
      Committed by Alexey Kardashevskiy
      NPU PHB TCE Kill register is exactly the same as in the rest of POWER8
      so let's reuse the existing code for NPU. The only bit missing is
      a helper to reset the entire TCE cache so this moves such a helper
      from NPU code and renames it.
      
      Since pnv_npu_tce_invalidate() does really invalidate the entire cache,
      this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU.
      This adds an explicit comment for workaround for invalidating NPU TCE
      cache.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Define TCE Kill flags · bef9253f
      Committed by Alexey Kardashevskiy
      This replaces magic constants for TCE Kill IODA2 register with macros.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Rename pnv_pci_ioda2_tce_invalidate_entire · a7cf13ca
      Committed by Alexey Kardashevskiy
      As in fact pnv_pci_ioda2_tce_invalidate_entire() invalidates TCEs for
      the specific PE rather than the entire cache, rename it to
      pnv_pci_ioda2_tce_invalidate_pe(). In later patches we will add
      a proper pnv_pci_ioda2_tce_invalidate_entire().
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>