1. 28 4月, 2017 2 次提交
  2. 26 4月, 2017 2 次提交
    • M
      powerpc/powernv: Fix oops on P9 DD1 in cause_ipi() · 45b21cfe
      Michael Ellerman 提交于
      Recently we merged the native xive support for Power9, and then separately some
      reworks for doorbell IPI support. In isolation both series were OK, but the
      merged result had a bug in one case.
      
      On P9 DD1 we use pnv_p9_dd1_cause_ipi() which tries to use doorbells, and then
      falls back to the interrupt controller. However the fallback is implemented by
      calling icp_ops->cause_ipi. But now that xive support is merged we might be
      using xive, in which case icp_ops is not initialised, it's a xics specific
      structure. This leads to an oops such as:
      
        Unable to handle kernel paging request for data at address 0x00000028
        Oops: Kernel access of bad area, sig: 11 [#1]
        NIP pnv_p9_dd1_cause_ipi+0x74/0xe0
        LR smp_muxed_ipi_message_pass+0x54/0x70
      
      To fix it, rather than using icp_ops which might be NULL, have both xics and
      xive set smp_ops->cause_ipi, and then in the powernv code we save that as
      ic_cause_ipi before overriding smp_ops->cause_ipi. For paranoia add a WARN_ON()
      to check if somehow smp_ops->cause_ipi is NULL.
      
      Fixes: b866cc21 ("powerpc: Change the doorbell IPI calling convention")
      Tested-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      45b21cfe
    • M
      powerpc/powernv: Fix missing attr initialisation in opal_export_attrs() · 83c49190
      Michael Ellerman 提交于
      In opal_export_attrs() we dynamically allocate some bin_attributes. They're
      allocated with kmalloc() and although we initialise most of the fields, we don't
      initialise write() or mmap(), and in particular we don't initialise the lockdep
      related fields in the embedded struct attribute.
      
      This leads to a lockdep warning at boot:
      
        BUG: key c0000000f11906d8 not in .data!
        WARNING: CPU: 0 PID: 1 at ../kernel/locking/lockdep.c:3136 lockdep_init_map+0x28c/0x2a0
        ...
        Call Trace:
          lockdep_init_map+0x288/0x2a0 (unreliable)
          __kernfs_create_file+0x8c/0x170
          sysfs_add_file_mode_ns+0xc8/0x240
          __machine_initcall_powernv_opal_init+0x60c/0x684
          do_one_initcall+0x60/0x1c0
          kernel_init_freeable+0x2f4/0x3d4
          kernel_init+0x24/0x160
          ret_from_kernel_thread+0x5c/0xb0
      
      Fix it by kzalloc'ing the attr, which fixes the uninitialised write() and
      mmap(), and calling sysfs_bin_attr_init() on it to initialise the lockdep
      fields.
      
      Fixes: 11fe909d ("powerpc/powernv: Add OPAL exports attributes to sysfs")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      83c49190
  3. 23 4月, 2017 1 次提交
    • N
      powerpc/64s: Stop using bit in HSPRG0 to test winkle · 544686ca
      Nicholas Piggin 提交于
      The POWER8 idle code has a neat trick of programming the power on engine
      to restore a low bit into HSPRG0, so idle wakeup code can test and see
      if it has been programmed this way and therefore lost all state. Restore
      time can be reduced if winkle has not been reached.
      
      However this messes with our r13 PACA pointer, and requires HSPRG0 to be
      written to. It also optimizes the slowest and most uncommon case at the
      expense of another SPR write in the common nap state wakeup.
      
      Remove this complexity and assume winkle sleeps always require a state
      restore. This speedup could be made entirely contained within the winkle
      idle code by counting per-core winkles and setting a thread bitmap when
      all have gone to winkle.
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      544686ca
  4. 13 4月, 2017 3 次提交
    • M
      powerpc/powernv: Always enable SMP when building powernv · 40e27565
      Michael Ellerman 提交于
      The powernv platform supports Power7 and later CPUs, all of which are
      multithreaded and multicore.
      
      As such we never build a SMP=n kernel for those machines, other than
      possibly for debugging or running in a simulator.
      
      In the debugging case we can get a similar effect by booting with
      nr_cpus=1, or there's always the option of building a custom kernel with
      SMP hacked out.
      
      For running in simulators the code size reduction from building without
      SMP is not particularly important, what matters is the number of
      instructions executed. A quick test shows that a SMP=y kernel takes ~6%
      more instructions to boot to a shell. Booting with nr_cpus=1 recovers
      about half that deficit.
      
      On the flip side, keeping the SMP=n kernel building can be a pain at
      times. And although we've mostly kept it building in recent years, no
      one is regularly testing that the SMP=n kernel actually boots and works
      well on these machines.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      40e27565
    • N
      powerpc/powernv: POWER9 support for msgsnd/doorbell IPI · 6b3edefe
      Nicholas Piggin 提交于
      POWER9 requires msgsync for receiver-side synchronization, and a DD1
      workaround restricts IPIs to core-local.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      [mpe: Drop no longer needed asm feature macro changes]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      6b3edefe
    • N
      powerpc: Change the doorbell IPI calling convention · b866cc21
      Nicholas Piggin 提交于
      Change the doorbell callers to know about their msgsnd addressing,
      rather than have them set a per-cpu target data tag at boot that gets
      sent to the cause_ipi functions. The data is only used for doorbell IPI
      functions, no other IPI types, so it makes sense to keep that detail
      local to doorbell.
      
      Have the platform code understand doorbell IPIs, rather than the
      interrupt controller code understand them. Platform code can look at
      capabilities it has available and decide which to use.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      b866cc21
  5. 11 4月, 2017 6 次提交
    • G
      powerpc/powernv: Recover correct PACA on wakeup from a stop on P9 DD1 · 17ed4c8f
      Gautham R. Shenoy 提交于
      POWER9 DD1.0 hardware has a bug where the SPRs of a thread waking up
      from stop 0,1,2 with ESL=1 can endup being misplaced in the core. Thus
      the HSPRG0 of a thread waking up from can contain the paca pointer of
      its sibling.
      
      This patch implements a context recovery framework within threads of a
      core, by provisioning space in paca_struct for saving every sibling
      threads's paca pointers. Basically, we should be able to arrive at the
      right paca pointer from any of the thread's existing paca pointer.
      
      At bootup, during powernv idle-init, we save the paca address of every
      CPU in each one its siblings paca_struct in the slot corresponding to
      this CPU's index in the core.
      
      On wakeup from a stop, the thread will determine its index in the core
      from the TIR register and recover its PACA pointer by indexing into
      the correct slot in the provisioned space in the current PACA.
      
      Furthermore, ensure that the NVGPRs are restored from the stack on the
      way out by setting the NAPSTATELOST in paca.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      [mpe: Call it a bug]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      17ed4c8f
    • G
      powerpc/powernv/idle: Don't override default/deepest directly in kernel · f3b3f284
      Gautham R. Shenoy 提交于
      Currently during idle-init on power9, if we don't find suitable stop
      states in the device tree that can be used as the
      default_stop/deepest_stop, we set stop0 (ESL=1,EC=1) as the default
      stop state psscr to be used by power9_idle and deepest stop state
      which is used by CPU-Hotplug.
      
      However, if the platform firmware has not configured or enabled a stop
      state, the kernel should not make any assumptions and fallback to a
      default choice.
      
      If the kernel uses a stop state that is not configured by the platform
      firmware, it may lead to further failures which should be avoided.
      
      In this patch, we modify the init code to ensure that the kernel uses
      only the stop states exposed by the firmware through the device
      tree. When a suitable default stop state isn't found, we disable
      ppc_md.power_save for power9. Similarly, when a suitable
      deepest_stop_state is not found in the device tree exported by the
      firmware, fall back to the default busy-wait loop in the CPU-Hotplug
      code.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f3b3f284
    • G
      powerpc/powernv/smp: Add busy-wait loop as fall back for CPU-Hotplug · 90061231
      Gautham R. Shenoy 提交于
      Currently, the powernv cpu-offline function assumes that platform idle
      states such as stop on POWER9, winkle/sleep/nap on POWER8 are always
      available. On POWER8, it picks nap as the default state if other deep
      idle states like sleep/winkle are not available and enabled in the
      platform.
      
      On POWER9, nap is not available and all idle states are managed by
      STOP instruction.  The parameters to the idle state are passed through
      processor stop status control register (PSSCR).  Hence as such
      executing STOP would take parameters from current PSSCR. We do not
      want to make any assumptions in kernel on what STOP states and PSSCR
      features are configured by the platform.
      
      Ideally platform will configure a good set of stop states that can be
      used in the kernel.  We would like to start with a clean slate, if the
      platform choose to not configure any state or there is an error in
      platform firmware that lead to no stop states being configured or
      allowed to be requested.
      
      This patch adds a fallback method for CPU-Hotplug that is similar to
      snooze loop at idle where the threads are left to spin at low priority
      and hence reduce the cycles consumed.
      
      This is a safe fallback mechanism in the case when no stop state would
      be requested if the platform firmware did not configure them most
      likely due to an error condition.
      
      Requesting a stop state when the platform has not configured them or
      enabled them would lead to further error conditions which could be
      difficult to debug.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      90061231
    • G
      powerpc/powernv: Move CPU-Offline idle state invocation from smp.c to idle.c · a7cd88da
      Gautham R. Shenoy 提交于
      Move the piece of code in powernv/smp.c::pnv_smp_cpu_kill_self() which
      transitions the CPU to the deepest available platform idle state to a
      new function named pnv_cpu_offline() in powernv/idle.c. The rationale
      behind this code movement is that the data required to determine the
      deepest available platform state resides in powernv/idle.c.
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a7cd88da
    • M
      powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there · 7644d581
      Michael Ellerman 提交于
      powerpc_debugfs_root is the dentry representing the root of the
      "powerpc" directory tree in debugfs.
      
      Currently it sits in asm/debug.h, a long with some other things that
      have "debug" in the name, but are otherwise unrelated.
      
      Pull it out into a separate header, which also includes linux/debugfs.h,
      and convert all the users to include debugfs.h instead of debug.h.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7644d581
    • A
      powerpc/powernv: Require MMU_NOTIFIER to fix NPU build · abfe8026
      Alistair Popple 提交于
      In the recent commit 1ab66d1f ("powerpc/powernv: Introduce address
      translation services for Nvlink2") the NPU code gained a dependency on MMU
      notifiers.
      
      All our defconfigs have KVM enabled, which selects MMU_NOTIFIER, but if KVM is
      not enabled then the build breaks.
      
      Fix it by always selecting MMU_NOTIFIER when we're building powernv.
      
      Fixes: 1ab66d1f ("powerpc/powernv: Introduce address translation services for Nvlink2")
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      [mpe: Reword change log]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      abfe8026
  6. 10 4月, 2017 2 次提交
    • B
      powerpc: Consolidate variants of real-mode MMIOs · d381d7ca
      Benjamin Herrenschmidt 提交于
      We have all sort of variants of MMIO accessors for the real mode
      instructions. This creates a clean set of accessors based on
      Linux normal naming conventions, replacing all occurrences of
      the old ones in the tree.
      
      I have purposefully removed the "out/in" variants in favor of
      only including __raw variants. Any code using these is already
      pretty much hand tuned to operate in a very specific environment.
      I've fixed up the 2 users (only one of them actually needed
      a barrier in the first place).
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d381d7ca
    • B
      powerpc/xive: Native exploitation of the XIVE interrupt controller · 243e2511
      Benjamin Herrenschmidt 提交于
      The XIVE interrupt controller is the new interrupt controller
      found in POWER9. It supports advanced virtualization capabilities
      among other things.
      
      Currently we use a set of firmware calls that simulate the old
      "XICS" interrupt controller but this is fairly inefficient.
      
      This adds the framework for using XIVE along with a native
      backend which OPAL for configuration. Later, a backend allowing
      the use in a KVM or PowerVM guest will also be provided.
      
      This disables some fast path for interrupts in KVM when XIVE is
      enabled as these rely on the firmware emulation code which is no
      longer available when the XIVE is used natively by Linux.
      
      A latter patch will make KVM also directly exploit the XIVE, thus
      recovering the lost performance (and more).
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Fixup pr_xxx("XIVE:"...), don't split pr_xxx() strings,
       tweak Kconfig so XIVE_NATIVE selects XIVE and depends on POWERNV,
       fix build errors when SMP=n, fold in fixes from Ben:
         Don't call cpu_online() on an invalid CPU number
         Fix irq target selection returning out of bounds cpu#
         Extra sanity checks on cpu numbers
       ]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      243e2511
  7. 06 4月, 2017 1 次提交
  8. 04 4月, 2017 2 次提交
    • M
      powerpc/powernv: Add OPAL exports attributes to sysfs · 11fe909d
      Matt Brown 提交于
      New versions of OPAL have a device node /ibm,opal/firmware/exports, each
      property of which describes a range of memory in OPAL that Linux might
      want to export to userspace for debugging.
      
      This patch adds a sysfs file under 'opal/exports' for each property
      found there, and makes it read-only by root.
      Signed-off-by: NMatt Brown <matthew.brown.dev@gmail.com>
      [mpe: Drop counting of props, rename to attr, free on sysfs error, c'log]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      11fe909d
    • A
      powerpc/powernv: Introduce address translation services for Nvlink2 · 1ab66d1f
      Alistair Popple 提交于
      Nvlink2 supports address translation services (ATS) allowing devices
      to request address translations from an mmu known as the nest MMU
      which is setup to walk the CPU page tables.
      
      To access this functionality certain firmware calls are required to
      setup and manage hardware context tables in the nvlink processing unit
      (NPU). The NPU also manages forwarding of TLB invalidates (known as
      address translation shootdowns/ATSDs) to attached devices.
      
      This patch exports several methods to allow device drivers to register
      a process id (PASID/PID) in the hardware tables and to receive
      notification of when a device should stop issuing address translation
      requests (ATRs). It also adds a fault handler to allow device drivers
      to demand fault pages in.
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      [mpe: Fix up comment formatting, use flush_tlb_mm()]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1ab66d1f
  9. 03 4月, 2017 2 次提交
  10. 31 3月, 2017 2 次提交
  11. 30 3月, 2017 3 次提交
  12. 28 3月, 2017 2 次提交
  13. 20 3月, 2017 2 次提交
  14. 10 3月, 2017 1 次提交
  15. 09 3月, 2017 2 次提交
    • A
      powerpc/powernv/ioda2: Update iommu table base on ownership change · db08e1d5
      Alexey Kardashevskiy 提交于
      On POWERNV platform, in order to do DMA via IOMMU (i.e. 32bit DMA in
      our case), a device needs an iommu_table pointer set via
      set_iommu_table_base().
      
      The codeflow is:
      - pnv_pci_ioda2_setup_dma_pe()
      	- pnv_pci_ioda2_setup_default_config()
      	- pnv_ioda_setup_bus_dma() [1]
      
      pnv_pci_ioda2_setup_dma_pe() creates IOMMU groups,
      pnv_pci_ioda2_setup_default_config() does default DMA setup,
      pnv_ioda_setup_bus_dma() takes a bus PE (on IODA2, all physical function
      PEs as bus PEs except NPU), walks through all underlying buses and
      devices, adds all devices to an IOMMU group and sets iommu_table.
      
      On IODA2, when VFIO is used, it takes ownership over a PE which means it
      removes all tables and creates new ones (with a possibility of sharing
      them among PEs). So when the ownership is returned from VFIO to
      the kernel, the iommu_table pointer written to a device at [1] is
      stale and needs an update.
      
      This adds an "add_to_group" parameter to pnv_ioda_setup_bus_dma()
      (in fact re-adds as it used to be there a while ago for different
      reasons) to tell the helper if a device needs to be added to
      an IOMMU group with an iommu_table update or just the latter.
      
      This calls pnv_ioda_setup_bus_dma(..., false) from
      pnv_ioda2_release_ownership() so when the ownership is restored,
      32bit DMA can work again for a device. This does the same thing
      on obtaining ownership as the iommu_table point is stale at this point
      anyway and it is safer to have NULL there.
      
      We did not hit this earlier as all tested devices in recent years were
      only using 64bit DMA; the rare exception for this is MPT3 SAS adapter
      which uses both 32bit and 64bit DMA access and it has not been tested
      with VFIO much.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      db08e1d5
    • A
      powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested · 7aafac11
      Alexey Kardashevskiy 提交于
      The IODA2 specification says that a 64 DMA address cannot use top 4 bits
      (3 are reserved and one is a "TVE select"); bottom page_shift bits
      cannot be used for multilevel table addressing either.
      
      The existing IODA2 table allocation code aligns the minimum TCE table
      size to PAGE_SIZE so in the case of 64K system pages and 4K IOMMU pages,
      we have 64-4-12=48 bits. Since 64K page stores 8192 TCEs, i.e. needs
      13 bits, the maximum number of levels is 48/13 = 3 so we physically
      cannot address more and EEH happens on DMA accesses.
      
      This adds a check that too many levels were requested.
      
      It is still possible to have 5 levels in the case of 4K system page size.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7aafac11
  16. 04 3月, 2017 1 次提交
    • A
      powerpc/powernv: Fix opal tracepoints with JUMP_LABEL=n · 2a9c4f40
      Alexey Kardashevskiy 提交于
      The recent commit to allow calling OPAL calls in real mode, commit
      ab9bad0e ("powerpc/powernv: Remove separate entry for OPAL real mode
      calls"), introduced a bug when CONFIG_JUMP_LABEL=n.
      
      The commit moved the "mfmsr r12" prior to the call to OPAL_BRANCH, but
      we missed that OPAL_BRANCH clobbers r12 when jump labels are disabled.
      This leads to us using the tracepoint refcount as the MSR value,
      typically zero, and saving that into PACASAVEDMSR. When we return from
      OPAL we use that value as the MSR value for rfid, meaning we switch to
      32-bit BE real mode - hilarity ensues.
      
      Fix it by using r11 in OPAL_BRANCH, which is not live at the time the
      macro is used in OPAL_CALL.
      
      Fixes: ab9bad0e ("powerpc/powernv: Remove separate entry for OPAL real mode calls")
      Suggested-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2a9c4f40
  17. 02 3月, 2017 1 次提交
  18. 28 2月, 2017 1 次提交
  19. 20 2月, 2017 1 次提交
  20. 17 2月, 2017 1 次提交
  21. 10 2月, 2017 1 次提交
    • M
      powerpc/powernv: Fix opal_exit tracepoint opcode · a7e0fb6c
      Michael Ellerman 提交于
      Currently the opal_exit tracepoint usually shows the opcode as 0:
      
        <idle>-0     [047] d.h.   635.654292: opal_entry: opcode=63
        <idle>-0     [047] d.h.   635.654296: opal_exit: opcode=0 retval=0
        kopald-1209  [019] d...   636.420943: opal_entry: opcode=10
        kopald-1209  [019] d...   636.420959: opal_exit: opcode=0 retval=0
      
      This is because we incorrectly load the opcode into r0 before calling
      __trace_opal_exit(), whereas it expects the opcode in r3 (first function
      parameter). In fact we are leaving the retval in r3, so opcode and
      retval will always show the same value.
      
      Instead load the opcode into r3, resulting in:
      
        <idle>-0     [040] d.h.   636.618625: opal_entry: opcode=63
        <idle>-0     [040] d.h.   636.618627: opal_exit: opcode=63 retval=0
      
      Fixes: c49f6353 ("powernv: Add OPAL tracepoints")
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a7e0fb6c
  22. 09 2月, 2017 1 次提交