1. 29 Apr, 2019 1 commit
    • powerpc/pseries: Track LMB nid instead of using device tree · b2d3b5ee
      Nathan Fontenot committed
      When removing memory we need to remove the memory from the node
      it was added to instead of looking up the node it should be in
      in the device tree.
      
      During testing we have seen scenarios where the affinity for an
      LMB changes due to a partition migration or PRRN event. In these
      cases the node the LMB exists in may not match the node the device
      tree indicates it belongs in. This can lead to a system crash
      when trying to DLPAR remove the LMB after a migration or PRRN
      event. The current code looks up the node in the device tree to
      remove the LMB from; the crash occurs when we try to offline this
      node and it does not have any data, i.e. node_data[nid] == NULL.
      
      36:mon> e
      cpu 0x36: Vector: 300 (Data Access) at [c0000001828b7810]
          pc: c00000000036d08c: try_offline_node+0x2c/0x1b0
          lr: c0000000003a14ec: remove_memory+0xbc/0x110
          sp: c0000001828b7a90
         msr: 800000000280b033
         dar: 9a28
       dsisr: 40000000
        current = 0xc0000006329c4c80
        paca    = 0xc000000007a55200   softe: 0        irq_happened: 0x01
          pid   = 76926, comm = kworker/u320:3
      
      36:mon> t
      [link register   ] c0000000003a14ec remove_memory+0xbc/0x110
      [c0000001828b7a90] c00000000006a1cc arch_remove_memory+0x9c/0xd0 (unreliable)
      [c0000001828b7ad0] c0000000003a14e0 remove_memory+0xb0/0x110
      [c0000001828b7b20] c0000000000c7db4 dlpar_remove_lmb+0x94/0x160
      [c0000001828b7b60] c0000000000c8ef8 dlpar_memory+0x7e8/0xd10
      [c0000001828b7bf0] c0000000000bf828 handle_dlpar_errorlog+0xf8/0x160
      [c0000001828b7c60] c0000000000bf8cc pseries_hp_work_fn+0x3c/0xa0
      [c0000001828b7c90] c000000000128cd8 process_one_work+0x298/0x5a0
      [c0000001828b7d20] c000000000129068 worker_thread+0x88/0x620
      [c0000001828b7dc0] c00000000013223c kthread+0x1ac/0x1c0
      [c0000001828b7e30] c00000000000b45c ret_from_kernel_thread+0x5c/0x80
      
      To resolve this we need to track the node an LMB belongs to when
      it is added to the system so we can remove it from that node instead
      of the node that the device tree indicates it should belong to.

      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
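The idea in this commit can be illustrated with a small userspace sketch. It is not the kernel code: `struct toy_lmb`, `toy_dt_nid`, `toy_lmb_add()` and `toy_lmb_remove()` are hypothetical names, and the device tree is modeled as a single global that can change between add and remove.

```c
#include <assert.h>

#define TOY_NO_NODE (-1)

/* Toy stand-in for a logical memory block (hypothetical layout). */
struct toy_lmb {
    unsigned long base_addr;
    int nid;                /* node the block was actually added to */
};

/* Simulated device tree answer; may change after a migration or PRRN event. */
int toy_dt_nid = 0;

/* On add: consult the device tree once and remember the node. */
void toy_lmb_add(struct toy_lmb *lmb)
{
    lmb->nid = toy_dt_nid;
}

/* On remove: use the remembered node instead of a fresh device tree lookup. */
int toy_lmb_remove(struct toy_lmb *lmb)
{
    int nid = lmb->nid;

    lmb->nid = TOY_NO_NODE;
    return nid;
}
```

In this sketch, changing `toy_dt_nid` after `toy_lmb_add()` has no effect on the node returned by `toy_lmb_remove()`, which is the property the commit message asks for.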
  2. 20 Apr, 2019 5 commits
  3. 29 Mar, 2019 1 commit
  4. 27 Mar, 2019 1 commit
  5. 08 Mar, 2019 1 commit
    • powerpc: prefer memblock APIs returning virtual address · f806714f
      Mike Rapoport committed
      Patch series "memblock: simplify several early memory allocation", v4.
      
      These patches simplify some of the early memory allocations by replacing
      usage of older memblock APIs with newer and shinier ones.
      
      Quite a few places in the arch/ code allocated memory using a memblock
      API that returns a physical address of the allocated area, then
      converted this physical address to a virtual one and then used memset(0)
      to clear the allocated range.
      
      More recent memblock APIs do all three steps in one call and their
      usage simplifies the code.

      It's important to note that regardless of API used, the core allocation
      is nearly identical for any set of memblock allocators: first it tries
      to find free memory with all the constraints specified by the caller
      and then falls back to the allocation with some or all constraints
      disabled.
      
      The first three patches perform the conversion of call sites that have
      exact requirements for the node and the possible memory range.
      
      The fourth patch is a bit one-off as it simplifies openrisc's
      implementation of pte_alloc_one_kernel(), and not only the memblock
      usage.
      
      The fifth patch takes care of simpler cases when the allocation can be
      satisfied with a simple call to memblock_alloc().
      
      The sixth patch removes one-liner wrappers for memblock_alloc on arm and
      unicore32, as suggested by Christoph.
      
      This patch (of 6):
      
      There are several places that allocate memory using memblock APIs that
      return a physical address, convert the returned address to the virtual
      address and frequently also memset(0) the allocated range.
      
      Update these places to use memblock allocators already returning a
      virtual address.  Use memblock functions that clear the allocated memory
      instead of calling memset(0) where appropriate.
      
      The calls to memblock_alloc_base() that were not followed by memset(0)
      are replaced with memblock_alloc_try_nid_raw().  Since the latter does
      not panic() when the allocation fails, the appropriate panic() calls are
      added to the call sites.
      
      Link: http://lkml.kernel.org/r/1546248566-14910-2-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
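The three-step pattern described in this series (allocate a physical address, convert it to a virtual one, then clear it) versus a single call can be modeled in userspace. This is a toy sketch only: the arena, `toy_alloc_phys()`, `toy_phys_to_virt()` and `toy_alloc_zeroed()` are hypothetical stand-ins and not the memblock API, and alignment and node constraints are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_ARENA_SIZE 4096
#define TOY_ALLOC_FAILED ((size_t)-1)

static unsigned char toy_arena[TOY_ARENA_SIZE];
static size_t toy_next;     /* next free "physical" offset */

/* Older style: hand back a "physical" offset; the caller does the rest. */
size_t toy_alloc_phys(size_t size)
{
    size_t phys;

    if (size > TOY_ARENA_SIZE - toy_next)
        return TOY_ALLOC_FAILED;
    phys = toy_next;
    toy_next += size;
    return phys;
}

/* Convert a "physical" offset into a usable pointer. */
void *toy_phys_to_virt(size_t phys)
{
    return &toy_arena[phys];
}

/* Newer style: allocate, convert and clear in one call; NULL on failure. */
void *toy_alloc_zeroed(size_t size)
{
    size_t phys = toy_alloc_phys(size);

    if (phys == TOY_ALLOC_FAILED)
        return NULL;
    return memset(toy_phys_to_virt(phys), 0, size);
}
```

Because the one-call helper returns NULL rather than panicking, a caller in this model has to check the result itself, which mirrors the commit's note about adding panic() calls at call sites that switch to a non-panicking allocator.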
  6. 21 Feb, 2019 1 commit
    • powerpc/pseries: export timebase register sample in lparcfg · 9f3ba362
      Tyrel Datwyler committed
      The Processor Utilization of Resource Registers (PURR) provide an
      estimate of resources used by a cpu thread. Section 7.6 in Book III of
      the ISA outlines how to calculate the percentage of shared resources
      for threads using the ratio of the PURR delta and Timebase Register
      delta for a sampled period.
      
      This calculation is currently done erroneously by the lparstat tool
      from the powerpc-utils package. This patch exports the current
      timebase value after we sample the PURRs and exposes it to userspace
      accounting tools via /proc/ppc64/lparcfg.

      Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
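The ratio the commit message refers to can be sketched as follows. This is an illustration of the arithmetic only, assuming two samples that each pair a PURR value with a timebase value; `struct toy_sample` and `toy_purr_share()` are hypothetical names, and the result is expressed in hundredths of a percent to avoid floating point.

```c
#include <assert.h>
#include <stdint.h>

/* One sample: PURR value and the timebase read right after it (assumed layout). */
struct toy_sample {
    uint64_t purr;
    uint64_t timebase;
};

/*
 * Share of resources used over the sampled period, in hundredths of a
 * percent (2500 means 25.00%). Returns 0 if no timebase ticks elapsed.
 * The multiplication can overflow for very large PURR deltas; a real tool
 * would need to guard against that.
 */
uint64_t toy_purr_share(const struct toy_sample *start,
                        const struct toy_sample *end)
{
    uint64_t purr_delta = end->purr - start->purr;
    uint64_t tb_delta = end->timebase - start->timebase;

    if (tb_delta == 0)
        return 0;
    return (purr_delta * 10000u) / tb_delta;
}
```

Both deltas must come from the same sampling window, which is why the patch exports the timebase value taken right after the PURRs are sampled.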
  7. 18 Feb, 2019 6 commits
  8. 01 Feb, 2019 1 commit
    • powerpc/papr_scm: Use the correct bind address · 5a3840a4
      Oliver O'Halloran committed
      When binding an SCM volume to a physical address the hypervisor has the
      option to return early with a continue token with the expectation that
      the guest will resume the bind operation until it completes. A quirk of
      this interface is that the bind address will only be returned by the
      first bind h-call and the subsequent calls will return
      0xFFFF_FFFF_FFFF_FFFF for the bind address.
      
      We currently do not save the address returned by the first h-call. As a
      result we will use the junk address as the base of the bound region if
      the hypervisor decides to split the bind across multiple h-calls. This
      bug was found when testing with very large SCM volumes where the bind
      process would take more time than the hypervisor's internal h-call time
      limit would allow. This patch fixes the issue by saving the bind address
      from the first call.
      
      Cc: stable@vger.kernel.org
      Fixes: b5beae5e ("powerpc/pseries: Add driver for PAPR SCM regions")
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
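The fix described above can be illustrated with a simulated hypervisor. Everything here is hypothetical (`struct toy_hv`, `toy_bind_hcall()`, `toy_bind()` and the return codes are not the real h-call interface); the sketch only models the quirk that the first call returns the real address and later calls return all ones.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SUCCESS   0
#define TOY_CONTINUE  1
#define TOY_JUNK_ADDR UINT64_MAX    /* 0xFFFF_FFFF_FFFF_FFFF */

/* Simulated hypervisor: splits one bind across calls_needed calls. */
struct toy_hv {
    int calls_needed;
    int calls_made;
    uint64_t real_addr;
};

/* Only the first call reports the real bind address. */
int toy_bind_hcall(struct toy_hv *hv, uint64_t *addr_out)
{
    hv->calls_made++;
    *addr_out = (hv->calls_made == 1) ? hv->real_addr : TOY_JUNK_ADDR;
    return (hv->calls_made < hv->calls_needed) ? TOY_CONTINUE : TOY_SUCCESS;
}

/* Resume until done, keeping the address from the first call only. */
uint64_t toy_bind(struct toy_hv *hv)
{
    uint64_t saved_addr = 0;
    uint64_t ret_addr;
    int first = 1;
    int rc;

    do {
        rc = toy_bind_hcall(hv, &ret_addr);
        if (first) {
            saved_addr = ret_addr;
            first = 0;
        }
    } while (rc == TOY_CONTINUE);

    return saved_addr;
}
```

Taking the address from the last call instead would yield `TOY_JUNK_ADDR` whenever the bind is split, which is the bug the commit describes.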
  9. 30 Jan, 2019 1 commit
  10. 22 Jan, 2019 2 commits
    • pseries: ibmebus.c: convert to use BUS_ATTR_WO · c1507ea8
      Greg Kroah-Hartman committed
      We are trying to get rid of BUS_ATTR() and the usage of that in
      ibmebus.c can be trivially converted to use BUS_ATTR_WO(), so use that
      instead.
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc: Adopt nvram module for PPC64 · 20e07af7
      Finn Thain committed
      Adopt nvram module to reduce code duplication. This means CONFIG_NVRAM
      becomes available to PPC64 builds. Previously it was only available to
      PPC32 builds because it depended on CONFIG_GENERIC_NVRAM.
      
      The IOC_NVRAM_GET_OFFSET ioctl as implemented on PPC64 validates the
      offset returned by pmac_get_partition(). Do the same in the nvram module.
      
      Note that the old PPC32 generic_nvram module lacked this test.
      So when CONFIG_PPC32 && CONFIG_PPC_PMAC, the IOC_NVRAM_GET_OFFSET ioctl
      would have returned 0 (always). But when CONFIG_PPC64 && CONFIG_PPC_PMAC,
      the IOC_NVRAM_GET_OFFSET ioctl would have returned -1 (which is -EPERM)
      when the requested partition was not found.
      
      With this patch, the result is now -EINVAL on both PPC32 and PPC64 when
      the requested PowerMac NVRAM partition is not found. This is a userspace-
      visible change, in the non-existent partition case, which would be in
      an error path for an IOC_NVRAM_GET_OFFSET ioctl syscall.

      Tested-by: Stan Johnson <userm57@yahoo.com>
      Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. 15 Jan, 2019 1 commit
  12. 07 Jan, 2019 1 commit
    • acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node · 8fc5c735
      Dan Williams committed
      Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
      Interface Table), is the first known instance of a memory range
      described by a unique "target" proximity domain. "Initiator" and
      "target" proximity domains are the approach that the ACPI HMAT
      (Heterogeneous Memory Attributes Table) uses to describe the unique
      performance properties of a memory range relative to a given initiator
      (e.g. CPU or DMA device).
      
      Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
      char-device follows the traditional notion of 'numa-node' where the
      attribute conveys the closest online numa-node. That numa-node attribute
      is useful for cpu-binding and memory-binding processes *near* the
      device. However, when the memory range backing a 'pmem', or 'dax' device
      is onlined (memory hot-add) the memory-only-numa-node representing that
      address needs to be differentiated from the set of online nodes. In
      other words, the numa-node association of the device depends on whether
      you can bind processes *near* the cpu-numa-node in the offline
      device-case, or bind processes *on* the memory-range directly after the
      backing address range is onlined.
      
      Allow for the case that platform firmware describes persistent memory
      with a unique proximity domain, i.e. when it is distinct from the
      proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
      numa-node translation of that proximity through the libnvdimm region
      device to namespaces that are in device-dax mode. With this in place the
      proposed kmem driver [1] can optionally discover a unique numa-node
      number for the address range as it transitions the memory from an
      offline state managed by a device-driver to an online memory range
      managed by the core-mm.
      
      [1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com

      Reported-by: Fan Du <fan.du@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  13. 06 Jan, 2019 1 commit
    • jump_label: move 'asm goto' support test to Kconfig · e9666d10
      Masahiro Yamada committed
      Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
      
      The jump label is controlled by HAVE_JUMP_LABEL, which is defined
      like this:
      
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif
      
      We can improve this by testing 'asm goto' support in Kconfig, then
      make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
      
      Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
      match the real kernel capability.

      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
  14. 04 Jan, 2019 1 commit
    • Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds committed
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.

      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 29 Dec, 2018 1 commit
  16. 22 Dec, 2018 3 commits
  17. 21 Dec, 2018 5 commits
    • powerpc/powernv/pseries: Rework device adding to IOMMU groups · c4e9d3c1
      Alexey Kardashevskiy committed
      The powernv platform registers IOMMU groups and adds devices to them
      from the pci_controller_ops::setup_bridge() hook except one case when
      virtual functions (SRIOV VFs) are added from a bus notifier.
      
      The pseries platform registers IOMMU groups from
      the pci_controller_ops::dma_bus_setup() hook and adds devices from
      the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier
      used for powernv does not add devices for pseries though as
      __of_scan_bus() adds devices first, then it does the bus/dev DMA setup.
      
      Both platforms use iommu_add_device() which takes a device and expects
      it to have a valid IOMMU table struct with an iommu_table_group pointer
      which in turn points to the iommu_group struct (which represents
      an IOMMU group). Although the helper seems easy to use, it relies on
      some pre-existing device configuration and associated data structures
      which it does not really need.
      
      This simplifies iommu_add_device() to take the table_group pointer
      directly. Pseries already has a table_group pointer handy and the bus
      notifier is not used anyway. For powernv, this copies the existing bus
      notifier, makes it work for powernv only which means an easy way of
      getting to the table_group pointer. This was tested on VFs but should
      also support physical PCI hotplug.
      
      Since iommu_add_device() receives the table_group pointer directly,
      pseries does not do TCE cache invalidation (the hypervisor does) nor
      allow multiple groups per VFIO container (in other words sharing
      an IOMMU table between partitionable endpoints), this removes
      iommu_table_group_link from pseries.

      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries: Remove IOMMU API support for non-LPAR systems · c409c631
      Alexey Kardashevskiy committed
      The pci_dma_bus_setup_pSeries and pci_dma_dev_setup_pSeries hooks are
      registered for the pseries platform which does not have FW_FEATURE_LPAR;
      these would be pre-powernv platforms which we never supported PCI pass
      through for anyway so remove it.

      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries/npu: Enable platform support · 3be2df00
      Alexey Kardashevskiy committed
      We already changed the NPU API for GPUs not to call OPAL and the remaining
      bit is initializing NPU structures.

      This searches for POWER9 NVLinks attached to any device on a PHB and
      initializes an NPU structure if any are found.

      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation · 68c0449e
      Alexey Kardashevskiy committed
      We might have memory@ nodes with "linux,usable-memory" set to zero
      (for example, to replicate powernv's behaviour for GPU coherent memory)
      which means that the memory needs an extra initialization but since
      it can be used afterwards, the pseries platform will try mapping it
      for DMA so the DMA window needs to cover those memory regions too;
      if the window cannot cover new memory regions, the memory onlining fails.
      
      This walks through the memory nodes to find the highest RAM address to
      let a huge DMA window cover that too in case this memory gets onlined
      later.

      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
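The walk over memory nodes described above amounts to taking a maximum over the end addresses of all regions, including ones that are not online yet. A minimal sketch, assuming each node is reduced to a base and size pair (`struct toy_mem_node` and `toy_max_ram_end()` are hypothetical names, not the kernel's device tree helpers):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified view of a memory@ node: start address and length (assumed). */
struct toy_mem_node {
    uint64_t base;
    uint64_t size;
};

/*
 * Highest RAM end address, starting from the end of currently online
 * memory and extending it to cover every listed node, so that a DMA
 * window sized from the result also covers memory onlined later.
 */
uint64_t toy_max_ram_end(const struct toy_mem_node *nodes, size_t count,
                         uint64_t online_max)
{
    uint64_t max_end = online_max;

    for (size_t i = 0; i < count; i++) {
        uint64_t end = nodes[i].base + nodes[i].size;

        if (end > max_end)
            max_end = end;
    }
    return max_end;
}
```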
    • powerpc/fadump: Do not allow hot-remove memory from fadump reserved area. · 0db6896f
      Mahesh Salgaonkar committed
      For fadump to work successfully there should not be any holes in reserved
      memory ranges where the kernel has asked firmware to move the content of old
      kernel memory in the event of a crash. Now that fadump uses CMA for the reserved
      area, this memory area is not protected from hot-remove operations
      unless it is CMA allocated. Hence, the fadump service can fail to re-register
      after the hot-remove operation if the hot-removed memory belongs to the fadump
      reserved region. To avoid this, make sure that memory from the fadump reserved
      area is not hot-removable if fadump is registered.
      
      However, if the user still wants to remove that memory, they can do so by
      manually stopping the fadump service before the hot-remove operation.

      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
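The rule in this commit can be sketched as a range-overlap check that only applies while fadump is registered. This is a simplified model under stated assumptions: `struct toy_fadump`, `toy_in_reserved_area()` and `toy_can_hot_remove()` are hypothetical names, and the reserved area is treated as one contiguous range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified fadump state: one contiguous reserved range (assumed). */
struct toy_fadump {
    bool registered;
    uint64_t reserve_start;
    uint64_t reserve_size;
};

/* True if [start, start + size) overlaps the reserved range. */
bool toy_in_reserved_area(const struct toy_fadump *fd,
                          uint64_t start, uint64_t size)
{
    uint64_t reserve_end = fd->reserve_start + fd->reserve_size;

    return start < reserve_end && fd->reserve_start < start + size;
}

/* Hot-remove is refused only while fadump is registered. */
bool toy_can_hot_remove(const struct toy_fadump *fd,
                        uint64_t start, uint64_t size)
{
    if (!fd->registered)
        return true;
    return !toy_in_reserved_area(fd, start, size);
}
```

Clearing `registered` in this model corresponds to stopping the fadump service manually, after which the same range becomes removable.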
  18. 09 Dec, 2018 4 commits
    • powerpc/papr_scm: Use ibm,unit-guid as the iset cookie · 43001c52
      Oliver O'Halloran committed
      The interleave set cookie is used to determine if a label stored in the
      metadata space should be applied to the current region. This is
      important in the case of NVDIMMs since the firmware may change the
      interleaving configuration of a DIMM which would invalidate the existing
      labels. In our case the hypervisor hides those details from us so we
      don't really care, but libnvdimm still requires the interleave set
      cookie to be non-zero.
      
      For our purposes we just need the set cookie to be unique and fixed for
      a given PAPR SCM region and using the unit-guid (really a UUID) is fine
      for this purpose.
      
      Fixes: b5beae5e ("powerpc/pseries: Add driver for PAPR SCM regions")
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      [mpe: Use kernel types (u64)]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/papr_scm: Fix DIMM device registration race · b0d65a8c
      Oliver O'Halloran committed
      When a new nvdimm device is registered with libnvdimm via
      nvdimm_create() it is added as a device on the nvdimm bus. The probe
      function for the DIMM driver is potentially quite slow so actually
      registering and probing the device is done in an async domain rather
      than immediately after device creation. This can result in a race where
      the region device (created 2nd) is probed first and fails to activate at
      boot.
      
      To fix this we use the same approach as the ACPI/NFIT driver which is to
      check that all the DIMM devices registered successfully. LibNVDIMM
      provides the nvdimm_bus_count_dimms() function which synchronises with
      the async domain and verifies that the dimm was successfully registered
      with the bus.
      
      If either of these does not occur then we bail.
      
      Fixes: b5beae5e ("powerpc/pseries: Add driver for PAPR SCM regions")
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/papr_scm: Remove endian conversions · 409dd7dc
      Oliver O'Halloran committed
      The return values of a h-call are returned in the CPU registers and
      written to the provided buffer by the plpar_hcall() wrapper. As a result
      the values written to memory are always in the native endian and should
      not be byte swapped.
      
      The initial implementation of the H-Call interface was done in qemu and
      the returned values were byte swapped unnecessarily in both the
      hypervisor and in the driver so this was only noticed when bringing up
      the PowerVM implementation.
      
      Fixes: b5beae5e ("powerpc/pseries: Add driver for PAPR SCM regions")
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/papr_scm: Update DT properties · 683ec0e0
      Oliver O'Halloran committed
      The ibm,unit-sizes property was originally specified as an array of two
      u32s corresponding to the memory block size, and the number of blocks
      available in that region. A fairly last-minute change to the SCM DT
      specification was splitting that into two separate u64 properties:
      ibm,block-sizes and ibm,number-of-blocks that convey the same
      information. No firmware / hypervisor that emitted the ibm,unit-size
      property ever appeared in the wild.
      
      Fixes: b5beae5e ("powerpc/pseries: Add driver for PAPR SCM regions")
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      [mpe: Use kernel types (u32/u64)]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  19. 07 Dec, 2018 2 commits
  20. 06 Dec, 2018 1 commit