1. 17 Jul 2016, 1 commit
  2. 14 Jul 2016, 3 commits
    • cxl: Add support for interrupts on the Mellanox CX4 · a2f67d5e
      Authored by Ian Munsie
      The Mellanox CX4 in cxl mode uses a hybrid interrupt model, where
      interrupts are routed from the networking hardware to the XSL using the
      MSIX table, and from there will be transformed back into an MSIX
      interrupt using the cxl style interrupts (i.e. using IVTE entries and
      ranges to map a PE and AFU interrupt number to an MSIX address).
      
      We want to hide the implementation details of cxl interrupts as much as
      possible. To this end, we use a special version of the MSI setup &
      teardown routines in the PHB while in cxl mode to allocate the cxl
      interrupts and configure the IVTE entries in the process element.
      
      This function does not configure the MSIX table - the CX4 card uses a
      custom format in that table and it would not be appropriate to fill that
      out in generic code. The rest of the functionality is similar to the
      "Full MSI-X mode" described in the CAIA, and this could be easily
      extended to support other adapters that use that mode in the future.
      
      The interrupts will be associated with the default context. If the
      maximum number of interrupts per context has been limited (e.g. by the
      mlx5 driver), it will automatically allocate additional kernel contexts
      to associate extra interrupts as required. These contexts will be
      started using the same WED that was used to start the default context.
      Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
      Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a2f67d5e
    • powerpc/powernv: Add support for the cxl kernel api on the real phb · 4361b034
      Authored by Ian Munsie
      This adds support for the peer model of the cxl kernel api to the
      PowerNV PHB, in which physical function 0 represents the cxl function on
      the card (an XSL in the case of the CX4), which other physical functions
      will use for memory access and interrupt services. It is referred to as
      the peer model as these functions are peers of one another, as opposed
      to the Virtual PHB model which forms a hierarchy.
      
      This patch exports APIs to enable the peer mode, check if a PCI device
      is attached to a PHB in this mode, and to set and get the peer AFU for
      this mode.
      
      The cxl driver will enable this mode for supported cards by calling
      pnv_cxl_enable_phb_kernel_api(). This will set a flag in the PHB to note
      that this mode is enabled, and switch out its controller_ops for the
      cxl version.
      
      The cxl version of the controller_ops struct implements its own
      versions of enable_device_hook and release_device to handle
      refcounting on the peer AFU and to allocate a default context for the
      device.
      
      Once enabled, the cxl kernel API may not be disabled on a PHB. Currently
      there is no safe way to disable cxl mode short of a reboot, so until
      that changes there is no reason to support the disable path.
      Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
      Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4361b034
    • powerpc/powernv: Split cxl code out into a separate file · f456834a
      Authored by Ian Munsie
      The support for using the Mellanox CX4 in cxl mode will require
      additions to the PHB code. In preparation for this, move the existing
      cxl code out of pci-ioda.c into a separate pci-cxl.c file to keep things
      more organised.
      Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
      Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f456834a
  3. 21 Jun 2016, 3 commits
    • powerpc/powernv: Dynamically release PE · c5f7700b
      Authored by Gavin Shan
      This supports releasing PEs dynamically. A reference count is
      introduced to the PE, representing the number of PCI devices
      associated with it. The count is increased when a PCI device
      joins the PE and decreased when a PCI device leaves the PE in
      pnv_pci_release_device(). When the count becomes zero, the PE
      and its consumed resources are released. Note that the count
      is never accessed concurrently, so a plain "int" counter is
      enough here.
      
      In order to release the resources consumed by the PE, a couple of
      helper functions are introduced as below:
      
         * pnv_pci_ioda1_unset_window() - Unset IODA1 DMA32 window
         * pnv_pci_ioda1_release_dma_pe() - Release IODA1 DMA32 segments
         * pnv_pci_ioda2_release_dma_pe() - Release IODA2 DMA resource
         * pnv_ioda_release_pe_seg() - Unmap IO/M32/M64 segments
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c5f7700b
    • powerpc/powernv: Setup PE for root bus · 63803c39
      Authored by Gavin Shan
      There is no parent bridge for the root bus, meaning pcibios_setup_bridge()
      isn't invoked for it. The PE for the root bus is the ancestor of the
      other PEs in the PELTV, which means it must be populated before all
      the others.
      
      This populates the PE for the root bus in the pcibios_setup_bridge()
      path if it hasn't been populated yet. The PE number next to the reserved
      one is used as the PE# to avoid holes in the contiguous M64 space.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      63803c39
    • powerpc/powernv: Increase PE# capacity · c127562a
      Authored by Gavin Shan
      Each PHB maintains an array helping to translate the 2-byte Request
      ID (RID) to a PE#, with the assumption that the PE# takes one byte,
      meaning that we can't have more than 256 PEs. However, pci_dn->pe_number
      already has 4 bytes for the PE#.
      
      This extends the PE# capacity for every PHB. After that, the PE number
      is represented by a 4-byte value. Then we can reuse IODA_INVALID_PE to
      check whether the PE# in phb->pe_rmap[] is valid or not.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c127562a
  4. 11 May 2016, 13 commits
    • powerpc/powernv/npu: Enable NVLink pass through · b5cb9ab1
      Authored by Alexey Kardashevskiy
      IBM POWER8 NVLink systems come with Tesla K40-ish GPUs, each of which
      also has a couple of high speed links (NVLink). The interface to the
      links is exposed as an emulated PCI bridge which is included in the same
      IOMMU group as the corresponding GPU.
      
      In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE
      which behave pretty much like the standard IODA2 PHB, except that the
      NPU PHB has just a single TVE in the hardware, which means it can have
      either a 32bit window or a 64bit window or DMA bypass, but never two of these.
      
      In order to make these links work when the GPU is passed through to
      the guest, these bridges need to be passed through as well; otherwise
      performance will degrade.
      
      This implements and exports an API to manage NPU state in regard to
      VFIO; it replicates iommu_table_group_ops.
      
      This defines a new pnv_pci_ioda2_npu_ops which is assigned to
      the IODA2 bridge if there are NPUs for a GPU on the bridge.
      The new callbacks call the default IODA2 callbacks plus the new NPU API.
      This adds a gpe_table_group_to_npe() helper to find the NPU PE for the
      IODA2 table_group; it is not expected to fail as the helper is only
      called from pnv_pci_ioda2_npu_ops.
      
      This does not define an NPU-specific .release_ownership(), so after
      VFIO is finished, DMA on the NPU is disabled, which is ok as the nvidia
      driver sets the DMA mask when probing, which enables 32 or 64bit DMA on
      the NPU.
      
      This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to
      the GPU group if any are found. The helper looks for
      the "ibm,gpu" property in the device tree, which is a phandle of
      the corresponding GPU.
      
      This adds an additional loop over PEs in pnv_ioda_setup_dma(), as the
      main loop skips NPU PEs since they do not have 32bit DMA segments.
      
      As pnv_npu_set_window() and pnv_npu_unset_window() are now used
      by the new IODA2-NPU IOMMU group, this makes the helpers public and
      adds a DMA window number parameter.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      [mpe: Add pnv_pci_ioda_setup_iommu_api() to fix build with IOMMU_API=n]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      b5cb9ab1
    • powerpc/powernv/npu: Rework TCE Kill handling · 85674868
      Authored by Alexey Kardashevskiy
      The pnv_ioda_pe struct keeps an array of peers. At the moment it is only
      used to link a GPU and an NPU for 2 purposes:
      
      1. Accessing the NPU quickly when configuring DMA for the GPU - this was
      addressed in the previous patch by removing its use, as DMA setup is
      not something the kernel does constantly.
      
      2. Invalidating the TCE cache for the NPU when it is invalidated for
      the GPU. The GPU and the NPU are in different PEs. There is already a
      mechanism to attach multiple iommu_table_group structs to the same
      iommu_table (used for VFIO), so this patch reuses it here.
      
      This gets rid of the peers[] array and the PNV_IODA_PE_PEER flag as
      they are not needed anymore.
      
      While we are here, add TCE cache invalidation after enabling bypass.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      85674868
    • powerpc/powernv/ioda2: Export debug helper pe_level_printk() · 7d623e42
      Authored by Alexey Kardashevskiy
      This exports the debugging helper pe_level_printk() and corresponding
      macros so they can be used in npu-dma.c.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      7d623e42
    • powerpc/powernv/npu: Simplify DMA setup · f9f83456
      Authored by Alexey Kardashevskiy
      NPU devices are emulated in firmware and mainly used for NPU NVLink
      training; there is one NPU device per hardware link. Their DMA/TCE setup
      must match the GPU which is connected via PCIe and NVLink, so any changes
      to the DMA/TCE setup on the GPU PCIe device need to be propagated to
      the NVLink device, as this is what device drivers expect and it doesn't
      make much sense to do anything else.
      
      This makes NPU DMA setup explicit.
      pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda,
      made static, and now prints a warning, as dma_set_mask() should never be
      called on an NPU device since in any case it will not configure the GPU;
      so we make this explicit.
      
      Instead of using PNV_IODA_PE_PEER and peers[] (which the next patch will
      remove), we test every PCI device for corresponding NVLink
      devices. If there are any, we propagate bypass mode to the NPU
      devices just found by calling the setup helper directly (which takes
      @bypass) and avoid guessing (i.e. calculating from the DMA mask) whether
      we need bypass or not on the NPU devices. Since DMA setup happens on
      very rare occasions, this will not slow down booting or VFIO start/stop much.
      
      This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it
      clearer what the function really does, which is programming the 32bit
      table address into the TVT ("disabling bypass" means writing zeroes to
      the TVT).
      
      This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as
      the DMA configuration on NPU does not matter until dma_set_mask() is
      called on GPU and that will do the NPU DMA configuration.
      
      This removes phb->dma_dev_setup initialization for NPU as
      pnv_pci_ioda_dma_dev_setup is no-op for it anyway.
      
      This stops using npe->tce_bypass_base as it never changes and values
      other than zero are not supported.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f9f83456
    • powerpc/powernv/npu: TCE Kill helpers cleanup · 0bbcdb43
      Authored by Alexey Kardashevskiy
      The NPU PHB TCE Kill register is exactly the same as in the rest of
      POWER8, so let's reuse the existing code for the NPU. The only bit
      missing is a helper to reset the entire TCE cache, so this moves such
      a helper from the NPU code and renames it.
      
      Since pnv_npu_tce_invalidate() really does invalidate the entire cache,
      this uses pnv_pci_ioda2_tce_invalidate_entire() directly for the NPU.
      This adds an explicit comment for the workaround for invalidating the
      NPU TCE cache.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      0bbcdb43
    • powerpc/powernv: Use PE instead of number during setup and release · 1e916772
      Authored by Gavin Shan
      In the current implementation, the PEs that are allocated or picked
      from the reserved list are identified by PE number, and the PE instance
      has to be looked up from the PE number eventually. We have the
      same issue when a PE is released.
      
      This makes pnv_ioda_pick_m64_pe() and pnv_ioda_alloc_pe() return the
      PE instance so that pnv_ioda_setup_bus_PE() can use the allocated
      or reserved PE instance directly. Also, pnv_ioda_setup_bus_PE() now
      returns the reserved/allocated PE instance, to be used in subsequent
      patches. On the other hand, pnv_ioda_free_pe() uses the PE instance
      (not number) as its argument. No logical changes introduced.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1e916772
    • powerpc/powernv/ioda1: Improve DMA32 segment track · 2b923ed1
      Authored by Gavin Shan
      In the current implementation, the DMA32 segments required by one
      specific PE aren't calculated from the information held in the PE
      independently. This conflicts with the PCI hotplug design, which is PE
      centralized, meaning the PE's DMA32 segments should be calculated from
      the information held in the PE independently.
      
      This introduces an array (@dma32_segmap) for every PHB to track
      DMA32 segment usage. Besides, this moves the logic calculating a PE's
      consumed DMA32 segments to pnv_pci_ioda1_setup_dma_pe() so that the
      PE's DMA32 segments are calculated/allocated from the information held
      in the PE (DMA32 weight). Also the logic is improved: we try to allocate
      as many DMA32 segments as we can. It's acceptable for fewer DMA32
      segments than the expected number to be allocated.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2b923ed1
    • powerpc/powernv: Remove DMA32 PE list · 801846d1
      Authored by Gavin Shan
      PEs are put on the PHB DMA32 list (phb->ioda.pe_dma_list) according
      to their DMA32 weight. The PEs on the list are iterated over to set up
      their TCE32 tables at system boot time. The list is used only once, at
      boot time, so there is no need to keep it.
      
      This moves the logic calculating the DMA32 weight of the PHB and PEs
      to pnv_ioda_setup_dma() in order to drop the PHB's DMA32 list. Also,
      @tce32_seg and @tce32_segcount, by which every PE traced its consumed
      DMA32 segments, are now useless and are removed.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      801846d1
    • powerpc/powernv: Track M64 segment consumption · 93289d8c
      Authored by Gavin Shan
      When unplugging PCI devices, their parent PEs might be offlined.
      The M64 resources consumed by the PEs should be released at that
      time. As we already track M32 segment consumption, this introduces
      an array in the PHB to track the mapping between M64 segments and
      PE numbers.
      
      Note: the M64 mapping isn't covered by pnv_ioda_setup_pe_seg() as
      IODA2 doesn't support the mapping explicitly, while it is supported
      on IODA1. Until now, no M64 is supported on IODA1 in software.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      93289d8c
    • powerpc/powernv: Data type unsigned int for PE number · 689ee8c9
      Authored by Gavin Shan
      This changes the data type of the PE number from "int" to "unsigned int"
      in order to match the fact that a PE number is never negative:
      
         * The number of the PE to which the specified PCI device is attached.
         * The PE number map for SRIOV VFs.
         * The PE number returned from pnv_ioda_alloc_pe().
         * The PE number returned from pnv_ioda2_pick_m64_pe().
      Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      689ee8c9
    • powerpc/powernv: Rename PE# fields in struct pnv_phb · 92b8f137
      Authored by Gavin Shan
      This renames the fields related to the PE number in "struct pnv_phb"
      to better reflect their usage, as Alexey suggested. No
      logical changes introduced.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      92b8f137
    • powerpc/powernv: Reorder fields in struct pnv_phb · 13ce7598
      Authored by Gavin Shan
      This reorders the fields in struct pnv_phb that are related to PE
      allocation so they sit together. No logical change.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      13ce7598
    • powerpc/powernv: Drop phb->bdfn_to_pe() · 475d92c2
      Authored by Gavin Shan
      The last usage of pnv_phb::bdfn_to_pe() was removed in
      ff57b454 ("powerpc/eeh: Do probe on pci_dn"), so drop it.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      475d92c2
  5. 15 Feb 2016, 1 commit
  6. 08 Feb 2016, 1 commit
  7. 17 Dec 2015, 1 commit
    • powerpc/powernv: Add support for Nvlink NPUs · 5d2aa710
      Authored by Alistair Popple
      NVLink is a high speed interconnect that is used in conjunction with a
      PCI-E connection to create an interface between CPU and GPU that
      provides very high data bandwidth. A PCI-E connection to a GPU is used
      as the control path to initiate and report status of large data
      transfers sent via the NVLink.
      
      On IBM Power systems the NVLink processing unit (NPU) is similar to
      the existing PHB3. This patch adds support for a new NPU PHB type. DMA
      operations on the NPU are not supported as this patch sets the TCE
      translation tables to be the same as the related GPU PCIe device for
      each NVLink. Therefore all DMA operations are setup and controlled via
      the PCIe device.
      
      EEH is not presently supported for the NPU devices, although it may be
      added in future.
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5d2aa710
  8. 18 Aug 2015, 1 commit
    • powerpc/powernv: move dma_get_required_mask from pnv_phb to pci_controller_ops · 53522982
      Authored by Andrew Donnellan
      Simplify the dma_get_required_mask call chain by moving it from pnv_phb to
      pci_controller_ops, similar to commit 763d2d8d ("powerpc/powernv:
      Move dma_set_mask from pnv_phb to pci_controller_ops").
      
      Previous call chain:
      
        0) call dma_get_required_mask() (kernel/dma.c)
        1) call ppc_md.dma_get_required_mask, if it exists. On powernv, that
           points to pnv_dma_get_required_mask() (platforms/powernv/setup.c)
        2) device is PCI, therefore call pnv_pci_dma_get_required_mask()
           (platforms/powernv/pci.c)
        3) call phb->dma_get_required_mask if it exists
        4) it only exists in the ioda case, where it points to
             pnv_pci_ioda_dma_get_required_mask() (platforms/powernv/pci-ioda.c)
      
      New call chain:
      
        0) call dma_get_required_mask() (kernel/dma.c)
        1) device is PCI, therefore call pci_controller_ops.dma_get_required_mask
           if it exists
        2) in the ioda case, that points to pnv_pci_ioda_dma_get_required_mask()
           (platforms/powernv/pci-ioda.c)
      
      In the p5ioc2 case, the call chain remains the same -
      dma_get_required_mask() does not find either a ppc_md call or
      pci_controller_ops call, so it calls __dma_get_required_mask().
      Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Reviewed-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      53522982
  9. 13 Jul 2015, 3 commits
  10. 11 Jun 2015, 5 commits
    • powerpc/iommu/powernv: Release replaced TCE · 05c6cfb9
      Authored by Alexey Kardashevskiy
      At the moment, writing a new TCE value to the IOMMU table fails with
      EBUSY if there is a valid entry already. However, the PAPR specification
      allows the guest to write a new TCE value without clearing the old one
      first.
      
      Another problem this patch addresses is the use of pool locks for
      external IOMMU users such as VFIO. The pool locks are there to protect
      the DMA page allocator rather than the entries, and since the host
      kernel does not control what pages are in use, there is no point in
      pool locks; exchange()+put_page(oldtce) is sufficient to avoid possible
      races.
      
      This adds an exchange() callback to iommu_table_ops which does the same
      thing as set() plus it returns the replaced TCE and DMA direction so
      the caller can release the pages afterwards. exchange() receives
      a physical address, unlike set() which receives a linear mapping
      address, and it returns a physical address, as clear() does.
      
      This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
      for a platform to have exchange() implemented in order to support VFIO.
      
      This replaces iommu_tce_build() and iommu_clear_tce() with
      a single iommu_tce_xchg().
      
      This makes sure that TCE permission bits are not set in TCE passed to
      IOMMU API as those are to be calculated by platform code from
      DMA direction.
      
      This moves SetPageDirty() to the IOMMU code to make it work for both
      the VFIO ioctl interface and in-kernel TCE acceleration (when it
      becomes available later).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      05c6cfb9
    • powerpc/powernv/ioda2: Move TCE kill register address to PE · 5780fb04
      Authored by Alexey Kardashevskiy
      At the moment the DMA setup code looks for the "ibm,opal-tce-kill"
      property which contains the TCE kill register address. Writing to
      this register invalidates TCE cache on IODA/IODA2 hub.
      
      This moves the register address from iommu_table to pnv_phb as this
      register belongs to the PHB and invalidates the TCE cache for all
      tables of all attached PEs.
      
      This moves the property reading/remapping code to a helper which is
      called when DMA is being configured for PE and which does DMA setup
      for both IODA1 and IODA2.
      
      This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which
      invalidates cache for the entire table. It should be called after
      every call to opal_pci_map_pe_dma_window(). It was not required before
      because there was just a single TCE table and 64bit DMA was handled via
      bypass window (which has no table so no cache was used) but this is going
      to change with Dynamic DMA windows (DDW).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5780fb04
    • powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group · 0eaf4def
      Authored by Alexey Kardashevskiy
      So far one TCE table could only be used by one IOMMU group. However,
      IODA2 hardware allows programming the same TCE table address into
      multiple PEs, allowing tables to be shared.
      
      This replaces the single pointer to a group in the iommu_table struct
      with a linked list of groups, which provides a way of invalidating
      the TCE cache for every PE when an actual TCE table is updated. This
      adds pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
      helpers to manage the list. However, without VFIO, it is still going
      to be a single IOMMU group per iommu_table.
      
      This changes iommu_add_device() to add a device to the first group
      in a table's group list, as it is only called from the platform
      init code or the PCI bus notifier, and at those moments there is
      only one group per table.
      
      This does not change TCE invalidation code to loop through all
      attached groups in order to simplify this patch and because
      it is not really needed in most cases. IODA2 is fixed in a later
      patch.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      0eaf4def
    • powerpc/spapr: vfio: Replace iommu_table with iommu_table_group · b348aa65
      Authored by Alexey Kardashevskiy
      Modern IBM POWERPC systems support multiple (currently two) TCE tables
      per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
      for TCE tables. Right now just one table is supported.
      
      This defines iommu_table_group struct which stores pointers to
      iommu_group and iommu_table(s). This replaces iommu_table with
      iommu_table_group where iommu_table was used to identify a group:
      - iommu_register_group();
      - iommudata of generic iommu_group;
      
      This removes @data from iommu_table as it_table_group provides the
      same access to pnv_ioda_pe.
      
      For IODA, instead of embedding iommu_table, the new iommu_table_group
      keeps pointers to the tables. The iommu_table structs are allocated
      dynamically.
      
      For P5IOC2, both iommu_table_group and iommu_table are embedded into
      PE struct. As there is no EEH and SRIOV support for P5IOC2,
      iommu_free_table() should not be called on iommu_table struct pointers
      so we can keep it embedded in pnv_phb::p5ioc2.
      
      For pSeries, this replaces multiple calls to kzalloc_node() with a new
      iommu_pseries_alloc_group() helper and stores the table group struct
      pointer in the pci_dn struct. For release, an iommu_table_free_group()
      helper is added.
      
      This moves iommu_table struct allocation from SR-IOV code to
      the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
      pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
      This change is here because those lines had to be changed anyway.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      b348aa65
    • powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table · da004c36
      Authored by Alexey Kardashevskiy
      This adds an iommu_table_ops struct and puts a pointer to it into
      the iommu_table struct. This moves the tce_build/tce_free/tce_get/
      tce_flush callbacks from ppc_md to the new struct, where they really
      belong.
      
      This adds the requirement for @it_ops to be initialized before calling
      iommu_init_table(), to make sure that we do not leave any IOMMU table
      with iommu_table_ops uninitialized. This is not a parameter of
      iommu_init_table() though, as there will be cases when iommu_init_table()
      will not be called on TCE tables, for example - VFIO.
      
      This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_"
      redundant prefixes.
      
      This removes tce_xxx_rm handlers from ppc_md but does not add
      them to iommu_table_ops as this will be done later if we decide to
      support TCE hypercalls in real mode. This removes _vm callbacks as
      only virtual mode is supported by now so this also removes @rm parameter.
      
      For pSeries, this always uses tce_buildmulti_pSeriesLP/
      tce_freemulti_pSeriesLP. This changes the multi callbacks to fall back
      to tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
      present. The reason for this is that we still have to support the
      "multitce=off" boot parameter in disable_multitce() and we do not want
      to walk through all IOMMU tables in the system and replace "multi"
      callbacks with single ones.
      
      For powernv, this defines _ops per PHB type which are P5IOC2/IODA1/IODA2.
      This makes the callbacks for them public. Later patches will extend
      callbacks for IODA1/2.
      
      No change in behaviour is expected.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      da004c36
  11. 03 Jun 2015, 1 commit
  12. 02 Jun 2015, 2 commits
    • powerpc/powernv: Move dma_set_mask() from pnv_phb to pci_controller_ops · 763d2d8d
      Authored by Daniel Axtens
      Previously, dma_set_mask() on powernv was convoluted:
       0) Call dma_set_mask() (a/p/kernel/dma.c)
       1) In dma_set_mask(), ppc_md.dma_set_mask() exists, so call it.
       2) On powernv, that function pointer is pnv_dma_set_mask().
          In pnv_dma_set_mask(), the device is pci, so call pnv_pci_dma_set_mask().
       3) In pnv_pci_dma_set_mask(), call pnv_phb->set_dma_mask() if it exists.
       4) It only exists in the ioda case, where it points to
          pnv_pci_ioda_dma_set_mask(), which is the final function.
      
      So the call chain is:
       dma_set_mask() ->
        pnv_dma_set_mask() ->
         pnv_pci_dma_set_mask() ->
          pnv_pci_ioda_dma_set_mask()
      
      Both ppc_md and pnv_phb function pointers are used.
      
      Rip out the ppc_md call, pnv_dma_set_mask() and pnv_pci_dma_set_mask().
      
      Instead:
       0) Call dma_set_mask() (a/p/kernel/dma.c)
       1) In dma_set_mask(), the device is pci, and pci_controller_ops.dma_set_mask()
          exists, so call pci_controller_ops.dma_set_mask().
       2) In the ioda case, that points to pnv_pci_ioda_dma_set_mask().
      
      The new call chain is
       dma_set_mask() ->
        pnv_pci_ioda_dma_set_mask()
      
      Now only the pci_controller_ops function pointer is used.
      
      The fallback paths for p5ioc2 are the same.
      
      Previously, pnv_pci_dma_set_mask() would find no pnv_phb->set_dma_mask()
      function, so it would call __set_dma_mask().
      
      Now, dma_set_mask() finds no ppc_md call or pci_controller_ops call,
      so it calls __set_dma_mask().
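      
      The before/after difference is one fewer level of function-pointer
      indirection. A toy C model of the new single-hop dispatch (all names
      and types here are simplified stand-ins for the kernel's, not the
      real signatures):
      
      ```c
      #include <assert.h>
      #include <stddef.h>
      
      typedef unsigned long long u64;
      
      /* Stand-in for a struct device; via_controller records which path
       * was taken, purely for illustration. */
      struct dummy_dev { u64 mask; int via_controller; };
      
      struct pci_controller_ops {
      	int (*dma_set_mask)(struct dummy_dev *dev, u64 mask);
      };
      
      /* The final function in the ioda case. */
      static int pnv_pci_ioda_dma_set_mask(struct dummy_dev *dev, u64 mask)
      {
      	dev->mask = mask;
      	dev->via_controller = 1;	/* reached via the single hop */
      	return 0;
      }
      
      /* Generic fallback when no controller op is registered. */
      static int __set_dma_mask(struct dummy_dev *dev, u64 mask)
      {
      	dev->mask = mask;
      	dev->via_controller = 0;
      	return 0;
      }
      
      /* One hop instead of four: controller op if present, else fallback. */
      static int dma_set_mask(struct pci_controller_ops *ops,
      			struct dummy_dev *dev, u64 mask)
      {
      	if (ops && ops->dma_set_mask)
      		return ops->dma_set_mask(dev, mask);
      	return __set_dma_mask(dev, mask);
      }
      ```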
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      763d2d8d
    • D
      powerpc/powernv: Specialise pci_controller_ops for each controller type · 92ae0353
      Committed by Daniel Axtens
      Remove the generic powernv PCI controller operations and replace them
      with controller ops for each of the two supported PHB types.
      
      As an added bonus, make the two new structs const, which will help
      guard against bugs such as the one introduced in 65ebf4b6
      ("powerpc/powernv: Move controller ops from ppc_md to controller_ops")
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      92ae0353
  13. 31 Mar 2015, 2 commits
    • W
      powerpc/powernv: Shift VF resource with an offset · 781a868f
      Committed by Wei Yang
      On the PowerNV platform, the position of a resource in an M64 BAR
      implies the PE# the resource belongs to. In some cases a resource must
      be adjusted to place it at the correct position in the M64 BAR.
      
      This patch adds pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR
      address according to an offset.
      
      Note:
      
          After doing so, there will be a "hole" in /proc/iomem when the
          offset is a positive value. It looks as if the device returned some
          MMIO space to the system that no one can actually use.
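      
          The shift itself is simple arithmetic. A hedged sketch: the real
          pnv_pci_vf_resource_shift() does considerably more bookkeeping,
          and vf_size/offset here are illustrative parameters, but the core
          relocation looks like this:
      
          ```c
          #include <assert.h>
      
          typedef unsigned long long u64;
      
          /* Minimal model of a VF BAR resource. */
          struct vf_res { u64 start; u64 end; };
      
          /* Move the 'real' PF IOV BAR window by offset M64 segments of
           * vf_size bytes each, so that segment N of the M64 BAR lines up
           * with PE# N. With a positive offset the skipped segments become
           * the /proc/iomem "hole" described above. */
          static void shift_vf_resource(struct vf_res *res, u64 vf_size,
          			      long offset)
          {
          	long long shift = (long long)offset * (long long)vf_size;
      
          	res->start += shift;
          	res->end += shift;
          }
          ```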
      
      [bhelgaas: rework loops, rework overlap check, index resource[]
      conventionally, remove pci_regs.h include, squashed with next patch]
      Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      781a868f
    • W
      powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically · 9e8d4a19
      Committed by Wei Yang
      Previously the iommu_table had the same lifetime as a struct pnv_ioda_pe
      and was embedded in it. The pnv_ioda_pe was assigned to a PE at boot
      time. Since PEs are based on the hardware layout, which is static in the
      system, they will never get released. This means the iommu_table in the
      pnv_ioda_pe will never get released either.
      
      This no longer works for VF PE. VF PEs are created and released dynamically
      when VFs are created and released. So we need to assign pnv_ioda_pe to VF
      PEs respectively when VFs are enabled and clean up those resources for VF
      PE when VFs are disabled. And iommu_table is one of the resources we need
      to handle dynamically.
      
      Currently iommu_table is a static field in pnv_ioda_pe, which causes a
      problem when freeing it. During the disabling of a VF,
      pnv_pci_ioda2_release_dma_pe() calls iommu_free_table() to release the
      iommu_table for the PE, and iommu_free_table() fails on a statically
      embedded table.
      
      Given these requirements, this patch allocates the iommu_table
      dynamically.
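      
      The lifetime change can be shown with a small C sketch. Field and
      function names are simplified stand-ins (the kernel uses kzalloc and
      a far fuller iommu_table), but the point is the same: a pointer that
      can be freed versus an embedded struct that cannot:
      
      ```c
      #include <assert.h>
      #include <stdlib.h>
      
      /* Reduced stand-ins for the kernel structures. */
      struct iommu_table { int dummy; };
      
      struct pnv_ioda_pe {
      	struct iommu_table *tbl;  /* was: struct iommu_table tbl; embedded */
      };
      
      /* VF enable path: allocate the table dynamically. */
      static int pe_enable(struct pnv_ioda_pe *pe)
      {
      	pe->tbl = calloc(1, sizeof(*pe->tbl));
      	return pe->tbl ? 0 : -1;
      }
      
      /* VF disable path: the release code can now legitimately free the
       * table; calling free() on an embedded member would be undefined
       * behaviour, which is exactly the failure described above. */
      static void pe_disable(struct pnv_ioda_pe *pe)
      {
      	free(pe->tbl);
      	pe->tbl = NULL;
      }
      ```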
      Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      9e8d4a19
  14. 24 Mar 2015, 1 commit
  15. 17 Mar 2015, 2 commits