1. 28 May, 2020 11 commits
  2. 11 May, 2020 2 commits
  3. 04 Mar, 2020 1 commit
  4. 23 Jan, 2020 12 commits
  5. 07 Jan, 2020 1 commit
  6. 06 Jan, 2020 1 commit
    • powerpc/powernv/iov: Ensure the pdn for VFs always contains a valid PE number · 3b5b9997
      Oliver O'Halloran committed
      On pseries there is a bug with adding hotplugged devices to an IOMMU
      group. For a number of dumb reasons fixing that bug first requires
      re-working how VFs are configured on PowerNV. For background, on
      PowerNV we use the pcibios_sriov_enable() hook to do two things:
      
        1. Create a pci_dn structure for each of the VFs, and
        2. Configure the PHB's internal BARs so the MMIO range for each VF
           maps to a unique PE.
      
      Roughly speaking a PE is the hardware counterpart to a Linux IOMMU
      group since all the devices in a PE share the same IOMMU table. A PE
      also defines the set of devices that should be isolated in response to
      a PCI error (i.e. bad DMA, UR/CA, AER events, etc). When isolated all
      MMIO and DMA traffic to and from devices in the PE is blocked by the
      root complex until the PE is recovered by the OS.
      
      The requirement to block MMIO causes a giant headache because the P8
      PHB generally uses a fixed mapping between MMIO addresses and PEs. As
      a result we need to delay configuring the IOMMU groups for devices
      until after MMIO resources are assigned. For physical devices (i.e.
      non-VFs) the PE assignment is done in pcibios_setup_bridge() which is
      called immediately after the MMIO resources for downstream
      devices (and the bridge's windows) are assigned. For VFs the setup is
      more complicated because:
      
        a) pcibios_setup_bridge() is not called again when VFs are activated, and
        b) The pci_devs for VFs are created by generic code which runs after
           pcibios_sriov_enable() is called.
      
      The workaround for this is a two-step process:
      
        1. A fixup in pcibios_add_device() is used to initialise the cached
           pe_number in pci_dn, then
        2. A bus notifier then adds the device to the IOMMU group for the PE
           specified in pci_dn->pe_number.
      
      A side effect of fixing the pseries bug mentioned in the first paragraph
      is moving the fixup out of pcibios_add_device() and into
      pcibios_bus_add_device(), which is called much later. This results in
      step 2. failing because pci_dn->pe_number won't be initialised when
      the bus notifier is run.
      
      We can fix this by removing the need for the fixup. The PE for a VF is
      known before the VF is even scanned so we can initialise
      pci_dn->pe_number in pcibios_sriov_enable() instead. Unfortunately,
      moving the initialisation causes two problems:
      
        1. We trip the WARN_ON() in the current fixup code, and
        2. The EEH core clears pdn->pe_number when recovering a VF and
           relies on the fixup to correctly re-set it.
      
      The only justification for either of these is a comment in
      eeh_rmv_device() suggesting that pdn->pe_number *must* be set to
      IODA_INVALID_PE in order for the VF to be scanned. However, this
      comment appears to have no basis in reality. Both bugs can be fixed by
      just deleting the code.
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20191028085424.12006-1-oohall@gmail.com
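
The ordering argument in this commit can be illustrated with a small userspace model. This is only a sketch, not the kernel code: `pdn_model`, `model_sriov_enable()`, `model_bus_notifier()` and `MODEL_INVALID_PE` are hypothetical names, and the assumption that VFs receive consecutive PE numbers is made purely for illustration. It shows why setting the PE number at SR-IOV enable time lets the bus notifier succeed without relying on a later fixup.

```c
#include <assert.h>

/* Hypothetical "no PE assigned" marker; stands in for IODA_INVALID_PE. */
#define MODEL_INVALID_PE (-1)

/* Hypothetical, stripped-down stand-in for the per-device pci_dn. */
struct pdn_model {
	int pe_number;
};

/*
 * Model of the new behaviour: the PE for each VF is known before the VF
 * is scanned, so it is recorded when SR-IOV is enabled.
 * Assumption: VF i gets PE first_pe + i.
 */
static void model_sriov_enable(struct pdn_model *vfs, int num_vfs, int first_pe)
{
	assert(num_vfs >= 0 && first_pe >= 0);
	for (int i = 0; i < num_vfs; i++)
		vfs[i].pe_number = first_pe + i;
}

/*
 * Model of the bus notifier: returns the PE whose IOMMU group the device
 * would join, or -1 if the PE number was never initialised.
 */
static int model_bus_notifier(const struct pdn_model *pdn)
{
	if (pdn->pe_number == MODEL_INVALID_PE)
		return -1;
	return pdn->pe_number;
}
```

In this model the notifier fails for a VF only when it runs before `model_sriov_enable()`, which mirrors the failure described for step 2 above.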
  7. 13 Nov, 2019 1 commit
  8. 30 Aug, 2019 2 commits
  9. 19 Aug, 2019 1 commit
    • powerpc/powernv/ioda2: Create bigger default window with 64k IOMMU pages · 201ed7f3
      Alexey Kardashevskiy committed
      At the moment we create a small window only for 32bit devices, the window
      maps 0..2GB of the PCI space only. For other devices we either use
      a sketchy bypass or hardware bypass but the former can only work if
      the amount of RAM is no bigger than the device's DMA mask and the latter
      requires devices to support at least 59bit DMA.
      
      This extends the default DMA window to the maximum size possible to allow
      a wider DMA mask than just 32bit. The default window size is now limited
      by the iommu_table::it_map allocation bitmap which is a contiguous
      array, 1 bit per IOMMU page.
      
      This increases the default IOMMU page size from hard coded 4K to
      the system page size to allow wider DMA masks.
      
      This increases the number of levels so as not to exceed the max order
      allocation limit per TCE level. At the same time, this keeps the
      minimum number of levels at 2 in order to save memory.
      
      As the extended window now overlaps the 32bit MMIO region, this adds
      an area reservation to iommu_init_table().
      
      After this change the default window size is 0x80000000000==1<<43 so
      devices limited to DMA mask smaller than the amount of system RAM can
      still use more than just 2GB of memory for DMA.
      
      This is an optimization and not a bug fix for DMA API usage.
      
      With the on-demand allocation of indirect TCE table levels enabled and
      2 levels, the first TCE level size is just
      1<<ceil((log2(0x7ffffffffff+1)-16)/2)=16384 TCEs or 2 system pages.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-5-aik@ozlabs.ru
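
The arithmetic quoted in this commit message can be checked with a short userspace sketch. The helper names below are hypothetical and are not kernel functions. The 8-byte TCE size is an assumption, chosen because it is consistent with "16384 TCEs or 2 system pages" for 64K pages.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest n such that (1 << n) >= v, i.e. ceil(log2(v)) for v >= 1. */
static unsigned int ceil_log2(uint64_t v)
{
	unsigned int n = 0;

	while (n < 63 && (UINT64_C(1) << n) < v)
		n++;
	return n;
}

/* Number of TCEs in the first level of a table split into `levels` levels. */
static uint64_t first_level_tces(uint64_t window_size, unsigned int page_shift,
				 unsigned int levels)
{
	unsigned int window_shift = ceil_log2(window_size);
	unsigned int entries_shift;

	assert(levels > 0 && window_shift >= page_shift);
	entries_shift = window_shift - page_shift;
	/* Integer form of ceil(entries_shift / levels). */
	return UINT64_C(1) << ((entries_shift + levels - 1) / levels);
}

/* Bytes needed for an allocation bitmap with one bit per IOMMU page. */
static uint64_t bitmap_bytes(uint64_t window_size, unsigned int page_shift)
{
	return (window_size >> page_shift) / 8;
}
```

For a window of 0x80000000000 bytes (1<<43), 64K pages (shift 16) and 2 levels, `first_level_tces()` gives 16384, which at 8 bytes per TCE is 128K, or two 64K pages. The same window needs a 16 MiB bitmap at one bit per 64K page, which is why the bitmap becomes the limiting factor for the window size.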
  10. 03 Jul, 2019 1 commit
    • powerpc/powernv: Fix stale iommu table base after VFIO · 5636427d
      Alexey Kardashevskiy committed
      The powernv platform uses @dma_iommu_ops for non-bypass DMA. These ops
      need an iommu_table pointer which is stored in
      dev->archdata.iommu_table_base. It is initialized during
      pcibios_setup_device() which handles boot time devices. However when a
      device is taken from the system in order to pass it through, the
      default IOMMU table is destroyed but the pointer in a device is not
      updated; also when a device is returned back to the system, a new
      table pointer is not stored in dev->archdata.iommu_table_base either.
      So when a just returned device tries using IOMMU, it crashes on
      accessing stale iommu_table or its members.
      
      This calls set_iommu_table_base() when the default window is created.
      Note it used to be there before but was wrongly removed (see "fixes").
      The crash was not noticed earlier because these days most devices simply use bypass.
      
      This adds set_iommu_table_base(NULL) when a device is taken from the
      system to make it clear that IOMMU DMA cannot be used past that point.
      
      Fixes: c4e9d3c1 ("powerpc/powernv/pseries: Rework device adding to IOMMU groups")
      Cc: stable@vger.kernel.org # v5.0+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
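
The pointer lifecycle described in this commit can be modelled in userspace as follows. This is a sketch under stated assumptions: `dev_model`, `table_model` and the `model_*` functions are hypothetical and only mimic the role of `dev->archdata.iommu_table_base` and `set_iommu_table_base()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for iommu_table and the per-device pointer. */
struct table_model {
	int id;
};

struct dev_model {
	struct table_model *iommu_table_base;
};

/* Plays the role of set_iommu_table_base() in this model. */
static void model_set_table_base(struct dev_model *dev, struct table_model *tbl)
{
	dev->iommu_table_base = tbl;
}

/* Device taken from the system: the default table is about to go away,
 * so the cached pointer is cleared instead of being left dangling. */
static void model_take_device(struct dev_model *dev)
{
	model_set_table_base(dev, NULL);
}

/* Device returned to the system: a new default window exists, so the
 * pointer must be refreshed to the new table. */
static void model_return_device(struct dev_model *dev, struct table_model *fresh)
{
	assert(fresh != NULL);
	model_set_table_base(dev, fresh);
}

/* IOMMU DMA is only possible while a table is attached. */
static int model_can_use_iommu_dma(const struct dev_model *dev)
{
	return dev->iommu_table_base != NULL;
}
```

With the pointer cleared on take-over and refreshed on return, a device in this model never sees a table that has already been destroyed.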
  11. 01 Jul, 2019 1 commit
  12. 31 May, 2019 1 commit
  13. 03 May, 2019 1 commit
  14. 30 Apr, 2019 1 commit
  15. 13 Mar, 2019 1 commit
    • treewide: add checks for the return value of memblock_alloc*() · 8a7f97b9
      Mike Rapoport committed
      Add check for the return value of memblock_alloc*() functions and call
      panic() in case of error.  The panic message repeats the one used by
      panicing memblock allocators with adjustment of parameters to include
      only relevant ones.
      
      The replacement was mostly automated with semantic patches like the one
      below with manual massaging of format strings.
      
        @@
        expression ptr, size, align;
        @@
        ptr = memblock_alloc(size, align);
        + if (!ptr)
        + 	panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align);
      
      [anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type]
        Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org
      [rppt@linux.ibm.com: fix format strings for panics after memblock_alloc]
        Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com
      [rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails]
        Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx
      [akpm@linux-foundation.org: fix xtensa printk warning]
      Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Reviewed-by: Guo Ren <ren_guo@c-sky.com>		[c-sky]
      Acked-by: Paul Burton <paul.burton@mips.com>		[MIPS]
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>	[s390]
      Reviewed-by: Juergen Gross <jgross@suse.com>		[Xen]
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Acked-by: Max Filippov <jcmvbkbc@gmail.com>		[xtensa]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
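
The check-and-panic pattern that the semantic patch adds can be mimicked in userspace. The sketch below is an analogue only: `alloc_or_die()` is a hypothetical helper, `aligned_alloc()` stands in for `memblock_alloc()`, and `abort()` stands in for `panic()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Allocate `size` bytes aligned to `align`, or terminate with a message
 * shaped like the one in the semantic patch.
 * Preconditions (from aligned_alloc): align is a power of two and size
 * is a multiple of align.
 */
static void *alloc_or_die(size_t size, size_t align)
{
	void *ptr;

	assert(align != 0 && (align & (align - 1)) == 0);
	assert(size % align == 0);

	ptr = aligned_alloc(align, size);
	if (!ptr) {
		fprintf(stderr, "%s: Failed to allocate %zu bytes align=0x%zx\n",
			__func__, size, align);
		abort();
	}
	return ptr;
}
```

Callers can then use the returned pointer without a further NULL check, which is the property the treewide change establishes for early boot allocations.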
  16. 28 Feb, 2019 1 commit
    • powerpc/powernv/ioda: Fix locked_vm counting for memory used by IOMMU tables · 11f5acce
      Alexey Kardashevskiy committed
      We store 2 multilevel tables in iommu_table - one for the hardware and
      one with the corresponding userspace addresses. Before allocating
      the tables, the iommu_table_group_ops::get_table_size() hook returns
      the combined size of the two and VFIO SPAPR TCE IOMMU driver adjusts
      the locked_vm counter correctly. When the table is actually allocated,
      the amount of allocated memory is stored in iommu_table::it_allocated_size
      and used to decrement the locked_vm counter when we release the memory
      used by the table; .get_table_size() and .create_table() calculate it
      independently but the result is expected to be the same.
      
      However the allocator does not add the userspace table size to
      .it_allocated_size so when we destroy the table because of VFIO PCI
      unplug (i.e. VFIO container is gone but the userspace keeps running),
      we decrement locked_vm by just half the size of the memory we are
      releasing.
      
      To make things worse, since we enabled on-demand allocation of
      indirect levels, it_allocated_size contains only the amount of memory
      actually allocated at the table creation time which can just be a
      fraction. It is not a problem with incrementing locked_vm (as
      get_table_size() value is used) but it is with decrementing.
      
      As a result, we leak locked_vm and may not be able to allocate more
      IOMMU tables after a few iterations of hotplug/unplug.
      
      This sets it_allocated_size in the pnv_pci_ioda2_ops::create_table()
      hook to what pnv_pci_ioda2_get_table_size() returns so from now on we
      have a single place which calculates the maximum memory a table can
      occupy. The original meaning of it_allocated_size is somewhat lost now
      though.
      
      We do not ditch it_allocated_size whatsoever here and we do not call
      get_table_size() from vfio_iommu_spapr_tce.c when decrementing
      locked_vm as we may have multiple IOMMU groups per container and even
      though they all are supposed to have the same get_table_size()
      implementation, there is a small chance for failure or confusion.
      
      Fixes: 090bad39 ("powerpc/powernv: Add indirect levels to it_userspace")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
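
The accounting mismatch described in this commit can be reproduced with a small userspace model. It is a sketch with hypothetical names (`mm_model`, `table_model`, `model_*`) and made-up table sizes. The `fixed` flag selects between the old behaviour, where `it_allocated_size` covers only the hardware table, and the behaviour after this commit, where it matches what `get_table_size()` returns.

```c
#include <assert.h>

/* Hypothetical stand-in for the locked_vm counter, in bytes. */
struct mm_model {
	long locked_vm;
};

struct table_model {
	long hw_size;           /* hardware table */
	long user_size;         /* table of userspace addresses */
	long it_allocated_size; /* amount subtracted from locked_vm on free */
};

/* Combined size of both tables, as charged to locked_vm up front. */
static long model_get_table_size(const struct table_model *t)
{
	return t->hw_size + t->user_size;
}

static void model_create_table(struct mm_model *mm, struct table_model *t, int fixed)
{
	mm->locked_vm += model_get_table_size(t);
	/* Old behaviour: only the hardware table is recorded. */
	t->it_allocated_size = fixed ? model_get_table_size(t) : t->hw_size;
}

static void model_free_table(struct mm_model *mm, struct table_model *t)
{
	assert(t->it_allocated_size >= 0);
	mm->locked_vm -= t->it_allocated_size;
	t->it_allocated_size = 0;
}

/* locked_vm left over after repeated hotplug/unplug cycles. */
static long model_leak_after(int iterations, int fixed)
{
	struct mm_model mm = { 0 };
	struct table_model t = { .hw_size = 4096, .user_size = 4096 };

	for (int i = 0; i < iterations; i++) {
		model_create_table(&mm, &t, fixed);
		model_free_table(&mm, &t);
	}
	return mm.locked_vm;
}
```

In the unfixed model every cycle leaves the userspace table's size behind in `locked_vm`, so the leak grows linearly with the number of hotplug/unplug iterations. With the fix the counter returns to zero.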
  17. 19 Feb, 2019 1 commit
    • powerpc/powernv/sriov: Register IOMMU groups for VFs · 8f5b2734
      Alexey Kardashevskiy committed
      The compound IOMMU group rework moved iommu_register_group() together
      in pnv_pci_ioda_setup_iommu_api() (which is a part of
      ppc_md.pcibios_fixup). As a result, pnv_ioda_setup_bus_iommu_group()
      does not create groups any more, it only adds devices to groups.
      
      This works fine for boot time devices. However IOMMU groups for
      SRIOV's VFs were added by pnv_ioda_setup_bus_iommu_group() so this got
      broken: pnv_tce_iommu_bus_notifier() expects a group to be registered
      for VF and it is not.
      
      This adds missing group registration and adds a NULL pointer check
      into the bus notifier so we won't crash if there is no group, although
      it is not expected to happen now because of the change above.
      
      Example oops seen prior to this patch:
      
        $ echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
        Unable to handle kernel paging request for data at address 0x00000030
        Faulting instruction address: 0xc0000000004a6018
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE SMP NR_CPUS=2048 NUMA PowerNV
        CPU: 46 PID: 7006 Comm: bash Not tainted 4.15-ish
        NIP:  c0000000004a6018 LR: c0000000004a6014 CTR: 0000000000000000
        REGS: c000008fc876b400 TRAP: 0300   Not tainted  (4.15-ish)
        MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
        CFAR: c000000000d0be20 DAR: 0000000000000030 DSISR: 40000000 SOFTE: 1
        ...
        NIP sysfs_do_create_link_sd.isra.0+0x68/0x150
        LR  sysfs_do_create_link_sd.isra.0+0x64/0x150
        Call Trace:
          pci_dev_type+0x0/0x30 (unreliable)
          iommu_group_add_device+0x8c/0x600
          iommu_add_device+0xe8/0x180
          pnv_tce_iommu_bus_notifier+0xb0/0xf0
          notifier_call_chain+0x9c/0x110
          blocking_notifier_call_chain+0x64/0xa0
          device_add+0x524/0x7d0
          pci_device_add+0x248/0x450
          pci_iov_add_virtfn+0x294/0x3e0
          pci_enable_sriov+0x43c/0x580
          mlx5_core_sriov_configure+0x15c/0x2f0 [mlx5_core]
          sriov_numvfs_store+0x180/0x240
          dev_attr_store+0x3c/0x60
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1ac/0x240
          __vfs_write+0x3c/0x70
          vfs_write+0xd8/0x220
          SyS_write+0x6c/0x110
          system_call+0x58/0x6c
      
      Fixes: 0bd97167 ("powerpc/powernv/npu: Add compound IOMMU groups")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reported-by: Santwana Samantray <santwana.samantray@in.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>