1. 23 Apr, 2019 (1 commit)
  2. 04 Apr, 2019 (1 commit)
    • vfio/pci: use correct format characters · 426b046b
      Committed by Louis Taylor
      When compiling with -Wformat, clang emits the following warnings:
      
      drivers/vfio/pci/vfio_pci.c:1601:5: warning: format specifies type
            'unsigned short' but the argument has type 'unsigned int' [-Wformat]
                                      vendor, device, subvendor, subdevice,
                                      ^~~~~~
      
      drivers/vfio/pci/vfio_pci.c:1601:13: warning: format specifies type
            'unsigned short' but the argument has type 'unsigned int' [-Wformat]
                                      vendor, device, subvendor, subdevice,
                                              ^~~~~~
      
      drivers/vfio/pci/vfio_pci.c:1601:21: warning: format specifies type
            'unsigned short' but the argument has type 'unsigned int' [-Wformat]
                                      vendor, device, subvendor, subdevice,
                                                      ^~~~~~~~~
      
      drivers/vfio/pci/vfio_pci.c:1601:32: warning: format specifies type
            'unsigned short' but the argument has type 'unsigned int' [-Wformat]
                                      vendor, device, subvendor, subdevice,
                                                                 ^~~~~~~~~
      
      drivers/vfio/pci/vfio_pci.c:1605:5: warning: format specifies type
            'unsigned short' but the argument has type 'unsigned int' [-Wformat]
                                      vendor, device, subvendor, subdevice,
                                      ^~~~~~
      
      drivers/vfio/pci/vfio_pci.c:1605:13: warning: format specifies type
            'unsigned short' but the argument has type 'unsigned int' [-Wformat]
                                      vendor, device, subvendor, subdevice,
                                              ^~~~~~
      
      drivers/vfio/pci/vfio_pci.c:1605:21: warning: format specifies type
            'unsigned short' but the argument has type 'unsigned int' [-Wformat]
                                      vendor, device, subvendor, subdevice,
                                                      ^~~~~~~~~
      
      drivers/vfio/pci/vfio_pci.c:1605:32: warning: format specifies type
            'unsigned short' but the argument has type 'unsigned int' [-Wformat]
                                      vendor, device, subvendor, subdevice,
                                                                 ^~~~~~~~~
      The types of these arguments are unconditionally defined, so this patch
      updates the format characters to the correct ones for unsigned ints.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/378
      Signed-off-by: Louis Taylor <louis@kragniz.eu>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      426b046b
  3. 19 Feb, 2019 (2 commits)
    • vfio_pci: Enable memory accesses before calling pci_map_rom · 0cfd027b
      Committed by Eric Auger
      pci_map_rom()/pci_get_rom_size() perform memory accesses in the ROM.
      If Memory Space accesses are disabled, readw() is likely
      to trigger a synchronous external abort on some platforms.
      
      In case memory accesses were disabled, re-enable them before the
      call and disable them again just after.
      
      Fixes: 89e1f7d4 ("vfio: Add PCI device driver")
      Signed-off-by: Eric Auger <eric.auger@redhat.com>
      Suggested-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      0cfd027b
    • vfio/pci: Restore device state on PM transition · 51ef3a00
      Committed by Alex Williamson
      PCI core handles save and restore of device state around reset, but
      when using pci_set_power_state() we can unintentionally trigger a soft
      reset of the device, where PCI core only restores the BAR state.  If
      we're using vfio-pci's idle D3 support to try to put devices into low
      power when unused, this might trigger a reset when the device is woken
      for use.  Also power state management by the user, or within a guest,
      can put the device into D3 power state with potentially limited
      ability to restore the device if it should undergo a reset.  The PCI
      spec does not define the extent of a soft reset and many devices
      reporting soft reset on D3->D0 transition do not undergo a PCI config
      space reset.  It's therefore assumed safe to unconditionally restore
      the remainder of the state if the device indicates soft reset
      support, even on a user initiated wakeup.
      
      Implement a wrapper in vfio-pci to tag devices reporting PM reset
      support, save their state on transitions into D3 and restore on
      transitions back to D0.
      Reported-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      51ef3a00
  4. 21 Dec, 2018 (3 commits)
    • vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver · 7f928917
      Committed by Alexey Kardashevskiy
      POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
      pluggable PCIe devices but still have PCIe links which are used
      for config space and MMIO. In addition, the GPUs have 6 NVLinks
      which are connected to other GPUs and the POWER9 CPU. POWER9 chips
      have a special unit on the die called an NPU, which is an NVLink2 host bus
      adapter with p2p connections to 2 or 3 GPUs, with 3 or 2 NVLinks to each.
      These systems also support ATS (address translation services), which is
      part of the NVLink2 protocol. Such GPUs also share their on-board RAM
      (16GB or 32GB) with the system via the same NVLink2, so a CPU has
      cache-coherent access to the GPU RAM.
      
      This exports GPU RAM to userspace as a new VFIO device region and
      preregisters the new memory as device memory, as it might be used for DMA.
      Pfns are inserted from the fault handler because the GPU memory is not
      onlined until the vendor driver is loaded and has trained the NVLinks;
      doing this earlier causes low-level errors which we fence in the firmware,
      so it does not hurt the host system but is still better avoided. For the
      same reason this does not map GPU RAM into the host kernel (the usual
      approach for emulated access otherwise).
      
      This exports an ATSD (Address Translation Shootdown) register of the NPU
      which allows TLB invalidations inside the GPU on behalf of an operating
      system. The register conveniently occupies a single 64k page. It is also
      presented to userspace as a new VFIO device region. One NPU has 8 ATSD
      registers, each of which can be used for TLB invalidation in a GPU linked
      to that NPU. This allocates one ATSD register per NVLink bridge, allowing
      up to 6 registers to be passed through. Due to a host firmware bug (only
      recently fixed), only 1 ATSD register per NPU was actually advertised to
      the host system, so this passes that lone register via the first NVLink
      bridge device in the group, which is still enough as QEMU collects them
      all back and presents them to the guest via vPHB to mimic the emulated
      NPU PHB on the host.
      
      In order to provide the userspace with the information about GPU-to-NVLink
      connections, this exports an additional capability called "tgt"
      (which is an abbreviated host system bus address). The "tgt" property
      tells the GPU its own system address and allows the guest driver to
      conglomerate the routing information so each GPU knows how to get directly
      to the other GPUs.
      
      For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
      know LPID (a logical partition ID or a KVM guest hardware ID in other
      words) and PID (a memory context ID of a userspace process, not to be
      confused with a linux pid). This assigns a GPU to an LPID in the NPU,
      which is why this adds a listener for KVM on an IOMMU group. A PID comes
      via NVLink from a GPU and the NPU uses a PID wildcard to pass it through.
      
      This requires coherent memory and ATSD to be available on the host, as
      the GPU vendor only supports configurations with both features enabled;
      other configurations are known not to work. Because of this, and because
      of the way the features are advertised to the host system
      (a device tree with very platform-specific properties),
      this requires the POWERNV platform to be enabled.
      
      The V100 GPUs do not advertise any of these capabilities via config
      space, and there is more than one device ID, so this relies on
      the platform to tell whether these GPUs have special abilities such as
      NVLinks.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      7f928917
    • vfio_pci: Allow regions to add own capabilities · c2c0f1cd
      Committed by Alexey Kardashevskiy
      VFIO regions already support region capabilities with a limited set of
      fields. However, a subdriver might have to report additional bits
      to userspace.
      
      This adds an add_capability() hook to vfio_pci_regops.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c2c0f1cd
    • vfio_pci: Allow mapping extra regions · a15b1883
      Committed by Alexey Kardashevskiy
      So far we have only allowed mapping of MMIO BARs to userspace. However,
      there are GPUs with on-board coherent RAM accessible via side
      channels which we also want to map to userspace. The first client
      for this is the NVIDIA V100 GPU with NVLink2 direct links to a POWER9
      NPU-enabled CPU; such GPUs have 16GB of RAM which is coherently mapped
      into the system address space, and we are going to export it as an extra
      PCI region.
      
      We already support extra PCI regions, and this adds support for mapping
      them to userspace.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a15b1883
  5. 13 Dec, 2018 (1 commit)
  6. 26 Sep, 2018 (1 commit)
    • vfio/pci: Mask buggy SR-IOV VF INTx support · db04264f
      Committed by Alex Williamson
      The SR-IOV spec requires that VFs must report zero for the INTx pin
      register as VFs are precluded from INTx support.  It's much easier for
      the host kernel to understand whether a device is a VF and therefore
      whether a non-zero pin register value is bogus than it is to do the
      same in userspace.  Override the INTx count for such devices and
      virtualize the pin register to provide a consistent view of the device
      to the user.
      
      As this is clearly a spec violation, warn about it to support hardware
      validation, but also provide a known whitelist as it doesn't do much
      good to continue complaining if the hardware vendor doesn't plan to
      fix it.
      
      Known devices with this issue: 8086:270c
      Tested-by: Gage Eads <gage.eads@intel.com>
      Reviewed-by: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      db04264f
  7. 07 Aug, 2018 (2 commits)
  8. 20 Jul, 2018 (2 commits)
  9. 19 Jul, 2018 (1 commit)
  10. 27 Mar, 2018 (1 commit)
    • vfio/pci: Add ioeventfd support · 30656177
      Committed by Alex Williamson
      The ioeventfd here is actually irqfd handling of an ioeventfd such as
      supported in KVM.  A user is able to pre-program a device write to
      occur when the eventfd triggers.  This is yet another instance of
      eventfd-irqfd triggering between KVM and vfio.  The impetus for this
      is high frequency writes to pages which are virtualized in QEMU.
      Enabling this near-direct write path for selected registers within
      the virtualized page can improve performance and reduce overhead.
      Specifically this is initially targeted at NVIDIA graphics cards where
      the driver issues a write to an MMIO register within a virtualized
      region in order to allow the MSI interrupt to re-trigger.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      30656177
  11. 22 Mar, 2018 (1 commit)
  12. 21 Dec, 2017 (3 commits)
    • vfio-pci: Allow mapping MSIX BAR · a32295c6
      Committed by Alexey Kardashevskiy
      By default VFIO disables mapping of MSIX BAR to the userspace as
      the userspace may program it in a way allowing spurious interrupts;
      instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl.
      In order to eliminate guessing from the userspace about what is
      mmapable, VFIO also advertises a sparse list of regions allowed to mmap.
      
      This works fine as long as the system page size equals the MSIX
      alignment requirement, which is 4KB. However, with a bigger page size
      the existing code prohibits mapping the non-MSIX parts of a page with
      MSIX structures, so these parts have to be emulated via slow reads/writes
      on a VFIO device fd. If these emulated bits are accessed often, this has
      a serious impact on performance.
      
      This allows mmap of the entire BAR containing MSIX vector table.
      
      This removes the sparse capability for PCI devices as it becomes useless.
      
      As userspace needs to know for sure whether mmapping of the BAR
      containing the MSIX vector table can succeed, this adds a new capability -
      VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - which explicitly tells userspace
      that the entire BAR can be mmapped.
      
      This does not touch the MSIX mangling in the BAR read/write handlers as
      we are doing this just to enable direct access to non MSIX registers.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw - fixup whitespace, trim function name]
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      a32295c6
    • vfio: Simplify capability helper · dda01f78
      Committed by Alex Williamson
      The vfio_info_add_capability() helper requires the caller to pass a
      capability ID, which it then uses to fill in header fields, assuming
      hard coded versions.  This makes for an awkward and rigid interface.
      The only thing we want this helper to do is allocate sufficient
      space in the caps buffer and chain this capability into the list.
      Reduce it to that simple task.
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com>
      Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      dda01f78
    • vfio-pci: Mask INTx if a device is not capable of enabling it · 2170dd04
      Committed by Alexey Kardashevskiy
      At the moment VFIO rightfully assumes that INTx is supported if
      the interrupt pin is not set to zero in the device config space.
      However if that is not the case (the pin is not zero but pdev->irq is),
      vfio_intx_enable() fails.
      
      In order to prevent the userspace from trying to enable INTx when we know
      that it cannot work, let's mask the PCI_INTERRUPT_PIN register.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      2170dd04
  13. 27 Jul, 2017 (1 commit)
  14. 13 Jun, 2017 (1 commit)
  15. 04 Jan, 2017 (1 commit)
  16. 17 Nov, 2016 (2 commits)
  17. 27 Oct, 2016 (1 commit)
    • vfio/pci: Fix integer overflows, bitmask check · 05692d70
      Committed by Vlad Tsyrklevich
      The VFIO_DEVICE_SET_IRQS ioctl did not sufficiently sanitize
      user-supplied integers, potentially allowing memory corruption. This
      patch adds appropriate integer overflow checks, checks the range bounds
      for VFIO_IRQ_SET_DATA_NONE, and also verifies that only a single
      element in the VFIO_IRQ_SET_DATA_TYPE_MASK bitmask is set.
      VFIO_IRQ_SET_ACTION_TYPE_MASK is already correctly checked later in
      vfio_pci_set_irqs_ioctl().
      
      Furthermore, a kzalloc is changed to a kcalloc because the use of a
      kzalloc with an integer multiplication allowed an integer overflow
      condition to be reached without this patch. kcalloc checks for overflow
      and should prevent a similar occurrence.
      Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      05692d70
  18. 09 Jul, 2016 (1 commit)
    • vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive · 05f0c03f
      Committed by Yongji Xie
      The current vfio-pci implementation disallows mmapping of
      sub-page (size < PAGE_SIZE) MMIO BARs because the mmio
      page of such a BAR may be shared with other BARs. This causes
      performance issues when we pass through a PCI device with
      this kind of BAR: the guest is not able to handle the mmio
      accesses directly, which leads to mmio emulation in the host.
      
      However, not all sub-page BARs share a page with other BARs.
      We should allow mmapping of the sub-page MMIO BARs which we can
      make sure will not share a page with other BARs.
      
      This patch adds support for this case, and tries to add a
      dummy resource to reserve the remainder of the page, into which
      a hot-added device's BAR might otherwise be assigned. It is not
      necessary to handle the case where the BAR is not page aligned,
      because we cannot expect the BAR to be assigned to the same
      location within a page in the guest when we pass it through, and
      it is hard to access such a BAR from userspace because we have
      no way to get the BAR's location within a page.
      Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      05f0c03f
  19. 29 Apr, 2016 (1 commit)
    • vfio/pci: Hide broken INTx support from user · 45074405
      Committed by Alex Williamson
      INTx masking has two components, the first is that we need the ability
      to prevent the device from continuing to assert INTx.  This is
      provided via the DisINTx bit in the command register and is the only
      thing we can really probe for when testing if INTx masking is
      supported.  The second component is that the device needs to indicate
      if INTx is asserted via the interrupt status bit in the device status
      register.  With these two features we can generically determine if one
      of the devices we own is asserting INTx, signal the user, and mask the
      interrupt while the user services the device.
      
      Generally if one or both of these components is broken we resort to
      APIC level interrupt masking, which requires an exclusive interrupt
      since we have no way to determine the source of the interrupt in a
      shared configuration.  This often makes it difficult or impossible to
      configure the system for userspace use of the device, for an interrupt
      mode that the user may not need.
      
      One possible configuration of broken INTx masking is that the DisINTx
      support is fully functional, but the interrupt status bit never
      signals interrupt assertion.  In this case we do have the ability to
      prevent the device from asserting INTx, but lack the ability to
      identify the interrupt source.  For this case we can simply pretend
      that the device lacks INTx support entirely, keeping DisINTx set on
      the physical device, virtualizing this bit for the user, and
      virtualizing the interrupt pin register to indicate no INTx support.
      We already support virtualization of the DisINTx bit and already
      virtualize the interrupt pin for platforms without INTx support.  By
      tying these components together, setting DisINTx on open and reset,
      and identifying devices broken in this particular way, we can provide
      support for them without the handicap of APIC level INTx masking.
      
      Intel i40e (XL710/X710) 10/20/40GbE NICs have been identified as being
      broken in this specific way.  We leave the vfio-pci.nointxmask option
      as a mechanism to bypass this support, enabling INTx on the device
      with all the requirements of APIC level masking.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Cc: John Ronciak <john.ronciak@intel.com>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      45074405
  20. 28 Feb, 2016 (1 commit)
  21. 26 Feb, 2016 (1 commit)
  22. 23 Feb, 2016 (5 commits)
  23. 22 Dec, 2015 (1 commit)
    • vfio: Include No-IOMMU mode · 03a76b60
      Committed by Alex Williamson
      There is really no way to safely give a user full access to a DMA
      capable device without an IOMMU to protect the host system.  There is
      also no way to provide DMA translation, for use cases such as device
      assignment to virtual machines.  However, there are still those users
      that want userspace drivers even under those conditions.  The UIO
      driver exists for this use case, but does not provide the degree of
      device access and programming that VFIO has.  In an effort to avoid
      code duplication, this introduces a No-IOMMU mode for VFIO.
      
      This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
      the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
      should make it very clear that this mode is not safe.  Additionally,
      CAP_SYS_RAWIO privileges are necessary to work with groups and
      containers using this mode.  Groups making use of this support are
      named /dev/vfio/noiommu-$GROUP and can only make use of the special
      VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
      binding a device without a native IOMMU group to a VFIO bus driver
      will taint the kernel and should therefore not be considered
      supported.  This patch includes no-iommu support for the vfio-pci bus
      driver only.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      03a76b60
  24. 04 Dec, 2015 (1 commit)
  25. 20 Nov, 2015 (1 commit)
  26. 05 Nov, 2015 (1 commit)
    • vfio: Include No-IOMMU mode · 033291ec
      Committed by Alex Williamson
      There is really no way to safely give a user full access to a DMA
      capable device without an IOMMU to protect the host system.  There is
      also no way to provide DMA translation, for use cases such as device
      assignment to virtual machines.  However, there are still those users
      that want userspace drivers even under those conditions.  The UIO
      driver exists for this use case, but does not provide the degree of
      device access and programming that VFIO has.  In an effort to avoid
      code duplication, this introduces a No-IOMMU mode for VFIO.
      
      This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
      the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
      should make it very clear that this mode is not safe.  Additionally,
      CAP_SYS_RAWIO privileges are necessary to work with groups and
      containers using this mode.  Groups making use of this support are
      named /dev/vfio/noiommu-$GROUP and can only make use of the special
      VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
      binding a device without a native IOMMU group to a VFIO bus driver
      will taint the kernel and should therefore not be considered
      supported.  This patch includes no-iommu support for the vfio-pci bus
      driver only.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      033291ec
  27. 10 Jun, 2015 (1 commit)
    • vfio/pci: Fix racy vfio_device_get_from_dev() call · 20f30017
      Committed by Alex Williamson
      Testing the driver for a PCI device is racy, it can be all but
      complete in the release path and still report the driver as ours.
      Therefore we can't trust drvdata to be valid.  This race can sometimes
      be seen when one port of a multifunction device is being unbound from
      the vfio-pci driver while another function is being released by the
      user and attempting a bus reset.  The device in the remove path is
      found as a dependent device for the bus reset of the release path
      device, the driver is still set to vfio-pci, but the drvdata has
      already been cleared, resulting in a null pointer dereference.
      
      To resolve this, fix vfio_device_get_from_dev() to not take the
      dev_get_drvdata() shortcut and instead traverse through the
      iommu_group, vfio_group, vfio_device path to get a reference we
      can trust.  Once we have that reference, we know the device isn't
      in transition and we can test to make sure the driver is still what
      we expect, so that we don't interfere with devices we don't own.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      20f30017
  28. 02 May, 2015 (1 commit)