1. 22 May 2019, 2 commits
  2. 26 Apr 2019, 1 commit
    • spapr: Support NVIDIA V100 GPU with NVLink2 · ec132efa
      Alexey Kardashevskiy authored
      NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
      space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
      implements special regions for such GPUs and emulates an NVLink bridge.
      NVLink2-enabled POWER9 CPUs also provide address translation services
      which includes an ATS shootdown (ATSD) register exported via the NVLink
      bridge device.
      
      This adds a quirk to VFIO to map the GPU memory and create an MR;
      the new MR is stored in a PCI device as a QOM link. The sPAPR PCI code
      uses this to get the MR and map it to the system address space.
      Another quirk does the same for ATSD.
      
      This adds additional steps to sPAPR PHB setup:
      
      1. Search for specific GPUs and NPUs, collect findings in
      sPAPRPHBState::nvgpus, manage system address space mappings;
      
      2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
      "memory-block", "link-speed" to advertise the NVLink2 function to
      the guest;
      
      3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
      
      4. Add new memory blocks (with extra "linux,memory-usable" to prevent
      the guest OS from accessing the new memory until it is onlined) and
      npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
      uses it for link discovery.
      
      This allocates space for GPU RAM and ATSD like we do for MMIOs by
      adding 2 new parameters to the phb_placement() hook. Older machine types
      set these to zero.
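      
      As a rough sketch of that hook change (parameter names here are
      illustrative, not quoted from the patch), the placement callback gains
      two extra out-parameters which pre-NVLink2 machine types simply zero:
      
          /* sketch: older machine types advertise no GPU RAM or ATSD windows */
          static void phb_placement_legacy(hwaddr *nv2gpa, hwaddr *nv2atsd)
          {
              *nv2gpa  = 0;   /* no window for NVLink2 GPU RAM */
              *nv2atsd = 0;   /* no window for ATSD registers  */
          }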
      
      This puts new memory nodes in a separate NUMA node as the GPU RAM
      needs to be configured equally distant from any other node in the system.
      Unlike the host setup, which assigns NUMA ids from 255 downwards, this
      adds new NUMA nodes after the user-configured nodes, or starting from 1
      if none were configured.
      
      This adds a requirement similar to EEH - one IOMMU group per vPHB.
      The reason for this is that ATSD registers belong to a physical NPU,
      so they cannot invalidate translations for GPUs attached to another NPU.
      This is guaranteed by the host platform, which does not mix NVLink bridges
      or GPUs from different NPUs in the same IOMMU group. If more than one
      IOMMU group is detected on a vPHB, this disables ATSD support for that
      vPHB and prints a warning.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for vfio portions]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Message-Id: <20190312082103.130561-1-aik@ozlabs.ru>
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      ec132efa
  3. 19 Apr 2019, 1 commit
  4. 12 Mar 2019, 1 commit
  5. 24 Jan 2019, 1 commit
    • trace: forbid use of %m in trace event format strings · 772f1b37
      Daniel P. Berrangé authored
      The '%m' format instructs glibc's printf()/syslog() implementation to
      insert the contents of strerror(errno). Since this is a glibc extension
      it should generally be avoided in QEMU due to the need for portability to a
      variety of platforms.
      
      Even though vfio is Linux-only code that could otherwise use "%m", it
      must still be avoided in trace-events files because several of the
      backends do not use the format string and so this error information is
      invisible to them.
      
      The errno string value should be given as an explicit trace argument
      instead, making it accessible to all backends. This also allows it to
      work correctly with future patches that use the format string with
      systemtap's simple printf code.
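      
      For example (the trace point below is hypothetical, shown only to
      illustrate the pattern of passing the errno string explicitly):
      
          # trace-events: take the error string as an argument instead of "%m"
          vfio_region_read_err(const char *name, int err, const char *msg) "%s: read failed: %d (%s)"
      
          /* call site (sketch): */
          trace_vfio_region_read_err(vbasedev->name, errno, strerror(errno));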
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
      Message-id: 20190123120016.4538-4-berrange@redhat.com
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      772f1b37
  6. 20 Dec 2018, 2 commits
  7. 19 Oct 2018, 3 commits
    • vfio: Clean up error reporting after previous commit · c3b8e3e0
      Markus Armbruster authored
      The previous commit changed vfio's warning messages from
      
          vfio warning: DEV-NAME: Could not frobnicate
      
      to
      
          warning: vfio DEV-NAME: Could not frobnicate
      
      To match this change, change error messages from
      
          vfio error: DEV-NAME: On fire
      
      to
      
          vfio DEV-NAME: On fire
      
      Note the loss of "error".  If we think marking error messages that way
      is a good idea, we should mark *all* error messages, i.e. make
      error_report() print it.
      
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Markus Armbruster <armbru@redhat.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Message-Id: <20181017082702.5581-7-armbru@redhat.com>
      c3b8e3e0
    • vfio: Use warn_report() & friends to report warnings · e1eb292a
      Markus Armbruster authored
      The vfio code reports warnings like
      
          error_report(WARN_PREFIX "Could not frobnicate", DEV-NAME);
      
      where WARN_PREFIX is defined so the message comes out as
      
          vfio warning: DEV-NAME: Could not frobnicate
      
      This usage predates the introduction of warn_report() & friends in
      commit 97f40301.  It's time to convert to that interface.  Since
      these functions already prefix the message with "warning: ", replace
      WARN_PREFIX by VFIO_MSG_PREFIX, so the messages come out like
      
          warning: vfio DEV-NAME: Could not frobnicate
      
      The next commit will replace ERR_PREFIX.
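      
      An illustrative before/after (the device-name argument is a placeholder,
      not a specific call site from the patch):
      
          /* before: open-coded prefix via error_report() */
          error_report(WARN_PREFIX "Could not frobnicate", vbasedev->name);
      
          /* after: warn_report() adds the "warning: " prefix itself */
          warn_report(VFIO_MSG_PREFIX "Could not frobnicate", vbasedev->name);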
      
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Markus Armbruster <armbru@redhat.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Message-Id: <20181017082702.5581-6-armbru@redhat.com>
      e1eb292a
    • error: Fix use of error_prepend() with &error_fatal, &error_abort · 4b576648
      Markus Armbruster authored
      From include/qapi/error.h:
      
        * Pass an existing error to the caller with the message modified:
        *     error_propagate(errp, err);
        *     error_prepend(errp, "Could not frobnicate '%s': ", name);
      
      Fei Li pointed out that doing error_propagate() first doesn't work
      well when @errp is &error_fatal or &error_abort: the error_prepend()
      is never reached.
      
      Since I doubt fixing the documentation will stop people from getting
      it wrong, introduce error_propagate_prepend(), in the hope that it
      lures people away from using its constituents in the wrong order.
      Update the instructions in error.h accordingly.
      
      Convert existing error_prepend() calls next to error_propagate() to
      error_propagate_prepend().  If any of these get reached with
      &error_fatal or &error_abort, the error messages improve.  I didn't
      check whether that's the case anywhere.
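      
      A minimal sketch of the resulting pattern, reusing the frobnicate
      example from error.h:
      
          Error *err = NULL;
      
          frobnicate(name, &err);
          if (err) {
              /* safe even when errp is &error_fatal or &error_abort:
               * the message is prepended before the error is propagated */
              error_propagate_prepend(errp, err,
                                      "Could not frobnicate '%s': ", name);
              return;
          }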
      
      Cc: Fei Li <fli@suse.com>
      Signed-off-by: Markus Armbruster <armbru@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Message-Id: <20181017082702.5581-2-armbru@redhat.com>
      4b576648
  8. 16 Oct 2018, 2 commits
  9. 24 Aug 2018, 1 commit
  10. 17 Aug 2018, 1 commit
    • vfio/ccw/pci: Allow devices to opt-in for ballooning · 238e9172
      Alex Williamson authored
      If a vfio assigned device makes use of a physical IOMMU, then memory
      ballooning is necessarily inhibited due to the page pinning and the lack
      of both page-level granularity at the IOMMU and sufficient notifiers to
      remove the page on balloon inflation and add it back on deflation.
      However, not all devices are backed by a physical IOMMU.  In the case
      of mediated devices, if a vendor driver is well synchronized with the
      guest driver, such that only pages actively used by the guest driver
      are pinned by the host mdev vendor driver, then there should be no
      overlap between pages available for the balloon driver and pages
      actively in use by the device.  Under these conditions, ballooning
      should be safe.
      
      vfio-ccw devices are always mediated devices and always operate under
      the constraints above.  Therefore we can consider all vfio-ccw devices
      as balloon compatible.
      
      The situation is far from straightforward with vfio-pci.  These
      devices can be physical devices with physical IOMMU backing or
      mediated devices where it is unknown whether a physical IOMMU is in
      use or whether the vendor driver is well synchronized to the working
      set of the guest driver.  The safest approach is therefore to assume
      all vfio-pci devices are incompatible with ballooning, but allow user
      opt-in should they have further insight into mediated devices.
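      
      A usage sketch of the opt-in (assuming the vfio-pci property is named
      "x-balloon-allowed"; the host address is a placeholder):
      
          -device vfio-pci,host=04:10.0,x-balloon-allowed=on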
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      238e9172
  11. 12 Jul 2018, 1 commit
    • vfio/pci: do not set the PCIDevice 'has_rom' attribute · 26c0ae56
      Cédric Le Goater authored
      PCI devices needing a ROM allocate an optional MemoryRegion with
      pci_add_option_rom(). pci_del_option_rom() does the cleanup when the
      device is destroyed. The only action taken by this routine is to call
      vmstate_unregister_ram() which clears the id string of the optional
      ROM RAMBlock and now also flags the RAMBlock as non-migratable. This
      was recently added by commit b895de50 ("migration: discard
      non-migratable RAMBlocks").
      
      VFIO devices do their own loading of the PCI option ROM in
      vfio_pci_size_rom(). The memory region is switched to an I/O region
      and the PCI attribute 'has_rom' is set but the RAMBlock of the ROM
      region is not allocated. When the associated PCI device is deleted,
      pci_del_option_rom() calls vmstate_unregister_ram() which tries to
      flag a NULL RAMBlock, leading to a SEGV.
      
      It seems that 'has_rom' was set to have memory_region_destroy()
      called, but since commit 469b046e ("memory: remove
      memory_region_destroy") this is not necessary anymore as the
      MemoryRegion is freed automagically.
      
      Remove the PCIDevice 'has_rom' attribute setting in vfio.
      
      Fixes: b895de50 ("migration: discard non-migratable RAMBlocks")
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      26c0ae56
  12. 02 Jul 2018, 1 commit
  13. 05 Jun 2018, 4 commits
    • vfio/pci: Default display option to "off" · 8151a9c5
      Alex Williamson authored
      Commit a9994687 ("vfio/display: core & wireup") added display
      support to vfio-pci with the default being "auto", which breaks
      existing VMs when the vGPU requires GL support but had no previous
      requirement for a GL compatible configuration.  "Off" is the safer
      default as we impose no new requirements to VM configurations.
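      
      A usage sketch (the mdev sysfs path is a placeholder):
      
          -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/UUID,display=off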
      
      Fixes: a9994687 ("vfio/display: core & wireup")
      Cc: qemu-stable@nongnu.org
      Cc: Gerd Hoffmann <kraxel@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      8151a9c5
    • vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly · 2b1dbd0d
      Alex Williamson authored
      With vfio ioeventfd support, we can program vfio-pci to perform a
      specified BAR write when an eventfd is triggered.  This allows the
      KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
      userspace handling for these events.  On the same micro-benchmark
      where the ioeventfd got us to almost 90% of performance versus
      disabling the GeForce quirks, this gets us to within 95%.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      2b1dbd0d
    • vfio/quirks: ioeventfd quirk acceleration · c958c51d
      Alex Williamson authored
      The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
      in device MMIO space.  Normally PCI config space is considered a slow
      path and further optimization is unnecessary, however NVIDIA uses a
      register here to enable the MSI interrupt to re-trigger.  Exiting to
      QEMU for this MSI-ACK handling can therefore rate limit our interrupt
      handling.  Fortunately the MSI-ACK write is easily detected since the
      quirk MemoryRegion otherwise has very few accesses, so simply looking
      for consecutive writes with the same data is sufficient, in this case
      10 consecutive writes with the same data and size is arbitrarily
      chosen.  We configure the KVM ioeventfd with data match, so there's
      no risk of triggering for the wrong data or size, but we do risk that
      pathological driver behavior might consume all of QEMU's file
      descriptors, so we cap ourselves to 10 ioeventfds for this purpose.
      
      In support of the above, generic ioeventfd infrastructure is added
      for vfio quirks.  This automatically initializes an ioeventfd list
      per quirk, disables and frees ioeventfds on exit, and allows
      ioeventfds marked as dynamic to be dropped on device reset.  The
      rationale for this latter feature is that useful ioeventfds may
      depend on specific driver behavior and since we necessarily place a
      cap on our use of ioeventfds, a machine reset is a reasonable point
      at which to assume a new driver and re-profile.
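      
      A simplified sketch of the detection idea (the QuirkState fields and
      helper functions are hypothetical; the real logic lives in the quirk's
      MemoryRegion write handler):
      
          #define MSI_ACK_THRESHOLD 10
      
          static void quirk_mirror_write(QuirkState *q, hwaddr addr,
                                         uint64_t data, unsigned size)
          {
              if (addr == q->last_addr && data == q->last_data &&
                  size == q->last_size) {
                  if (++q->hits == MSI_ACK_THRESHOLD && q->nr_ioeventfds < 10) {
                      /* hypothetical helper: wire a KVM ioeventfd with data match,
                       * marked dynamic so it is dropped again on device reset */
                      quirk_register_ioeventfd(q, addr, size, data, true);
                  }
              } else {
                  q->last_addr = addr;
                  q->last_data = data;
                  q->last_size = size;
                  q->hits = 1;
              }
              quirk_forward_write(q, addr, data, size);   /* pass through as usual */
          }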
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      c958c51d
    • vfio/quirks: Add quirk reset callback · 469d02de
      Alex Williamson authored
      Quirks can be self modifying, provide a hook to allow them to cleanup
      on device reset if desired.
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      469d02de
  14. 27 Apr 2018, 1 commit
    • ui: introduce vfio_display_reset · 8983e3e3
      Tina Zhang authored
      During guest OS reboot, the guest framebuffer is invalid. This can cause
      bugs if the invalid guest framebuffer is still used by the host.
      
      This patch introduces vfio_display_reset, which is invoked during vfio
      display reset. The vfio_display_reset function is used to release the
      invalid display resources, disable scanout mode and replace the invalid
      surface with QemuConsole's DisplaySurface.
      
      This fixes the GPU hang caused by gd_egl_draw during guest OS reboot.
      
      Changes v3->v4:
       - Move dma-buf based display check into the vfio_display_reset().
         (Gerd)
      
      Changes v2->v3:
       - Limit vfio_display_reset to dma-buf based vfio display. (Gerd)
      
      Changes v1->v2:
       - Use dpy_gfx_update_full() update screen after reset. (Gerd)
       - Remove dpy_gfx_switch_surface(). (Gerd)
      Signed-off-by: Tina Zhang <tina.zhang@intel.com>
      Message-id: 1524820266-27079-3-git-send-email-tina.zhang@intel.com
      Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
      8983e3e3
  15. 14 Mar 2018, 3 commits
  16. 06 Mar 2018, 1 commit
  17. 09 Feb 2018, 2 commits
  18. 07 Feb 2018, 4 commits
    • vfio/pci: Add option to disable GeForce quirks · db32d0f4
      Alex Williamson authored
      These quirks are necessary for GeForce, but not for Quadro/GRID/Tesla
      assignment.  Leaving them enabled is fully functional and provides the
      most compatibility, but due to the unique NVIDIA MSI ACK behavior[1],
      it also introduces latency in re-triggering the MSI interrupt.  This
      overhead is typically negligible, but has been shown to adversely
      affect some (very) high interrupt rate applications.  This adds the
      vfio-pci device option "x-no-geforce-quirks=" which can be set to
      "on" to disable this additional overhead.
      
      A follow-on optimization for GeForce might be to make use of an
      ioeventfd to allow KVM to trigger an irqfd in the kernel vfio-pci
      driver, avoiding the bounce through userspace to handle this device
      write.
      
      [1] Background: the NVIDIA driver has been observed to issue a write
      to the MMIO mirror of PCI config space in BAR0 in order to allow the
      MSI interrupt for the device to retrigger.  Older reports indicated a
      write of 0xff to the (read-only) MSI capability ID register, while
      more recently a write of 0x0 is observed at config space offset 0x704,
      non-architected, extended config space of the device (BAR0 offset
      0x88704).  Virtualization of this range is only required for GeForce.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      db32d0f4
    • vfio/pci: Allow relocating MSI-X MMIO · 89d5202e
      Alex Williamson authored
      Recently proposed vfio-pci kernel changes (v4.16) remove the
      restriction preventing userspace from mmap'ing PCI BARs in areas
      overlapping the MSI-X vector table.  This change is primarily intended
      to benefit host platforms which make use of system page sizes larger
      than the PCI spec recommendation for alignment of MSI-X data
      structures (ie. not x86_64).  In the case of POWER systems, the SPAPR
      spec requires the VM to program MSI-X using hypercalls, rendering the
      MSI-X vector table unused in the VM view of the device.  However,
      ARM64 platforms also support 64KB pages and rely on QEMU emulation of
      MSI-X.  Regardless of the kernel driver allowing mmaps overlapping
      the MSI-X vector table, emulation of the MSI-X vector table also
      prevents direct mapping of device MMIO spaces overlapping this page.
      Thanks to the fact that PCI devices have a standard self discovery
      mechanism, we can try to resolve this by relocating the MSI-X data
      structures, either by creating a new PCI BAR or extending an existing
      BAR and updating the MSI-X capability for the new location.  There's
      even a very slim chance that this could benefit devices which do not
      adhere to the PCI spec alignment guidelines on x86_64 systems.
      
      This new x-msix-relocation option accepts the following choices:
      
        off: Disable MSI-X relocation, use native device config (default)
        auto: Use a known good combination for the platform/device (none yet)
        bar0..bar5: Specify the target BAR for MSI-X data structures
      
      If compatible, the target BAR will either be created or extended and
      the new portion will be used for MSI-X emulation.
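      
      A usage sketch (the host address is a placeholder; bar5 matches the
      second example discussed below):
      
          -device vfio-pci,host=02:00.0,x-msix-relocation=bar5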
      
      The first obvious user question with this option is how to determine
      whether a given platform and device might benefit from this option.
      In most cases, the answer is that it won't, especially on x86_64.
      Devices often dedicate an entire BAR to MSI-X and therefore no
      performance sensitive registers overlap the MSI-X area.  Take for
      example:
      
      # lspci -vvvs 0a:00.0
      0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
      	...
      	Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
      	Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
      	...
      	Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
      		Vector table: BAR=3 offset=00000000
      		PBA: BAR=3 offset=00002000
      
      This device uses the 16K bar3 for MSI-X with the vector table at
      offset zero and the pending bits array at offset 8K, fully honoring
      the PCI spec alignment guidance.  The data sheet specifically refers
      to this as an MSI-X BAR.  This device would not see a benefit from
      MSI-X relocation regardless of the platform, regardless of the page
      size.
      
      However, here's another example:
      
      # lspci -vvvs 02:00.0
      02:00.0 Serial Attached SCSI controller: xxxxxxxx
      	...
      	Region 0: I/O ports at c000 [size=256]
      	Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
      	Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
      	...
      	Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
      		Vector table: BAR=1 offset=0000e000
      		PBA: BAR=1 offset=0000f000
      
      Here the MSI-X data structures are placed on separate 4K pages at the
      end of a 64KB BAR.  If our host page size is 4K, we're likely fine,
      but at 64KB page size, MSI-X emulation at that location prevents the
      entire BAR from being directly mapped into the VM address space.
      Overlapping performance sensitive registers then starts to be a very
      likely scenario on such a platform.  At this point, the user could
      enable tracing on vfio_region_read and vfio_region_write to determine
      more conclusively if device accesses are being trapped through QEMU.
      
      Upon finding a device and platform in need of MSI-X relocation, the
      next problem is how to choose target PCI BAR to host the MSI-X data
      structures.  A few key rules to keep in mind for this selection
      include:
      
       * There are only 6 BAR slots, bar0..bar5
       * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
       * PCI BARs are always a power of 2 in size, extending == doubling
       * The maximum size of a 32-bit BAR is 2GB
       * MSI-X data structures must reside in an MMIO BAR
      
      Using these rules, we can evaluate each BAR of the second example
      device above as follows:
      
       bar0: I/O port BAR, incompatible with MSI-X tables
       bar1: BAR could be extended, incurring another 64KB of MMIO
       bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
       bar3: BAR could be extended, incurring another 256KB of MMIO
       bar4: Unavailable, bar3 is 64-bit, this register is used by bar3
       bar5: Available, empty BAR, minimum additional MMIO
      
      A secondary optimization we might wish to make in relocating MSI-X
      is to minimize the additional MMIO required for the device, therefore
      we might test the available choices in order of preference as bar5,
      bar1, and finally bar3.  The original proposal for this feature
      included an 'auto' option which would choose bar5 in this case, but
      various drivers have been found that make assumptions about the
      properties of the "first" BAR or the size of BARs such that there
      appears to be no foolproof automatic selection available, requiring
      known good combinations to be sourced from users.  This patch is
      pre-enabled for an 'auto' selection making use of a validated lookup
      table, but no entries are yet identified.
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Tested-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      89d5202e
    • vfio/pci: Emulate BARs · 04f336b0
      Alex Williamson authored
      The kernel provides similar emulation of PCI BAR register access to
      QEMU, so up until now we've used that for things like BAR sizing and
      storing the BAR address.  However, if we intend to resize BARs or add
      BARs that don't exist on the physical device, we need to switch to the
      pure QEMU emulation of the BAR.
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Tested-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      04f336b0
    • vfio/pci: Add base BAR MemoryRegion · 3a286732
      Alex Williamson authored
      Add one more layer to our stack of MemoryRegions, this base region
      allows us to register BARs independently of the vfio region or to
      extend the size of BARs which do map to a region.  This will be
      useful when we want hypervisor defined BARs or sections of BARs,
      for purposes such as relocating MSI-X emulation.  We therefore call
      msix_init() based on this new base MemoryRegion, while the quirks,
      which only modify regions still operate on those sub-MemoryRegions.
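      
      Roughly (a sketch, not the exact patch), each BAR now has a container
      MemoryRegion with the vfio region mapped inside it:
      
          /* sketch: container "base" BAR with the vfio region as a subregion */
          memory_region_init(&bar->mr, OBJECT(vdev), "vfio-bar-base", bar->size);
          if (bar->region.size) {
              memory_region_add_subregion(&bar->mr, 0, bar->region.mem);
          }
          pci_register_bar(&vdev->pdev, nr, bar->type, &bar->mr);
          /* msix_init() can now target &bar->mr, even beyond the vfio region */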
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      3a286732
  19. 06 Dec 2017, 1 commit
  20. 15 Oct 2017, 1 commit
    • pci: Add interface names to hybrid PCI devices · a5fa336f
      Eduardo Habkost authored
      The following devices support both PCI Express and Conventional
      PCI, by including special code to handle the QEMU_PCI_CAP_EXPRESS
      flag and/or conditional pcie_endpoint_cap_init() calls:
      
      * vfio-pci (is_express=1, but legacy PCI handled by
        vfio_populate_device())
      * vmxnet3 (is_express=0, but PCIe handled by vmxnet3_realize())
      * pvscsi (is_express=0, but PCIe handled by pvscsi_realize())
      * virtio-pci (is_express=0, but PCIe handled by
        virtio_pci_dc_realize(), and additional legacy PCI code at
        virtio_pci_realize())
      * base-xhci (is_express=1, but the pcie_endpoint_cap_init() call
        is conditional on pci_bus_is_express(dev->bus))
        * Note that xhci does not clear QEMU_PCI_CAP_EXPRESS like the
          other hybrid devices
      
      Cc: Dmitry Fleytman <dmitry@daynix.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Gerd Hoffmann <kraxel@redhat.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      a5fa336f
  21. 04 Oct 2017, 3 commits
    • vfio/pci: Add NVIDIA GPUDirect Cliques support · dfbee78d
      Alex Williamson authored
      NVIDIA has defined a specification for creating GPUDirect "cliques",
      where devices with the same clique ID support direct peer-to-peer DMA.
      When running on bare-metal, tools like NVIDIA's p2pBandwidthLatencyTest
      (part of cuda-samples) determine which GPUs can support peer-to-peer
      based on chipset and topology.  When running in a VM, these tools have
      no visibility to the physical hardware support or topology.  This
      option allows the user to specify hints via a vendor defined
      capability.  For instance:
      
        <qemu:commandline>
          <qemu:arg value='-set'/>
          <qemu:arg value='device.hostdev0.x-nv-gpudirect-clique=0'/>
          <qemu:arg value='-set'/>
          <qemu:arg value='device.hostdev1.x-nv-gpudirect-clique=1'/>
          <qemu:arg value='-set'/>
          <qemu:arg value='device.hostdev2.x-nv-gpudirect-clique=1'/>
        </qemu:commandline>
      
      This enables two cliques.  The first is a singleton clique with ID 0,
      for the first hostdev defined in the XML (note that since cliques
      define peer-to-peer sets, singleton cliques offer no benefit).  The
      subsequent two hostdevs are both added to clique ID 1, indicating
      peer-to-peer is possible between these devices.
      
      QEMU only provides validation that the clique ID is valid and applied
      to an NVIDIA graphics device; any validation that the resulting
      cliques are functional and valid is the user's responsibility.  The
      NVIDIA specification allows a 4-bit clique ID, thus valid values are
      0-15.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      dfbee78d
    • vfio/pci: Add virtual capabilities quirk infrastructure · e3f79f3b
      Alex Williamson authored
      If the hypervisor needs to add purely virtual capabilities, give us a
      hook through quirks to do that.  Note that we determine the maximum
      size for a capability based on the physical device; if we insert a
      virtual capability, that can change.  Therefore, if the maximum size is
      smaller after adding virtual capabilities, use that.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      e3f79f3b
    • vfio/pci: Do not unwind on error · 5b31c822
      Alex Williamson authored
      If vfio_add_std_cap() fails, then jumping to the 'out' label prepends
      irrelevant errors for capabilities we haven't attempted to add as we
      unwind our recursive stack.  Just return the error.
      
      Fixes: 7ef165b9 ("vfio/pci: Pass an error object to vfio_add_capabilities")
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      5b31c822
  22. 27 Jul 2017, 1 commit
  23. 11 Jul 2017, 2 commits
    • vfio/pci: Fixup v0 PCIe capabilities · 47985727
      Alex Williamson authored
      Intel 82599 VFs report a PCIe capability version of 0, which is
      invalid.  The earliest version of the PCIe spec used version 1.  This
      causes Windows to fail startup on the device and it will be disabled
      with error code 10.  Our choices are either to drop the PCIe cap on
      such devices, which has the side effect of likely preventing the guest
      from discovering any extended capabilities, or performing a fixup to
      update the capability to the earliest valid version.  This implements
      the latter.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      47985727
    • vfio: Test realized when using VFIOGroup.device_list iterator · 7da624e2
      Alex Williamson authored
      VFIOGroup.device_list is effectively our reference tracking mechanism
      such that we can tear down a group when all of the device references
      are removed.  However, we also use this list from our machine reset
      handler for processing resets that affect multiple devices.  Generally
      device removals are fully processed (exitfn + finalize) when this
      reset handler is invoked, however if the removal is triggered via
      another reset handler (piix4_reset->acpi_pcihp_reset) then the device
      exitfn may run, but not finalize.  In this case we hit asserts when
      we start trying to access PCI helpers since much of the PCI state of
      the device is released.  To resolve this, add a pointer to the Object
      DeviceState in our common base-device and skip non-realized devices
      as we iterate.
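      
      A minimal sketch of the iteration guard (field names approximate):
      
          VFIODevice *vbasedev;
      
          QLIST_FOREACH(vbasedev, &group->device_list, next) {
              if (!vbasedev->dev || !vbasedev->dev->realized) {
                  continue;   /* exitfn has run but finalize has not */
              }
              /* safe to use PCI helpers on this device */
          }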
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      7da624e2