1. 22 Feb 2019, 1 commit
    • vfio/common: Work around kernel overflow bug in DMA unmap · 567d7d3e
      Committed by Alex Williamson
      A kernel bug was introduced in v4.15 via commit 71a7d3d78e3c which
      adds a test for address space wrap-around in the vfio DMA unmap path.
      Unfortunately due to overflow, the kernel detects an unmap of the last
      page in the 64-bit address space as a wrap-around.  In QEMU, a Q35
      guest with VT-d emulation and guest IOMMU enabled will attempt to make
      such an unmap request during VM system reset, triggering an error:
      
        qemu-kvm: VFIO_UNMAP_DMA: -22
        qemu-kvm: vfio_dma_unmap(0x561f059948f0, 0xfef00000, 0xffffffff01100000) = -22 (Invalid argument)
      
      Here the IOVA start address (0xfef00000) and the size parameter
      (0xffffffff01100000) add to exactly 2^64, triggering the bug.  A
      kernel fix is queued for the Linux v5.0 release to address this.
      
      This patch implements a workaround that retries the unmap, excluding
      the final page of the range, when we detect an unmap failure matching
      the characteristics of this issue.  This is expected to be a safe and
      complete workaround, as the VT-d address space does not extend to the
      full 64-bit space and therefore the last page should never be mapped.
      
      This workaround can be removed once all kernels with this bug are
      sufficiently deprecated.
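      The arithmetic behind the false positive, and the retry that sidesteps
      it, can be sketched as follows (a standalone illustration with
      hypothetical function names, not the actual QEMU patch):

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      #define PAGE_SIZE 0x1000ULL

      /* The kernel's wrap-around test trips when iova + size overflows to
       * exactly 0, i.e. the range ends on the last byte of the 64-bit
       * address space. */
      static bool unmap_hits_kernel_bug(uint64_t iova, uint64_t size)
      {
          return size > 0 && iova + size == 0;
      }

      /* Workaround: retry the unmap with the final page excluded. */
      static uint64_t workaround_size(uint64_t size)
      {
          return size - PAGE_SIZE;
      }

      int main(void)
      {
          /* The failing request from the error above: the operands add to
           * exactly 2^64, which wraps to 0. */
          uint64_t iova = 0xfef00000ULL;
          uint64_t size = 0xffffffff01100000ULL;
          assert(unmap_hits_kernel_bug(iova, size));

          /* Dropping the last page makes iova + size != 0, avoiding the
           * kernel's overflowing check. */
          assert(!unmap_hits_kernel_bug(iova, workaround_size(size)));
          return 0;
      }
      ```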
      
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
      Reported-by: Pei Zhang <pezhang@redhat.com>
      Debugged-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  2. 24 Jan 2019, 1 commit
    • trace: forbid use of %m in trace event format strings · 772f1b37
      Committed by Daniel P. Berrangé
      The '%m' format instructs glibc's printf()/syslog() implementation to
      insert the contents of strerror(errno). Since this is a glibc extension
      it should generally be avoided in QEMU due to need for portability to a
      variety of platforms.
      
      Even though vfio is Linux-only code that could otherwise use "%m", it
      must still be avoided in trace-events files because several of the
      backends do not use the format string and so this error information is
      invisible to them.
      
      The errno string value should be given as an explicit trace argument
      instead, making it accessible to all backends. This also allows it to
      work correctly with future patches that use the format string with
      systemtap's simple printf code.
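      A minimal sketch of the portable pattern: instead of relying on glibc
      expanding %m, pass the strerror() string as an explicit %s argument
      (standalone example, not QEMU's trace API):

      ```c
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>

      /* Non-portable: "op failed: %m" only expands with glibc's printf.
       * Portable: hand the error string to the format as a normal %s arg,
       * visible to any backend that receives the arguments. */
      static void format_error(char *buf, size_t len, const char *op, int err)
      {
          snprintf(buf, len, "%s failed: %s", op, strerror(err));
      }

      int main(void)
      {
          char buf[128];
          format_error(buf, sizeof(buf), "vfio_dma_map", EINVAL);
          printf("%s\n", buf);
          return 0;
      }
      ```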
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
      Message-id: 20190123120016.4538-4-berrange@redhat.com
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
  3. 17 Aug 2018, 1 commit
    • vfio/ccw/pci: Allow devices to opt-in for ballooning · 238e9172
      Committed by Alex Williamson
      If a vfio assigned device makes use of a physical IOMMU, then memory
      ballooning is necessarily inhibited due to the page pinning, the lack
      of page-level granularity at the IOMMU, and the lack of sufficient
      notifiers to both remove the page on balloon inflation and add it
      back on deflation.
      However, not all devices are backed by a physical IOMMU.  In the case
      of mediated devices, if a vendor driver is well synchronized with the
      guest driver, such that only pages actively used by the guest driver
      are pinned by the host mdev vendor driver, then there should be no
      overlap between pages available for the balloon driver and pages
      actively in use by the device.  Under these conditions, ballooning
      should be safe.
      
      vfio-ccw devices are always mediated devices and always operate under
      the constraints above.  Therefore we can consider all vfio-ccw devices
      as balloon compatible.
      
      The situation is far from straightforward with vfio-pci.  These
      devices can be physical devices with physical IOMMU backing or
      mediated devices where it is unknown whether a physical IOMMU is in
      use or whether the vendor driver is well synchronized to the working
      set of the guest driver.  The safest approach is therefore to assume
      all vfio-pci devices are incompatible with ballooning, but allow user
      opt-in should they have further insight into mediated devices.
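      The resulting policy reduces to a simple predicate (illustrative
      sketch; the type and function names are hypothetical, not QEMU's
      actual code):

      ```c
      #include <assert.h>
      #include <stdbool.h>

      typedef enum { VFIO_DEV_CCW, VFIO_DEV_PCI } VFIODevType;

      /* vfio-ccw is always mediated, hence always balloon compatible;
       * vfio-pci defaults to incompatible unless the user opts in. */
      static bool vfio_balloon_compatible(VFIODevType type, bool user_opt_in)
      {
          return type == VFIO_DEV_CCW || user_opt_in;
      }

      int main(void)
      {
          assert(vfio_balloon_compatible(VFIO_DEV_CCW, false));
          assert(!vfio_balloon_compatible(VFIO_DEV_PCI, false));
          assert(vfio_balloon_compatible(VFIO_DEV_PCI, true));
          return 0;
      }
      ```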
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  4. 05 Jun 2018, 2 commits
    • vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly · 2b1dbd0d
      Committed by Alex Williamson
      With vfio ioeventfd support, we can program vfio-pci to perform a
      specified BAR write when an eventfd is triggered.  This allows the
      KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
      userspace handling for these events.  On the same micro-benchmark
      where the ioeventfd got us to almost 90% of performance versus
      disabling the GeForce quirks, this gets us to within 95%.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • vfio/quirks: ioeventfd quirk acceleration · c958c51d
      Committed by Alex Williamson
      The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
      in device MMIO space.  Normally PCI config space is considered a slow
      path and further optimization is unnecessary, however NVIDIA uses a
      register here to enable the MSI interrupt to re-trigger.  Exiting to
      QEMU for this MSI-ACK handling can therefore rate limit our interrupt
      handling.  Fortunately the MSI-ACK write is easily detected, since the
      quirk MemoryRegion otherwise sees very few accesses; simply looking
      for consecutive writes with the same data is sufficient, and a
      threshold of 10 consecutive writes with the same data and size is
      arbitrarily chosen here.  We configure the KVM ioeventfd with data
      match, so there's
      no risk of triggering for the wrong data or size, but we do risk that
      pathological driver behavior might consume all of QEMU's file
      descriptors, so we cap ourselves to 10 ioeventfds for this purpose.
      
      In support of the above, generic ioeventfd infrastructure is added
      for vfio quirks.  This automatically initializes an ioeventfd list
      per quirk, disables and frees ioeventfds on exit, and allows
      ioeventfds marked as dynamic to be dropped on device reset.  The
      rationale for this latter feature is that useful ioeventfds may
      depend on specific driver behavior and since we necessarily place a
      cap on our use of ioeventfds, a machine reset is a reasonable point
      at which to assume a new driver and re-profile.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  5. 06 Apr 2018, 1 commit
    • vfio: Use a trace point when a RAM section cannot be DMA mapped · 5c086005
      Committed by Eric Auger
      Commit 567b5b30 ("vfio/pci: Relax DMA map errors for MMIO regions")
      added an error message if a passed memory section address or size
      is not aligned to the page size and thus cannot be DMA mapped.
      
      This patch fixes the trace by printing the region name and the
      memory region section offset within the address space (instead of
      offset_within_region).
      
      We also turn the error_report into a trace event.  Indeed, in some
      cases the messages can confuse non-expert end-users and suggest the
      use case does not work (whereas it works as before).
      
      This is the case where a BAR is successively mapped at different
      GPAs and its sections are not compatible with dma map. The listener
      is called several times and traces are issued for each intermediate
      mapping.  The end-user cannot easily match those GPAs against the
      final GPA output by lspci.  So let's keep that information for
      informed users.  In the mid term, the plan is to advise the user
      about BAR relocation relevance.
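      The underlying check is a simple alignment test on the section's
      offset and size within the address space (an illustrative sketch
      under that assumption, not the listener code itself):

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      /* A RAM section can only be DMA mapped if both its start address in
       * the address space and its size are multiples of the host page
       * size; otherwise the map request is skipped and traced. */
      static bool section_dma_mappable(uint64_t offset_within_address_space,
                                       uint64_t size, uint64_t page_size)
      {
          return (offset_within_address_space % page_size) == 0 &&
                 (size % page_size) == 0;
      }

      int main(void)
      {
          /* A 4K-aligned section maps fine; a sub-page offset does not. */
          assert(section_dma_mappable(0xfebd0000, 0x20000, 0x1000));
          assert(!section_dma_mappable(0xfebd0800, 0x20000, 0x1000));
          return 0;
      }
      ```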
      
      Fixes: 567b5b30 ("vfio/pci: Relax DMA map errors for MMIO regions")
      Signed-off-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  6. 07 Feb 2018, 2 commits
    • vfio/pci: Allow relocating MSI-X MMIO · 89d5202e
      Committed by Alex Williamson
      Recently proposed vfio-pci kernel changes (v4.16) remove the
      restriction preventing userspace from mmap'ing PCI BARs in areas
      overlapping the MSI-X vector table.  This change is primarily intended
      to benefit host platforms which make use of system page sizes larger
      than the PCI spec recommendation for alignment of MSI-X data
      structures (i.e. not x86_64).  In the case of POWER systems, the SPAPR
      spec requires the VM to program MSI-X using hypercalls, rendering the
      MSI-X vector table unused in the VM view of the device.  However,
      ARM64 platforms also support 64KB pages and rely on QEMU emulation of
      MSI-X.  Regardless of the kernel driver allowing mmaps overlapping
      the MSI-X vector table, emulation of the MSI-X vector table also
      prevents direct mapping of device MMIO spaces overlapping this page.
      Thanks to the fact that PCI devices have a standard self discovery
      mechanism, we can try to resolve this by relocating the MSI-X data
      structures, either by creating a new PCI BAR or extending an existing
      BAR and updating the MSI-X capability for the new location.  There's
      even a very slim chance that this could benefit devices which do not
      adhere to the PCI spec alignment guidelines on x86_64 systems.
      
      This new x-msix-relocation option accepts the following choices:
      
        off: Disable MSI-X relocation, use native device config (default)
        auto: Use a known good combination for the platform/device (none yet)
        bar0..bar5: Specify the target BAR for MSI-X data structures
      
      If compatible, the target BAR will either be created or extended and
      the new portion will be used for MSI-X emulation.
      
      The first obvious user question with this option is how to determine
      whether a given platform and device might benefit from this option.
      In most cases, the answer is that it won't, especially on x86_64.
      Devices often dedicate an entire BAR to MSI-X and therefore no
      performance sensitive registers overlap the MSI-X area.  Take for
      example:
      
      # lspci -vvvs 0a:00.0
      0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
      	...
      	Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
      	Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
      	...
      	Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
      		Vector table: BAR=3 offset=00000000
      		PBA: BAR=3 offset=00002000
      
      This device uses the 16K bar3 for MSI-X with the vector table at
      offset zero and the pending bit array at offset 8K, fully honoring
      the PCI spec alignment guidance.  The data sheet specifically refers
      to this as an MSI-X BAR.  This device would not see a benefit from
      MSI-X relocation regardless of the platform, regardless of the page
      size.
      
      However, here's another example:
      
      # lspci -vvvs 02:00.0
      02:00.0 Serial Attached SCSI controller: xxxxxxxx
      	...
      	Region 0: I/O ports at c000 [size=256]
      	Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
      	Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
      	...
      	Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
      		Vector table: BAR=1 offset=0000e000
      		PBA: BAR=1 offset=0000f000
      
      Here the MSI-X data structures are placed on separate 4K pages at the
      end of a 64KB BAR.  If our host page size is 4K, we're likely fine,
      but at 64KB page size, MSI-X emulation at that location prevents the
      entire BAR from being directly mapped into the VM address space.
      Overlapping performance-sensitive registers then becomes a very
      likely scenario on such a platform.  At this point, the user could
      enable tracing on vfio_region_read and vfio_region_write to determine
      more conclusively if device accesses are being trapped through QEMU.
      
      Upon finding a device and platform in need of MSI-X relocation, the
      next problem is how to choose a target PCI BAR to host the MSI-X data
      structures.  A few key rules to keep in mind for this selection
      include:
      
       * There are only 6 BAR slots, bar0..bar5
       * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
       * PCI BARs are always a power of 2 in size, extending == doubling
       * The maximum size of a 32-bit BAR is 2GB
       * MSI-X data structures must reside in an MMIO BAR
      
      Using these rules, we can evaluate each BAR of the second example
      device above as follows:
      
       bar0: I/O port BAR, incompatible with MSI-X tables
       bar1: BAR could be extended, incurring another 64KB of MMIO
       bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
       bar3: BAR could be extended, incurring another 256KB of MMIO
       bar4: Unavailable, bar3 is 64-bit, this register is used by bar3
       bar5: Available, empty BAR, minimum additional MMIO
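      A sketch of that preference order as a cost comparison: among
      eligible MMIO BARs, pick the one whose use adds the least MMIO (an
      empty BAR adds only the minimum; extending an existing BAR doubles
      it). Names and the cost model here are illustrative assumptions:

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      typedef struct {
          bool io;          /* I/O port BAR: cannot host MSI-X */
          bool upper_half;  /* consumed as upper half of a 64-bit BAR */
          uint64_t size;    /* 0 for an empty (unimplemented) BAR */
      } Bar;

      /* Extending a BAR doubles it, so its current size is the added MMIO;
       * an empty BAR costs the least. Returns the index of the cheapest
       * eligible BAR, or -1 if none qualifies. */
      static int pick_msix_bar(const Bar bars[6])
      {
          int best = -1;
          uint64_t best_cost = UINT64_MAX;
          for (int i = 0; i < 6; i++) {
              if (bars[i].io || bars[i].upper_half) {
                  continue;
              }
              uint64_t cost = bars[i].size; /* 0 if empty, else doubling */
              if (cost < best_cost) {
                  best_cost = cost;
                  best = i;
              }
          }
          return best;
      }

      int main(void)
      {
          /* The SAS controller example: bar0 I/O, bar1 64K (64-bit),
           * bar3 256K (64-bit), bar5 empty. */
          Bar bars[6] = {
              { .io = true },
              { .size = 0x10000 },
              { .upper_half = true },
              { .size = 0x40000 },
              { .upper_half = true },
              { .size = 0 },
          };
          assert(pick_msix_bar(bars) == 5); /* empty bar5 is cheapest */
          return 0;
      }
      ```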
      
      A secondary optimization we might wish to make in relocating MSI-X
      is to minimize the additional MMIO required for the device, therefore
      we might test the available choices in order of preference as bar5,
      bar1, and finally bar3.  The original proposal for this feature
      included an 'auto' option which would choose bar5 in this case, but
      various drivers have been found that make assumptions about the
      properties of the "first" BAR or the size of BARs such that there
      appears to be no foolproof automatic selection available, requiring
      known good combinations to be sourced from users.  This patch is
      pre-enabled for an 'auto' selection making use of a validated lookup
      table, but no entries are yet identified.
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Tested-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • vfio/spapr: Use iommu memory region's get_attr() · 07bc681a
      Committed by Alexey Kardashevskiy
      In order to enable TCE operations support in KVM, we have to inform
      KVM about VFIO groups being attached to specific LIOBNs.  KVM
      already knows about VFIO groups; the only bit missing is which
      in-kernel TCE table (the one with user visible TCEs) should update
      the attached groups.  There is a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE
      attribute of the VFIO KVM device which receives a groupfd/tablefd couple.
      
      This uses a new memory_region_iommu_get_attr() helper to get the IOMMU fd
      and calls KVM to establish the link.
      
      As get_attr() is not implemented yet, this should cause no behavioural
      change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  7. 01 Aug 2017, 1 commit
  8. 31 Jul 2017, 1 commit
  9. 18 Feb 2017, 1 commit
  10. 01 Feb 2017, 1 commit
  11. 18 Oct 2016, 1 commit
  12. 12 Aug 2016, 1 commit
  13. 05 Jul 2016, 2 commits
    • vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) · 2e4109de
      Committed by Alexey Kardashevskiy
      The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
      management.  This adds the ability for VFIO common code to dynamically
      allocate/remove DMA windows in the host kernel when a new VFIO
      container is added/removed.
      
      This adds a helper to vfio_listener_region_add which issues the
      VFIO_IOMMU_SPAPR_TCE_CREATE ioctl and adds the just-created IOMMU
      window into the host IOMMU list; the opposite action is taken in
      vfio_listener_region_del.
      
      When creating a new window, this uses a heuristic to decide on the
      number of TCE table levels.
      
      This should cause no guest visible change in behavior.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      [dwg: Added some casts to prevent printf() warnings on certain targets
       where the kernel headers' __u64 doesn't match uint64_t or PRIx64]
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
    • vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) · 318f67ce
      Committed by Alexey Kardashevskiy
      This makes use of the new "memory registering" feature.  The idea is
      to give userspace the ability to notify the host kernel about pages
      which are going to be used for DMA.  Having this information, the host
      kernel can pin them all once per user process, do locked pages
      accounting (once) and not spend time doing that in real time, with
      possible failures which cannot be handled nicely in some cases.
      
      This adds a prereg memory listener which listens on address_space_memory
      and notifies a VFIO container about memory which needs to be
      pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
      
      The feature is only enabled for SPAPR IOMMU v2.  Host kernel changes
      are required.  Since v2 does not need/support VFIO_IOMMU_ENABLE, this
      does not call it when v2 is detected and enabled.
      
      This requires guest RAM blocks to be host page size aligned; however
      this is not new, as KVM already requires memory slots to be host page
      size aligned.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [dwg: Fix compile error on 32-bit host]
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
  14. 01 Jul 2016, 1 commit
    • vfio/pci: Hide SR-IOV capability · e37dac06
      Committed by Alex Williamson
      The kernel currently exposes the SR-IOV capability as read-only
      through vfio-pci.  This is sufficient to protect the host kernel, but
      has the potential to confuse guests without further virtualization.
      In particular, OVMF tries to size the VF BARs and comes up with absurd
      results, ending with an assert.  There's not much point in adding
      virtualization to a read-only capability, so we simply hide it for
      now.  If the kernel ever enables SR-IOV virtualization, we should
      easily be able to test it through VF BAR sizing or explicit flags.
      
      Testing whether we should parse extended capabilities is also pulled
      into the function to keep these assumptions in one place.
      Tested-by: Laszlo Ersek <lersek@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  15. 21 Jun 2016, 1 commit