1. 31 10月, 2016 4 次提交
    • Y
      vfio: Add support for mmapping sub-page MMIO BARs · 95251725
      Yongji Xie 提交于
      Now the kernel commit 05f0c03fbac1 ("vfio-pci: Allow to mmap
      sub-page MMIO BARs if the mmio page is exclusive") allows VFIO
      to mmap sub-page BARs. This is the corresponding QEMU patch.
      With those patches applied, we could passthrough sub-page BARs
      to guest, which can help to improve IO performance for some devices.
      
      In this patch, we expand MemoryRegions of these sub-page
      MMIO BARs to PAGE_SIZE in vfio_pci_write_config(), so that
      the BARs could be passed to KVM ioctl KVM_SET_USER_MEMORY_REGION
      with a valid size. The expanding size will be recovered when
      the base address of sub-page BAR is changed and not page aligned
      any more in guest. And we also set the priority of these BARs'
      memory regions to zero in case of overlap with BARs which share
      the same page with sub-page BARs in guest.
      Signed-off-by: NYongji Xie <xyjxie@linux.vnet.ibm.com>
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      95251725
    • I
      vfio/pci: fix out-of-sync BAR information on reset · a52a4c47
      Ido Yariv 提交于
      When a PCI device is reset, pci_do_device_reset resets all BAR addresses
      in the relevant PCIDevice's config buffer.
      
      The VFIO configuration space stays untouched, so the guest OS may choose
      to skip restoring the BAR addresses as they would seem intact. The PCI
      device may be left non-operational.
      One example of such a scenario is when the guest exits S3.
      
      Fix this by resetting the BAR addresses in the VFIO configuration space
      as well.
      Signed-off-by: NIdo Yariv <ido@wizery.com>
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      a52a4c47
    • A
      vfio: Handle zero-length sparse mmap ranges · 24acf72b
      Alex Williamson 提交于
      As reported in the link below, user has a PCI device with a 4KB BAR
      which contains the MSI-X table.  This seems to hit a corner case in
      the kernel where the region reports being mmap capable, but the sparse
      mmap information reports a zero sized range.  It's not entirely clear
      that the kernel is incorrect in doing this, but regardless, we need
      to handle it.  To do this, fill our mmap array only with non-zero
      sized sparse mmap entries and add an error return from the function
      so we can tell the difference between nr_mmaps being zero based on
      sparse mmap info vs lack of sparse mmap info.
      
      NB, this doesn't actually change the behavior of the device, it only
      removes the scary "Failed to mmap ... Performance may be slow" error
      message.  We cannot currently create an mmap over the MSI-X table.
      
      Link: http://lists.nongnu.org/archive/html/qemu-discuss/2016-10/msg00009.htmlSigned-off-by: NAlex Williamson <alex.williamson@redhat.com>
      24acf72b
    • A
      memory: Replace skip_dump flag with "ram_device" · 21e00fa5
      Alex Williamson 提交于
      Setting skip_dump on a MemoryRegion allows us to modify one specific
      code path, but the restriction we're trying to address encompasses
      more than that.  If we have a RAM MemoryRegion backed by a physical
      device, it not only restricts our ability to dump that region, but
      also affects how we should manipulate it.  Here we recognize that
      MemoryRegions do not change to sometimes allow dumps and other times
      not, so we replace setting the skip_dump flag with a new initializer
      so that we know exactly the type of region to which we're applying
      this behavior.
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      21e00fa5
  2. 18 10月, 2016 19 次提交
  3. 27 9月, 2016 1 次提交
    • P
      memory: introduce IOMMUNotifier and its caps · cdb30812
      Peter Xu 提交于
      IOMMU Notifier list is used for notifying IO address mapping changes.
      Currently VFIO is the only user.
      
      However it is possible that future consumer like vhost would like to
      only listen to part of its notifications (e.g., cache invalidations).
      
      This patch introduced IOMMUNotifier and IOMMUNotfierFlag bits for a
      finer grained control of it.
      
      IOMMUNotifier contains a bitfield for the notify consumer describing
      what kind of notification it is interested in. Currently two kinds of
      notifications are defined:
      
      - IOMMU_NOTIFIER_MAP:    for newly mapped entries (additions)
      - IOMMU_NOTIFIER_UNMAP:  for entries to be removed (cache invalidates)
      
      When registering the IOMMU notifier, we need to specify one or multiple
      types of messages to listen to.
      
      When notifications are triggered, its type will be checked against the
      notifier's type bits, and only notifiers with registered bits will be
      notified.
      
      (For any IOMMU implementation, an in-place mapping change should be
       notified with an UNMAP followed by a MAP.)
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Message-Id: <1474606948-14391-2-git-send-email-peterx@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cdb30812
  4. 16 9月, 2016 1 次提交
    • D
      vfio/pci: Fix regression in MSI routing configuration · 6d17a018
      David Gibson 提交于
      d1f6af6a "kvm-irqchip: simplify kvm_irqchip_add_msi_route" was a cleanup
      of kvmchip routing configuration, that was mostly intended for x86.
      However, it also contains a subtle change in behaviour which breaks EEH[1]
      error recovery on certain VFIO passthrough devices on spapr guests.  So far
      it's only been seen on a BCM5719 NIC on a POWER8 server, but there may be
      other hardware with the same problem.  It's also possible there could be
      circumstances where it causes a bug on x86 as well, though I don't know of
      any obvious candidates.
      
      Prior to d1f6af6a, both vfio_msix_vector_do_use() and
      vfio_add_kvm_msi_virq() used msg == NULL as a special flag to mark this
      as the "dummy" vector used to make the host hardware state sync with the
      guest expected hardware state in terms of MSI configuration.
      
      Specifically that flag caused vfio_add_kvm_msi_virq() to become a no-op,
      meaning the dummy irq would always be delivered via qemu. d1f6af6a changed
      vfio_add_kvm_msi_virq() so it takes a vector number instead of the msg
      parameter, and determines the correct message itself.  The test for !msg
      was removed, and not replaced with anything there or in the caller.
      
      With an spapr guest which has a VFIO device, if an EEH error occurs on the
      host hardware, then the device will be isolated then reset.  This is a
      combination of host and guest action, mediated by some EEH related
      hypercalls.  I haven't fully traced the mechanics, but somehow installing
      the kvm irqchip route for the dummy irq on the BCM5719 means that after EEH
      reset and recovery, at least some irqs are no longer delivered to the
      guest.
      
      In particular, the guest never gets the link up event, and so the NIC is
      effectively dead.
      
      [1] EEH (Enhanced Error Handling) is an IBM POWER server specific PCI-*
          error reporting and recovery mechanism.  The concept is somewhat
          similar to PCI-E AER, but the details are different.
      
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1373802
      
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Gavin Shan <gwshan@au1.ibm.com>
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Cc: qemu-stable@nongnu.org
      Fixes: d1f6af6a ("kvm-irqchip: simplify kvm_irqchip_add_msi_route")
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      6d17a018
  5. 12 8月, 2016 1 次提交
  6. 08 8月, 2016 1 次提交
  7. 22 7月, 2016 2 次提交
  8. 19 7月, 2016 1 次提交
    • A
      vfio/pci: Hide ARI capability · 383a7af7
      Alex Williamson 提交于
      QEMU supports ARI on downstream ports and assigned devices may support
      ARI in their extended capabilities.  The endpoint ARI capability
      specifies the next function, such that the OS doesn't need to walk
      each possible function, however this next function is relative to the
      host, not the guest.  This leads to device discovery issues when we
      combine separate functions into virtual multi-function packages in a
      guest.  For example, SR-IOV VFs are not enumerated by simply probing
      the function address space, therefore the ARI next-function field is
      zero.  When we combine multiple VFs together as a multi-function
      device in the guest, the guest OS identifies ARI is enabled, relies on
      this next-function field, and stops looking for additional function
      after the first is found.
      
      Long term we should expose the ARI capability to the guest to enable
      configurations with more than 8 functions per slot, but this requires
      additional QEMU PCI infrastructure to manage the next-function field
      for multiple, otherwise independent devices.  In the short term,
      hiding this capability allows equivalent functionality to what we
      currently have on non-express chipsets.
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: NMarcel Apfelbaum <marcel@redhat.com>
      383a7af7
  9. 18 7月, 2016 1 次提交
  10. 12 7月, 2016 1 次提交
  11. 05 7月, 2016 4 次提交
    • C
      pci: Convert msi_init() to Error and fix callers to check it · 1108b2f8
      Cao jin 提交于
      msi_init() reports errors with error_report(), which is wrong
      when it's used in realize().
      
      Fix by converting it to Error.
      
      Fix its callers to handle failure instead of ignoring it.
      
      For those callers who don't handle the failure, it might happen:
      when user want msi on, but he doesn't get what he want because of
      msi_init fails silently.
      
      cc: Gerd Hoffmann <kraxel@redhat.com>
      cc: John Snow <jsnow@redhat.com>
      cc: Dmitry Fleytman <dmitry@daynix.com>
      cc: Jason Wang <jasowang@redhat.com>
      cc: Michael S. Tsirkin <mst@redhat.com>
      cc: Hannes Reinecke <hare@suse.de>
      cc: Paolo Bonzini <pbonzini@redhat.com>
      cc: Alex Williamson <alex.williamson@redhat.com>
      cc: Markus Armbruster <armbru@redhat.com>
      cc: Marcel Apfelbaum <marcel@redhat.com>
      Reviewed-by: NMarkus Armbruster <armbru@redhat.com>
      Signed-off-by: NCao jin <caoj.fnst@cn.fujitsu.com>
      Reviewed-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      1108b2f8
    • A
      vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) · 2e4109de
      Alexey Kardashevskiy 提交于
      New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
      This adds ability to VFIO common code to dynamically allocate/remove
      DMA windows in the host kernel when new VFIO container is added/removed.
      
      This adds a helper to vfio_listener_region_add which makes
      VFIO_IOMMU_SPAPR_TCE_CREATE ioctl and adds just created IOMMU into
      the host IOMMU list; the opposite action is taken in
      vfio_listener_region_del.
      
      When creating a new window, this uses heuristic to decide on the TCE table
      levels number.
      
      This should cause no guest visible change in behavior.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      [dwg: Added some casts to prevent printf() warnings on certain targets
       where the kernel headers' __u64 doesn't match uint64_t or PRIx64]
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      2e4109de
    • A
      vfio: Add host side DMA window capabilities · f4ec5e26
      Alexey Kardashevskiy 提交于
      There are going to be multiple IOMMUs per a container. This moves
      the single host IOMMU parameter set to a list of VFIOHostDMAWindow.
      
      This should cause no behavioral change and will be used later by
      the SPAPR TCE IOMMU v2 which will also add a vfio_host_win_del() helper.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      f4ec5e26
    • A
      vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) · 318f67ce
      Alexey Kardashevskiy 提交于
      This makes use of the new "memory registering" feature. The idea is
      to provide the userspace ability to notify the host kernel about pages
      which are going to be used for DMA. Having this information, the host
      kernel can pin them all once per user process, do locked pages
      accounting (once) and not spent time on doing that in real time with
      possible failures which cannot be handled nicely in some cases.
      
      This adds a prereg memory listener which listens on address_space_memory
      and notifies a VFIO container about memory which needs to be
      pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
      
      The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
      are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
      not call it when v2 is detected and enabled.
      
      This enforces guest RAM blocks to be host page size aligned; however
      this is not new as KVM already requires memory slots to be host page
      size aligned.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      [dwg: Fix compile error on 32-bit host]
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      318f67ce
  12. 01 7月, 2016 4 次提交