1. 24 Aug 2018 (1 commit)
  2. 21 Aug 2018 (1 commit)
    • vfio/spapr: Allow backing bigger guest IOMMU pages with smaller physical pages · c26bc185
      Authored by Alexey Kardashevskiy
      At the moment the PPC64/pseries guest only supports 4K/64K/16M IOMMU
      pages and the POWER8 CPU supports exactly the same set of page sizes,
      so so far things have worked fine.
      
      However, POWER9 supports a different set of sizes - 4K/64K/2M/1G - and
      the last two - 2M and 1G - are not even allowed in the paravirt interface
      (RTAS DDW), so we always end up using 64K IOMMU pages, although we could
      back the guest's 16MB IOMMU pages with 2MB pages on the host.
      
      This stores the supported host IOMMU page sizes in VFIOContainer and uses
      them later when creating a new DMA window. The system page size
      (normally 64k; 2M/16M/1G if hugepages are used) serves as the upper limit
      of the IOMMU page size.
      
      This changes the type of @pagesize to uint64_t as this is what
      memory_region_iommu_get_min_page_size() returns and clz64() takes.
      
      There should be no behavioral changes on platforms other than pseries.
      The guest will keep using the IOMMU page size selected by the PHB pagesize
      property as this only changes the underlying hardware TCE table
      granularity.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      c26bc185
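The page-size clamping this commit describes can be sketched as a pure function: given the host-supported IOMMU page-size mask and the guest-requested size, pick the largest supported size not exceeding the request. This is a minimal illustration, not QEMU's actual code; the function name and mask encoding (bit N set means 2^N bytes supported) are assumptions for the sketch.

```c
#include <stdint.h>

/* Illustrative sketch: choose the biggest host IOMMU page size that can
 * back a guest IOMMU page of the requested size. Bit N of the mask set
 * means a 2^N-byte page size is supported by the host. */
static uint64_t pick_iommu_pagesize(uint64_t host_pgsizes_mask, uint64_t requested)
{
    /* Drop all sizes larger than the request, then take the highest
     * remaining bit. */
    uint64_t usable = host_pgsizes_mask & ((requested << 1) - 1);
    if (!usable) {
        return 0; /* no compatible host page size */
    }
    return 1ULL << (63 - __builtin_clzll(usable));
}
```

With a POWER9-style mask of 4K/64K/2M/1G, a 16MB guest IOMMU page would be backed by 2MB host pages, matching the scenario in the message above.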
  3. 17 Aug 2018 (2 commits)
    • vfio/ccw/pci: Allow devices to opt-in for ballooning · 238e9172
      Authored by Alex Williamson
      If a vfio assigned device makes use of a physical IOMMU, then memory
      ballooning is necessarily inhibited due to the page pinning, the lack of
      page-level granularity at the IOMMU, and the lack of notifiers to both
      remove the page on balloon inflation and add it back on deflation.
      However, not all devices are backed by a physical IOMMU.  In the case
      of mediated devices, if a vendor driver is well synchronized with the
      guest driver, such that only pages actively used by the guest driver
      are pinned by the host mdev vendor driver, then there should be no
      overlap between pages available for the balloon driver and pages
      actively in use by the device.  Under these conditions, ballooning
      should be safe.
      
      vfio-ccw devices are always mediated devices and always operate under
      the constraints above.  Therefore we can consider all vfio-ccw devices
      as balloon compatible.
      
      The situation is far from straightforward with vfio-pci.  These
      devices can be physical devices with physical IOMMU backing or
      mediated devices where it is unknown whether a physical IOMMU is in
      use or whether the vendor driver is well synchronized to the working
      set of the guest driver.  The safest approach is therefore to assume
      all vfio-pci devices are incompatible with ballooning, but allow user
      opt-in should they have further insight into mediated devices.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      238e9172
    • vfio: Inhibit ballooning based on group attachment to a container · c65ee433
      Authored by Alex Williamson
      We use a VFIOContainer to associate an AddressSpace to one or more
      VFIOGroups.  The VFIOContainer represents the DMA context for that
      AddressSpace for those VFIOGroups and is synchronized to changes in
      that AddressSpace via a MemoryListener.  For IOMMU backed devices,
      maintaining the DMA context for a VFIOGroup generally involves
      pinning a host virtual address in order to create a stable host
      physical address and then mapping a translation from the associated
      guest physical address to that host physical address into the IOMMU.
      
      While the above maintains the VFIOContainer synchronized to the QEMU
      memory API of the VM, memory ballooning occurs outside of that API.
      Inflating the memory balloon (ie. cooperatively capturing pages from
      the guest for use by the host) simply uses MADV_DONTNEED to "zap"
      pages from QEMU's host virtual address space.  The page pinning and
      IOMMU mapping above remain in place, negating the host's ability to
      reuse the page, but the host virtual to host physical mapping of the
      page is invalidated outside of QEMU's memory API.
      
      When the balloon is later deflated, attempting to cooperatively
      return pages to the guest, the page is simply freed by the guest
      balloon driver, allowing it to be used in the guest and incurring a
      page fault when that occurs.  The page fault maps a new host physical
      page backing the existing host virtual address, meanwhile the
      VFIOContainer still maintains the translation to the original host
      physical address.  At this point the guest vCPU and any assigned
      devices will map different host physical addresses to the same guest
      physical address.  Badness.
      
      The IOMMU typically does not have page level granularity with which
      it can track this mapping without also incurring inefficiencies in
      using page size mappings throughout.  MMU notifiers in the host
      kernel also provide indicators for invalidating the mapping on
      balloon inflation, not for updating the mapping when the balloon is
      deflated.  For these reasons we assume a default behavior that the
      mapping of each VFIOGroup into the VFIOContainer is incompatible
      with memory ballooning and increment the balloon inhibitor to match
      the attached VFIOGroups.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      c65ee433
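The default policy above amounts to reference counting: each attached VFIOGroup bumps a global balloon inhibitor, and a detach drops it, so ballooning stays disabled while any group is attached. A minimal sketch (simplified stand-ins, not QEMU's actual qemu_balloon_inhibit() plumbing):

```c
#include <stdbool.h>

/* Global inhibitor count; ballooning is allowed only when it is zero. */
static int balloon_inhibit_count;

static void balloon_inhibit_sketch(bool state)
{
    balloon_inhibit_count += state ? 1 : -1;
}

static bool balloon_inhibited_sketch(void)
{
    return balloon_inhibit_count > 0;
}

/* Each VFIOGroup attach/detach adjusts the inhibitor to match. */
static void vfio_group_attach_sketch(void) { balloon_inhibit_sketch(true); }
static void vfio_group_detach_sketch(void) { balloon_inhibit_sketch(false); }
```

With two groups attached, detaching one still leaves ballooning inhibited; only when the last group detaches does the balloon become usable again.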
  4. 15 Jun 2018 (1 commit)
    • iommu: Add IOMMU index argument to notifier APIs · cb1efcf4
      Authored by Peter Maydell
      Add support for multiple IOMMU indexes to the IOMMU notifier APIs.
      When initializing a notifier with iommu_notifier_init(), the caller
      must pass the IOMMU index that it is interested in. When a change
      happens, the IOMMU implementation must pass
      memory_region_notify_iommu() the IOMMU index that has changed and
      that notifiers must be called for.
      
      IOMMUs which support only a single index don't need to change.
      Callers which only really support working with IOMMUs with a single
      index can use the result of passing MEMTXATTRS_UNSPECIFIED to
      memory_region_iommu_attrs_to_index().
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Message-id: 20180604152941.20374-3-peter.maydell@linaro.org
      cb1efcf4
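The per-index dispatch described above can be sketched with simplified types: each notifier records the IOMMU index it registered for, and a change notification is delivered only to notifiers whose index matches. The struct and function names are illustrative stand-ins, not QEMU's real IOMMUNotifier API.

```c
#include <stddef.h>

/* Simplified notifier: the iommu_idx field plays the role of the index
 * passed at iommu_notifier_init() time in the real API. */
typedef struct IOMMUNotifierSketch {
    int iommu_idx;  /* index this notifier is interested in */
    int notified;   /* how many events this notifier has received */
} IOMMUNotifierSketch;

/* Deliver a change on index changed_idx only to matching notifiers. */
static void notify_iommu_sketch(IOMMUNotifierSketch *notifiers, size_t n,
                                int changed_idx)
{
    for (size_t i = 0; i < n; i++) {
        if (notifiers[i].iommu_idx == changed_idx) {
            notifiers[i].notified++;
        }
    }
}
```

An IOMMU with a single index degenerates to the old behavior: every notifier registers index 0 and every event is delivered to all of them.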
  5. 31 May 2018 (1 commit)
  6. 06 Apr 2018 (1 commit)
    • vfio: Use a trace point when a RAM section cannot be DMA mapped · 5c086005
      Authored by Eric Auger
      Commit 567b5b30 ("vfio/pci: Relax DMA map errors for MMIO regions")
      added an error message if a passed memory section address or size
      is not aligned to the page size and thus cannot be DMA mapped.
      
      This patch fixes the trace by printing the region name and the
      memory region section offset within the address space (instead of
      offset_within_region).
      
      We also turn the error_report into a trace event. Indeed, in some
      cases the traces can confuse non-expert end-users into thinking
      the use case does not work (whereas it works as before).
      
      This is the case where a BAR is successively mapped at different
      GPAs and its sections are not compatible with DMA mapping. The listener
      is called several times and traces are issued for each intermediate
      mapping.  The end-user cannot easily match those GPAs against the
      final GPA output by lspci. So let's keep this information for
      informed users. In the mid term, the plan is to advise the user about
      BAR relocation relevance.
      
      Fixes: 567b5b30 ("vfio/pci: Relax DMA map errors for MMIO regions")
      Signed-off-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      5c086005
  7. 14 Mar 2018 (3 commits)
  8. 07 Feb 2018 (3 commits)
  9. 14 Dec 2017 (3 commits)
  10. 18 Jul 2017 (1 commit)
    • vfio-pci, ppc64/spapr: Reorder group-to-container attaching · 8c37faa4
      Authored by Alexey Kardashevskiy
      At the moment VFIO PCI device initialization works as follows:
      vfio_realize
      	vfio_get_group
      		vfio_connect_container
      			register memory listeners (1)
      			update QEMU groups lists
      		vfio_kvm_device_add_group
      
      Then (example for pseries) the machine reset hook triggers region_add()
      for all regions where listeners from (1) are listening:
      
      ppc_spapr_reset
      	spapr_phb_reset
      		spapr_tce_table_enable
      			memory_region_add_subregion
      				vfio_listener_region_add
      					vfio_spapr_create_window
      
      This scheme works fine until we need to handle VFIO PCI device hotplug
      and want to enable PPC64/sPAPR in-kernel TCE acceleration,
      i.e. after PCI hotplug we need a place to call
      ioctl(vfio_kvm_device_fd, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE).
      Since the ioctl needs a LIOBN fd (from sPAPRTCETable) and an IOMMU group fd
      (from VFIOGroup), vfio_listener_region_add() seems to be the only place
      for this ioctl().
      
      However this only works at boot time because the machine reset
      happens strictly after all devices are finalized. When hotplug happens,
      vfio_listener_region_add() is called when a memory listener is registered,
      but at that point:
      1. the new group has not yet been added to container->group_list;
      2. the VFIO KVM device is unaware of the new IOMMU group.
      
      This moves bits around to have all necessary VFIO infrastructure
      in place for both initial startup and hotplug cases.
      
      [aw: ie, register vfio groups with kvm prior to memory listener
      registration such that kvm-vfio pseudo device ioctls are available
      during the region_add callback]
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      8c37faa4
  11. 14 Jul 2017 (1 commit)
  12. 11 Jul 2017 (1 commit)
    • vfio: Test realized when using VFIOGroup.device_list iterator · 7da624e2
      Authored by Alex Williamson
      VFIOGroup.device_list is effectively our reference tracking mechanism
      such that we can teardown a group when all of the device references
      are removed.  However, we also use this list from our machine reset
      handler for processing resets that affect multiple devices.  Generally
      device removals are fully processed (exitfn + finalize) when this
      reset handler is invoked, however if the removal is triggered via
      another reset handler (piix4_reset->acpi_pcihp_reset) then the device
      exitfn may run, but not finalize.  In this case we hit asserts when
      we start trying to access PCI helpers, since much of the PCI state of
      the device has been released.  To resolve this, add a pointer to the
      DeviceState in our common base device and skip non-realized devices
      as we iterate.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      7da624e2
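The fix reduces to a guard in the iteration: any device whose DeviceState is no longer realized is skipped before its PCI state is touched. A minimal sketch with simplified stand-in types (not QEMU's VFIODevice/DeviceState):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct DeviceStateSketch { bool realized; } DeviceStateSketch;
typedef struct VFIODeviceSketch  { DeviceStateSketch *dev; } VFIODeviceSketch;

/* Walk the group's device list at reset time, counting only devices that
 * are still safe to reset. */
static int count_resettable(VFIODeviceSketch *list, size_t n)
{
    int resettable = 0;
    for (size_t i = 0; i < n; i++) {
        if (!list[i].dev->realized) {
            continue;   /* exitfn has run; PCI state may already be gone */
        }
        resettable++;   /* safe to use PCI helpers on this device */
    }
    return resettable;
}
```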
  13. 26 May 2017 (1 commit)
  14. 04 May 2017 (2 commits)
    • vfio: enable 8-byte reads/writes to vfio · 38d49e8c
      Authored by Jose Ricardo Ziviani
      This patch enables 8-byte writes and reads to VFIO. The implementation
      is already there, but it is missing the 'case' to handle such accesses in
      both vfio_region_write and vfio_region_read, as well as the MemoryRegionOps
      impl.max_access_size and impl.min_access_size settings.
      
      After this patch, 8-byte writes such as:
      
      qemu_mutex_lock locked mutex 0x10905ad8
      vfio_region_write  (0001:03:00.0:region1+0xc0, 0x4140c, 4)
      vfio_region_write  (0001:03:00.0:region1+0xc4, 0xa0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      go like this:
      
      qemu_mutex_lock locked mutex 0x10905ad8
      vfio_region_write  (0001:03:00.0:region1+0xc0, 0xbfd0008, 8)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      38d49e8c
    • vfio: Set MemoryRegionOps:max_access_size and min_access_size · 15126cba
      Authored by Jose Ricardo Ziviani
      Sets valid.max_access_size and valid.min_access_size to ensure safe
      8-byte accesses to vfio. Today, 8-byte accesses are broken into pairs
      of 4-byte calls that go unprotected:
      
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc0, 0x2020c, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc4, 0xa0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      which occasionally leads to:
      
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc0, 0x2030c, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc0, 0x1000c, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc4, 0xb0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc4, 0xa0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      causing strange errors in the guest OS. With this patch, such accesses
      are protected by the same lock guard:
      
      qemu_mutex_lock locked mutex 0x10905ad8
      vfio_region_write  (0001:03:00.0:region1+0xc0, 0x2000c, 4)
      vfio_region_write  (0001:03:00.0:region1+0xc4, 0xb0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      This happens because the 8-byte write should be broken into 4-byte
      writes by memory.c:access_with_adjusted_size() in order to be under
      the same lock. Today, it's done in exec.c:address_space_write_continue()
      which was able to handle only 4 bytes due to a zero'ed
      valid.max_access_size (see exec.c:memory_access_size()).
      Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      15126cba
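The splitting behavior these two commits change can be modeled as a small counter: memory.c's access_with_adjusted_size() breaks an access into chunks no larger than max_access_size, and each chunk becomes one locked region read/write. This sketch (illustrative names, not QEMU code) shows why raising the limit from an effective 4 to 8 turns two unprotected 4-byte calls into a single 8-byte one:

```c
/* Count how many sub-accesses a transaction of `size` bytes produces
 * when each sub-access is capped at `max_access_size` bytes. */
static unsigned subaccess_count(unsigned size, unsigned max_access_size)
{
    unsigned count = 0;
    while (size > 0) {
        unsigned chunk = size < max_access_size ? size : max_access_size;
        size -= chunk;
        count++;    /* one vfio_region_write/read call per chunk */
    }
    return count;
}
```

With max_access_size 4, an 8-byte guest access yields two separate calls (the interleaving hazard shown in the traces above); with max_access_size 8, it yields one.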
  15. 21 Apr 2017 (1 commit)
    • memory: add section range info for IOMMU notifier · 698feb5e
      Authored by Peter Xu
      In this patch, IOMMUNotifier.{start|end} are introduced to store section
      information for a specific notifier. When a notification occurs, we not
      only check the notification type (MAP|UNMAP), but also check whether the
      notified iova range overlaps with the range of the specific IOMMU
      notifier, and skip those notifiers whose range it does not touch.
      
      When removing a region, we need to make sure we remove the correct
      VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
      
      This patch solves the problem that vfio-pci devices receive
      duplicated UNMAP notifications on x86 platforms when a vIOMMU is
      present. The issue is that the x86 IOMMU has a (0, 2^64-1) IOMMU
      region, which is split by the (0xfee00000, 0xfeefffff) IRQ region.
      AFAIK this (split IOMMU region) only happens on x86.
      
      This patch also helps vhost leverage the new interface, so
      that vhost won't get duplicated cache flushes. In that sense, it is a
      slight performance improvement.
      Suggested-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <1491562755-23867-2-git-send-email-peterx@redhat.com>
      [ehabkost: included extra vhost_iommu_region_del() change from Peter Xu]
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      698feb5e
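The overlap test at the heart of this commit is a classic interval check: a notification for [iova, iova_end] is delivered only if it intersects the notifier's registered [start, end] window (inclusive bounds). A minimal sketch, with names chosen for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* True when the notified iova range overlaps the notifier's window. */
static bool notifier_in_range(uint64_t start, uint64_t end,
                              uint64_t iova, uint64_t iova_end)
{
    return !(iova_end < start || iova > end);
}
```

On x86, a notifier registered for the sub-region below the IRQ hole (end 0xfedfffff) would thus skip an UNMAP covering (0xfee00000, 0xfeefffff), which is exactly how the duplicated notifications are avoided.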
  16. 18 Feb 2017 (3 commits)
  17. 31 Oct 2016 (3 commits)
    • vfio: Add support for mmapping sub-page MMIO BARs · 95251725
      Authored by Yongji Xie
      The kernel commit 05f0c03fbac1 ("vfio-pci: Allow to mmap
      sub-page MMIO BARs if the mmio page is exclusive") now allows VFIO
      to mmap sub-page BARs. This is the corresponding QEMU patch.
      With those patches applied, we can pass sub-page BARs through
      to the guest, which can help improve IO performance for some devices.
      
      In this patch, we expand the MemoryRegions of these sub-page
      MMIO BARs to PAGE_SIZE in vfio_pci_write_config(), so that
      the BARs can be passed to the KVM ioctl KVM_SET_USER_MEMORY_REGION
      with a valid size. The expansion is undone when the base address
      of a sub-page BAR is changed by the guest and is no longer page
      aligned. We also set the priority of these BARs' memory regions
      to zero in case they overlap with BARs that share the same page
      with sub-page BARs in the guest.
      Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      95251725
    • vfio: Handle zero-length sparse mmap ranges · 24acf72b
      Authored by Alex Williamson
      As reported in the link below, a user has a PCI device with a 4KB BAR
      which contains the MSI-X table.  This seems to hit a corner case in
      the kernel where the region reports being mmap capable, but the sparse
      mmap information reports a zero sized range.  It's not entirely clear
      that the kernel is incorrect in doing this, but regardless, we need
      to handle it.  To do this, fill our mmap array only with non-zero
      sized sparse mmap entries and add an error return from the function
      so we can tell the difference between nr_mmaps being zero based on
      sparse mmap info vs lack of sparse mmap info.
      
      NB, this doesn't actually change the behavior of the device, it only
      removes the scary "Failed to mmap ... Performance may be slow" error
      message.  We cannot currently create an mmap over the MSI-X table.
      
      Link: http://lists.nongnu.org/archive/html/qemu-discuss/2016-10/msg00009.html
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      24acf72b
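The fix described above has two parts: keep only non-zero-sized sparse-mmap areas, and return an error when sparse info was present but yielded nothing usable, so the caller can tell that case from "no sparse info at all". A simplified sketch (stand-in structs, not the kernel's vfio_region_sparse_mmap_area):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct AreaSketch { uint64_t offset; uint64_t size; } AreaSketch;

/* Copy only usable (non-zero-sized) sparse mmap areas into `out`.
 * Returns -1 when sparse info existed but every range was empty. */
static int fill_mmaps(const AreaSketch *areas, size_t nr_areas,
                      AreaSketch *out, size_t *nr_mmaps)
{
    *nr_mmaps = 0;
    for (size_t i = 0; i < nr_areas; i++) {
        if (areas[i].size) {
            out[(*nr_mmaps)++] = areas[i];
        }
    }
    return (nr_areas && !*nr_mmaps) ? -1 : 0;
}
```

With the 4KB MSI-X-table BAR from the report, the kernel's single zero-sized range produces nr_mmaps == 0 and an error return, and the scary "Failed to mmap" message is skipped.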
    • memory: Replace skip_dump flag with "ram_device" · 21e00fa5
      Authored by Alex Williamson
      Setting skip_dump on a MemoryRegion allows us to modify one specific
      code path, but the restriction we're trying to address encompasses
      more than that.  If we have a RAM MemoryRegion backed by a physical
      device, it not only restricts our ability to dump that region, but
      also affects how we should manipulate it.  Here we recognize that
      MemoryRegions do not change to sometimes allow dumps and other times
      not, so we replace setting the skip_dump flag with a new initializer
      so that we know exactly the type of region to which we're applying
      this behavior.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      21e00fa5
  18. 18 Oct 2016 (3 commits)
  19. 27 Sep 2016 (1 commit)
    • memory: introduce IOMMUNotifier and its caps · cdb30812
      Authored by Peter Xu
      The IOMMU notifier list is used for notifying IO address mapping
      changes. Currently VFIO is the only user.
      
      However, it is possible that a future consumer like vhost would like
      to listen to only part of the notifications (e.g., cache invalidations).
      
      This patch introduces IOMMUNotifier and IOMMUNotifierFlag bits for
      finer-grained control.
      
      IOMMUNotifier contains a bitfield for the notify consumer describing
      what kind of notification it is interested in. Currently two kinds of
      notifications are defined:
      
      - IOMMU_NOTIFIER_MAP:    for newly mapped entries (additions)
      - IOMMU_NOTIFIER_UNMAP:  for entries to be removed (cache invalidates)
      
      When registering the IOMMU notifier, we need to specify one or multiple
      types of messages to listen to.
      
      When notifications are triggered, its type will be checked against the
      notifier's type bits, and only notifiers with registered bits will be
      notified.
      
      (For any IOMMU implementation, an in-place mapping change should be
       notified with an UNMAP followed by a MAP.)
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <1474606948-14391-2-git-send-email-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cdb30812
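The registered-bits check described above is a simple bitmask intersection: a notifier carries a mask of the event kinds it wants, and an event fires only for notifiers whose mask includes its kind. A minimal sketch; the enum names and values are illustrative, not QEMU's actual IOMMU_NOTIFIER_* definitions.

```c
/* Illustrative flag bits, mirroring the two kinds the commit defines. */
typedef enum {
    NOTIFIER_MAP_SKETCH   = 1 << 0,  /* newly mapped entries */
    NOTIFIER_UNMAP_SKETCH = 1 << 1,  /* entries removed (cache invalidates) */
} NotifierFlagSketch;

/* An event of kind `event` reaches a notifier only if that kind is in
 * the notifier's registered mask. */
static int should_notify(int registered, int event)
{
    return (registered & event) != 0;
}
```

A vhost-style consumer registering only NOTIFIER_UNMAP_SKETCH would thus never see MAP events, while a VFIO-style consumer registers both bits.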
  20. 12 Jul 2016 (1 commit)
  21. 05 Jul 2016 (3 commits)
    • vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) · 2e4109de
      Authored by Alexey Kardashevskiy
      The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
      management. This adds the ability for VFIO common code to dynamically
      allocate/remove DMA windows in the host kernel when a new VFIO
      container is added/removed.
      
      This adds a helper to vfio_listener_region_add which issues the
      VFIO_IOMMU_SPAPR_TCE_CREATE ioctl and adds the just-created window
      to the host IOMMU window list; the opposite action is taken in
      vfio_listener_region_del.
      
      When creating a new window, a heuristic is used to decide on the
      number of TCE table levels.
      
      This should cause no guest visible change in behavior.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      [dwg: Added some casts to prevent printf() warnings on certain targets
       where the kernel headers' __u64 doesn't match uint64_t or PRIx64]
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      2e4109de
    • vfio: Add host side DMA window capabilities · f4ec5e26
      Authored by Alexey Kardashevskiy
      There are going to be multiple IOMMUs per container. This moves
      the single host IOMMU parameter set to a list of VFIOHostDMAWindow.
      
      This should cause no behavioral change and will be used later by
      the SPAPR TCE IOMMU v2, which will also add a vfio_host_win_del() helper.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      f4ec5e26
    • vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) · 318f67ce
      Authored by Alexey Kardashevskiy
      This makes use of the new "memory registering" feature. The idea is
      to give userspace the ability to notify the host kernel about pages
      which are going to be used for DMA. Having this information, the host
      kernel can pin them all once per user process, do locked-pages
      accounting once, and not spend time doing that in real time with
      possible failures which cannot be handled nicely in some cases.
      
      This adds a prereg memory listener which listens on address_space_memory
      and notifies a VFIO container about memory which needs to be
      pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
      
      The feature is only enabled for SPAPR IOMMU v2. Host kernel changes
      are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this
      does not call it when v2 is detected and enabled.
      
      This requires guest RAM blocks to be host-page-size aligned; however
      this is not new, as KVM already requires memory slots to be host page
      size aligned.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [dwg: Fix compile error on 32-bit host]
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      318f67ce
  22. 01 Jul 2016 (1 commit)
  23. 22 Jun 2016 (1 commit)
  24. 17 Jun 2016 (1 commit)