1. 24 Aug 2018 (1 commit)
  2. 21 Aug 2018 (1 commit)
    • vfio/spapr: Allow backing bigger guest IOMMU pages with smaller physical pages · c26bc185
      Authored by Alexey Kardashevskiy
      At the moment the PPC64/pseries guest only supports 4K/64K/16M IOMMU
      pages and the POWER8 CPU supports exactly the same set of page sizes,
      so so far things have worked fine.
      
      However, POWER9 supports a different set of sizes - 4K/64K/2M/1G - and
      the last two - 2M and 1G - are not even allowed in the paravirt interface
      (RTAS DDW), so we always end up using 64K IOMMU pages, although we could
      back the guest's 16MB IOMMU pages with 2MB pages on the host.
      
      This stores the supported host IOMMU page sizes in VFIOContainer and uses
      them later when creating a new DMA window. The system page size
      (normally 64k; 2M/16M/1G if hugepages are used) serves as the upper limit
      of the IOMMU page size.
      
      This changes the type of @pagesize to uint64_t as this is what
      memory_region_iommu_get_min_page_size() returns and clz64() takes.
      
      There should be no behavioral changes on platforms other than pseries.
      The guest will keep using the IOMMU page size selected by the PHB pagesize
      property as this only changes the underlying hardware TCE table
      granularity.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      c26bc185
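The page-size clamping this commit describes can be sketched as a pure function: given the host-supported IOMMU page-size mask and the guest-requested size, pick the largest supported size not exceeding the request. This is a minimal illustration, not QEMU's actual code; the function name and mask encoding (bit N set means 2^N bytes supported) are assumptions for the sketch.

```c
#include <stdint.h>

/* Illustrative sketch: choose the biggest host IOMMU page size that can
 * back a guest IOMMU page of the requested size. Bit N of the mask set
 * means a 2^N-byte page size is supported by the host. */
static uint64_t pick_iommu_pagesize(uint64_t host_pgsizes_mask, uint64_t requested)
{
    /* Drop all sizes larger than the request, then take the highest
     * remaining bit. */
    uint64_t usable = host_pgsizes_mask & ((requested << 1) - 1);
    if (!usable) {
        return 0; /* no compatible host page size */
    }
    return 1ULL << (63 - __builtin_clzll(usable));
}
```

With a POWER9-style mask of 4K/64K/2M/1G, a 16MB guest IOMMU page would be backed by 2MB host pages, matching the scenario in the message above.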
  3. 17 Aug 2018 (2 commits)
    • vfio/ccw/pci: Allow devices to opt-in for ballooning · 238e9172
      Authored by Alex Williamson
      If a vfio assigned device makes use of a physical IOMMU, then memory
      ballooning is necessarily inhibited due to the page pinning, the lack of
      page-level granularity at the IOMMU, and the lack of notifiers to both
      remove the page on balloon inflation and add it back on deflation.
      However, not all devices are backed by a physical IOMMU.  In the case
      of mediated devices, if a vendor driver is well synchronized with the
      guest driver, such that only pages actively used by the guest driver
      are pinned by the host mdev vendor driver, then there should be no
      overlap between pages available for the balloon driver and pages
      actively in use by the device.  Under these conditions, ballooning
      should be safe.
      
      vfio-ccw devices are always mediated devices and always operate under
      the constraints above.  Therefore we can consider all vfio-ccw devices
      as balloon compatible.
      
      The situation is far from straightforward with vfio-pci.  These
      devices can be physical devices with physical IOMMU backing or
      mediated devices where it is unknown whether a physical IOMMU is in
      use or whether the vendor driver is well synchronized to the working
      set of the guest driver.  The safest approach is therefore to assume
      all vfio-pci devices are incompatible with ballooning, but allow user
      opt-in should they have further insight into mediated devices.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      238e9172
    • vfio: Inhibit ballooning based on group attachment to a container · c65ee433
      Authored by Alex Williamson
      We use a VFIOContainer to associate an AddressSpace to one or more
      VFIOGroups.  The VFIOContainer represents the DMA context for that
      AddressSpace for those VFIOGroups and is synchronized to changes in
      that AddressSpace via a MemoryListener.  For IOMMU backed devices,
      maintaining the DMA context for a VFIOGroup generally involves
      pinning a host virtual address in order to create a stable host
      physical address and then mapping a translation from the associated
      guest physical address to that host physical address into the IOMMU.
      
      While the above maintains the VFIOContainer synchronized to the QEMU
      memory API of the VM, memory ballooning occurs outside of that API.
      Inflating the memory balloon (ie. cooperatively capturing pages from
      the guest for use by the host) simply uses MADV_DONTNEED to "zap"
      pages from QEMU's host virtual address space.  The page pinning and
      IOMMU mapping above remain in place, negating the host's ability to
      reuse the page, but the host virtual to host physical mapping of the
      page is invalidated outside of QEMU's memory API.
      
      When the balloon is later deflated, attempting to cooperatively
      return pages to the guest, the page is simply freed by the guest
      balloon driver, allowing it to be used in the guest and incurring a
      page fault when that occurs.  The page fault maps a new host physical
      page backing the existing host virtual address, meanwhile the
      VFIOContainer still maintains the translation to the original host
      physical address.  At this point the guest vCPU and any assigned
      devices will map different host physical addresses to the same guest
      physical address.  Badness.
      
      The IOMMU typically does not have page level granularity with which
      it can track this mapping without also incurring inefficiencies in
      using page size mappings throughout.  MMU notifiers in the host
      kernel also provide indicators for invalidating the mapping on
      balloon inflation, not for updating the mapping when the balloon is
      deflated.  For these reasons we assume a default behavior that the
      mapping of each VFIOGroup into the VFIOContainer is incompatible
      with memory ballooning and increment the balloon inhibitor to match
      the attached VFIOGroups.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      c65ee433
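The default policy above amounts to reference counting: each attached VFIOGroup bumps a global balloon inhibitor, and a detach drops it, so ballooning stays disabled while any group is attached. A minimal sketch (simplified stand-ins, not QEMU's actual qemu_balloon_inhibit() plumbing):

```c
#include <stdbool.h>

/* Global inhibitor count; ballooning is allowed only when it is zero. */
static int balloon_inhibit_count;

static void balloon_inhibit_sketch(bool state)
{
    balloon_inhibit_count += state ? 1 : -1;
}

static bool balloon_inhibited_sketch(void)
{
    return balloon_inhibit_count > 0;
}

/* Each VFIOGroup attach/detach adjusts the inhibitor to match. */
static void vfio_group_attach_sketch(void) { balloon_inhibit_sketch(true); }
static void vfio_group_detach_sketch(void) { balloon_inhibit_sketch(false); }
```

With two groups attached, detaching one still leaves ballooning inhibited; only when the last group detaches does the balloon become usable again.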
  4. 15 Jun 2018 (1 commit)
    • iommu: Add IOMMU index argument to notifier APIs · cb1efcf4
      Authored by Peter Maydell
      Add support for multiple IOMMU indexes to the IOMMU notifier APIs.
      When initializing a notifier with iommu_notifier_init(), the caller
      must pass the IOMMU index that it is interested in. When a change
      happens, the IOMMU implementation must pass
      memory_region_notify_iommu() the IOMMU index that has changed and
      that notifiers must be called for.
      
      IOMMUs which support only a single index don't need to change.
      Callers which only really support working with IOMMUs with a single
      index can use the result of passing MEMTXATTRS_UNSPECIFIED to
      memory_region_iommu_attrs_to_index().
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Message-id: 20180604152941.20374-3-peter.maydell@linaro.org
      cb1efcf4
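The per-index dispatch described above can be sketched with simplified types: each notifier records the IOMMU index it registered for, and a change notification is delivered only to notifiers whose index matches. The struct and function names are illustrative stand-ins, not QEMU's real IOMMUNotifier API.

```c
#include <stddef.h>

/* Simplified notifier: the iommu_idx field plays the role of the index
 * passed at iommu_notifier_init() time in the real API. */
typedef struct IOMMUNotifierSketch {
    int iommu_idx;  /* index this notifier is interested in */
    int notified;   /* how many events this notifier has received */
} IOMMUNotifierSketch;

/* Deliver a change on index changed_idx only to matching notifiers. */
static void notify_iommu_sketch(IOMMUNotifierSketch *notifiers, size_t n,
                                int changed_idx)
{
    for (size_t i = 0; i < n; i++) {
        if (notifiers[i].iommu_idx == changed_idx) {
            notifiers[i].notified++;
        }
    }
}
```

An IOMMU with a single index degenerates to the old behavior: every notifier registers index 0 and every event is delivered to all of them.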
  5. 31 May 2018 (1 commit)
  6. 06 Apr 2018 (1 commit)
    • vfio: Use a trace point when a RAM section cannot be DMA mapped · 5c086005
      Authored by Eric Auger
      Commit 567b5b30 ("vfio/pci: Relax DMA map errors for MMIO regions")
      added an error message if a passed memory section address or size
      is not aligned to the page size and thus cannot be DMA mapped.
      
      This patch fixes the trace by printing the region name and the
      memory region section offset within the address space (instead of
      offset_within_region).
      
      We also turn the error_report into a trace event. Indeed, in some
      cases the traces can confuse non-expert end-users into thinking
      the use case does not work (whereas it works as before).
      
      This is the case where a BAR is successively mapped at different
      GPAs and its sections are not compatible with DMA mapping. The listener
      is called several times and traces are issued for each intermediate
      mapping.  The end-user cannot easily match those GPAs against the
      final GPA output by lspci. So let's keep this information for
      informed users. In the mid term, the plan is to advise the user about
      BAR relocation relevance.
      
      Fixes: 567b5b30 ("vfio/pci: Relax DMA map errors for MMIO regions")
      Signed-off-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      5c086005
  7. 14 Mar 2018 (3 commits)
  8. 07 Feb 2018 (3 commits)
  9. 14 Dec 2017 (3 commits)
  10. 18 Jul 2017 (1 commit)
    • vfio-pci, ppc64/spapr: Reorder group-to-container attaching · 8c37faa4
      Authored by Alexey Kardashevskiy
      At the moment VFIO PCI device initialization works as follows:
      vfio_realize
      	vfio_get_group
      		vfio_connect_container
      			register memory listeners (1)
      			update QEMU groups lists
      		vfio_kvm_device_add_group
      
      Then (example for pseries) the machine reset hook triggers region_add()
      for all regions where listeners from (1) are listening:
      
      ppc_spapr_reset
      	spapr_phb_reset
      		spapr_tce_table_enable
      			memory_region_add_subregion
      				vfio_listener_region_add
      					vfio_spapr_create_window
      
      This scheme works fine until we need to handle VFIO PCI device hotplug
      and want to enable PPC64/sPAPR in-kernel TCE acceleration,
      i.e. after PCI hotplug we need a place to call
      ioctl(vfio_kvm_device_fd, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE).
      Since the ioctl needs a LIOBN fd (from sPAPRTCETable) and an IOMMU group fd
      (from VFIOGroup), vfio_listener_region_add() seems to be the only place
      for this ioctl().
      
      However this only works at boot time because the machine reset
      happens strictly after all devices are finalized. When hotplug happens,
      vfio_listener_region_add() is called when a memory listener is registered,
      but at that point:
      1. the new group has not yet been added to container->group_list;
      2. the VFIO KVM device is unaware of the new IOMMU group.
      
      This moves bits around to have all necessary VFIO infrastructure
      in place for both initial startup and hotplug cases.
      
      [aw: ie, register vfio groups with kvm prior to memory listener
      registration such that kvm-vfio pseudo device ioctls are available
      during the region_add callback]
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      8c37faa4
  11. 14 Jul 2017 (1 commit)
  12. 11 Jul 2017 (1 commit)
    • vfio: Test realized when using VFIOGroup.device_list iterator · 7da624e2
      Authored by Alex Williamson
      VFIOGroup.device_list is effectively our reference tracking mechanism
      such that we can teardown a group when all of the device references
      are removed.  However, we also use this list from our machine reset
      handler for processing resets that affect multiple devices.  Generally
      device removals are fully processed (exitfn + finalize) when this
      reset handler is invoked, however if the removal is triggered via
      another reset handler (piix4_reset->acpi_pcihp_reset) then the device
      exitfn may run, but not finalize.  In this case we hit asserts when
      we start trying to access PCI helpers, since much of the PCI state of
      the device has been released.  To resolve this, add a pointer to the
      DeviceState in our common base device and skip non-realized devices
      as we iterate.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      7da624e2
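The fix reduces to a guard in the iteration: any device whose DeviceState is no longer realized is skipped before its PCI state is touched. A minimal sketch with simplified stand-in types (not QEMU's VFIODevice/DeviceState):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct DeviceStateSketch { bool realized; } DeviceStateSketch;
typedef struct VFIODeviceSketch  { DeviceStateSketch *dev; } VFIODeviceSketch;

/* Walk the group's device list at reset time, counting only devices that
 * are still safe to reset. */
static int count_resettable(VFIODeviceSketch *list, size_t n)
{
    int resettable = 0;
    for (size_t i = 0; i < n; i++) {
        if (!list[i].dev->realized) {
            continue;   /* exitfn has run; PCI state may already be gone */
        }
        resettable++;   /* safe to use PCI helpers on this device */
    }
    return resettable;
}
```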
  13. 26 May 2017 (1 commit)
  14. 04 May 2017 (2 commits)
    • vfio: enable 8-byte reads/writes to vfio · 38d49e8c
      Authored by Jose Ricardo Ziviani
      This patch enables 8-byte writes and reads to VFIO. The implementation
      is already there, but it is missing the 'case' to handle such accesses in
      both vfio_region_write and vfio_region_read, as well as the MemoryRegionOps
      impl.max_access_size and impl.min_access_size settings.
      
      After this patch, 8-byte writes such as:
      
      qemu_mutex_lock locked mutex 0x10905ad8
      vfio_region_write  (0001:03:00.0:region1+0xc0, 0x4140c, 4)
      vfio_region_write  (0001:03:00.0:region1+0xc4, 0xa0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      go like this:
      
      qemu_mutex_lock locked mutex 0x10905ad8
      vfio_region_write  (0001:03:00.0:region1+0xc0, 0xbfd0008, 8)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      38d49e8c
    • vfio: Set MemoryRegionOps:max_access_size and min_access_size · 15126cba
      Authored by Jose Ricardo Ziviani
      Sets valid.max_access_size and valid.min_access_size to ensure safe
      8-byte accesses to vfio. Today, 8-byte accesses are broken into pairs
      of 4-byte calls that go unprotected:
      
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc0, 0x2020c, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc4, 0xa0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      which occasionally leads to:
      
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc0, 0x2030c, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc0, 0x1000c, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc4, 0xb0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc4, 0xa0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      causing strange errors in the guest OS. With this patch, such accesses
      are protected by the same lock guard:
      
      qemu_mutex_lock locked mutex 0x10905ad8
      vfio_region_write  (0001:03:00.0:region1+0xc0, 0x2000c, 4)
      vfio_region_write  (0001:03:00.0:region1+0xc4, 0xb0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      This happens because the 8-byte write should be broken into 4-byte
      writes by memory.c:access_with_adjusted_size() in order to be under
      the same lock. Today, it's done in exec.c:address_space_write_continue()
      which was able to handle only 4 bytes due to a zero'ed
      valid.max_access_size (see exec.c:memory_access_size()).
      Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      15126cba
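The splitting behavior these two commits change can be modeled as a small counter: memory.c's access_with_adjusted_size() breaks an access into chunks no larger than max_access_size, and each chunk becomes one locked region read/write. This sketch (illustrative names, not QEMU code) shows why raising the limit from an effective 4 to 8 turns two unprotected 4-byte calls into a single 8-byte one:

```c
/* Count how many sub-accesses a transaction of `size` bytes produces
 * when each sub-access is capped at `max_access_size` bytes. */
static unsigned subaccess_count(unsigned size, unsigned max_access_size)
{
    unsigned count = 0;
    while (size > 0) {
        unsigned chunk = size < max_access_size ? size : max_access_size;
        size -= chunk;
        count++;    /* one vfio_region_write/read call per chunk */
    }
    return count;
}
```

With max_access_size 4, an 8-byte guest access yields two separate calls (the interleaving hazard shown in the traces above); with max_access_size 8, it yields one.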
  15. 21 Apr 2017 (1 commit)
    • memory: add section range info for IOMMU notifier · 698feb5e
      Authored by Peter Xu
      In this patch, IOMMUNotifier.{start|end} are introduced to store section
      information for a specific notifier. When a notification occurs, we not
      only check the notification type (MAP|UNMAP), but also check whether the
      notified iova range overlaps with the range of the specific IOMMU
      notifier, and skip those notifiers whose range it does not touch.
      
      When removing a region, we need to make sure we remove the correct
      VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
      
      This patch solves the problem that vfio-pci devices receive
      duplicated UNMAP notifications on x86 platforms when a vIOMMU is
      present. The issue is that the x86 IOMMU has a (0, 2^64-1) IOMMU
      region, which is split by the (0xfee00000, 0xfeefffff) IRQ region.
      AFAIK this (split IOMMU region) only happens on x86.
      
      This patch also helps vhost leverage the new interface, so
      that vhost won't get duplicated cache flushes. In that sense, it is a
      slight performance improvement.
      Suggested-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <1491562755-23867-2-git-send-email-peterx@redhat.com>
      [ehabkost: included extra vhost_iommu_region_del() change from Peter Xu]
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      698feb5e
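The overlap test at the heart of this commit is a classic interval check: a notification for [iova, iova_end] is delivered only if it intersects the notifier's registered [start, end] window (inclusive bounds). A minimal sketch, with names chosen for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* True when the notified iova range overlaps the notifier's window. */
static bool notifier_in_range(uint64_t start, uint64_t end,
                              uint64_t iova, uint64_t iova_end)
{
    return !(iova_end < start || iova > end);
}
```

On x86, a notifier registered for the sub-region below the IRQ hole (end 0xfedfffff) would thus skip an UNMAP covering (0xfee00000, 0xfeefffff), which is exactly how the duplicated notifications are avoided.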
  16. 18 Feb 2017 (3 commits)
  17. 31 Oct 2016 (3 commits)
    • vfio: Add support for mmapping sub-page MMIO BARs · 95251725
      Authored by Yongji Xie
      The kernel commit 05f0c03fbac1 ("vfio-pci: Allow to mmap
      sub-page MMIO BARs if the mmio page is exclusive") now allows VFIO
      to mmap sub-page BARs. This is the corresponding QEMU patch.
      With those patches applied, we can pass sub-page BARs through
      to the guest, which can help improve IO performance for some devices.
      
      In this patch, we expand the MemoryRegions of these sub-page
      MMIO BARs to PAGE_SIZE in vfio_pci_write_config(), so that
      the BARs can be passed to the KVM ioctl KVM_SET_USER_MEMORY_REGION
      with a valid size. The expansion is undone when the base address
      of a sub-page BAR is changed by the guest and is no longer page
      aligned. We also set the priority of these BARs' memory regions
      to zero in case they overlap with BARs that share the same page
      with sub-page BARs in the guest.
      Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      95251725
    • vfio: Handle zero-length sparse mmap ranges · 24acf72b
      Authored by Alex Williamson
      As reported in the link below, a user has a PCI device with a 4KB BAR
      which contains the MSI-X table.  This seems to hit a corner case in
      the kernel where the region reports being mmap capable, but the sparse
      mmap information reports a zero sized range.  It's not entirely clear
      that the kernel is incorrect in doing this, but regardless, we need
      to handle it.  To do this, fill our mmap array only with non-zero
      sized sparse mmap entries and add an error return from the function
      so we can tell the difference between nr_mmaps being zero based on
      sparse mmap info vs lack of sparse mmap info.
      
      NB, this doesn't actually change the behavior of the device, it only
      removes the scary "Failed to mmap ... Performance may be slow" error
      message.  We cannot currently create an mmap over the MSI-X table.
      
      Link: http://lists.nongnu.org/archive/html/qemu-discuss/2016-10/msg00009.html
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      24acf72b
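The fix described above has two parts: keep only non-zero-sized sparse-mmap areas, and return an error when sparse info was present but yielded nothing usable, so the caller can tell that case from "no sparse info at all". A simplified sketch (stand-in structs, not the kernel's vfio_region_sparse_mmap_area):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct AreaSketch { uint64_t offset; uint64_t size; } AreaSketch;

/* Copy only usable (non-zero-sized) sparse mmap areas into `out`.
 * Returns -1 when sparse info existed but every range was empty. */
static int fill_mmaps(const AreaSketch *areas, size_t nr_areas,
                      AreaSketch *out, size_t *nr_mmaps)
{
    *nr_mmaps = 0;
    for (size_t i = 0; i < nr_areas; i++) {
        if (areas[i].size) {
            out[(*nr_mmaps)++] = areas[i];
        }
    }
    return (nr_areas && !*nr_mmaps) ? -1 : 0;
}
```

With the 4KB MSI-X-table BAR from the report, the kernel's single zero-sized range produces nr_mmaps == 0 and an error return, and the scary "Failed to mmap" message is skipped.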
    • memory: Replace skip_dump flag with "ram_device" · 21e00fa5
      Authored by Alex Williamson
      Setting skip_dump on a MemoryRegion allows us to modify one specific
      code path, but the restriction we're trying to address encompasses
      more than that.  If we have a RAM MemoryRegion backed by a physical
      device, it not only restricts our ability to dump that region, but
      also affects how we should manipulate it.  Here we recognize that
      MemoryRegions do not change to sometimes allow dumps and other times
      not, so we replace setting the skip_dump flag with a new initializer
      so that we know exactly the type of region to which we're applying
      this behavior.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      21e00fa5
  18. 18 Oct 2016 (3 commits)
  19. 27 Sep 2016 (1 commit)
    • memory: introduce IOMMUNotifier and its caps · cdb30812
      Authored by Peter Xu
      The IOMMU notifier list is used for notifying IO address mapping
      changes. Currently VFIO is the only user.
      
      However, it is possible that a future consumer like vhost would like
      to listen to only part of the notifications (e.g., cache invalidations).
      
      This patch introduces IOMMUNotifier and IOMMUNotifierFlag bits for
      finer-grained control.
      
      IOMMUNotifier contains a bitfield for the notify consumer describing
      what kind of notification it is interested in. Currently two kinds of
      notifications are defined:
      
      - IOMMU_NOTIFIER_MAP:    for newly mapped entries (additions)
      - IOMMU_NOTIFIER_UNMAP:  for entries to be removed (cache invalidates)
      
      When registering the IOMMU notifier, we need to specify one or multiple
      types of messages to listen to.
      
      When notifications are triggered, its type will be checked against the
      notifier's type bits, and only notifiers with registered bits will be
      notified.
      
      (For any IOMMU implementation, an in-place mapping change should be
       notified with an UNMAP followed by a MAP.)
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <1474606948-14391-2-git-send-email-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cdb30812
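The registered-bits check described above is a simple bitmask intersection: a notifier carries a mask of the event kinds it wants, and an event fires only for notifiers whose mask includes its kind. A minimal sketch; the enum names and values are illustrative, not QEMU's actual IOMMU_NOTIFIER_* definitions.

```c
/* Illustrative flag bits, mirroring the two kinds the commit defines. */
typedef enum {
    NOTIFIER_MAP_SKETCH   = 1 << 0,  /* newly mapped entries */
    NOTIFIER_UNMAP_SKETCH = 1 << 1,  /* entries removed (cache invalidates) */
} NotifierFlagSketch;

/* An event of kind `event` reaches a notifier only if that kind is in
 * the notifier's registered mask. */
static int should_notify(int registered, int event)
{
    return (registered & event) != 0;
}
```

A vhost-style consumer registering only NOTIFIER_UNMAP_SKETCH would thus never see MAP events, while a VFIO-style consumer registers both bits.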
  20. 12 Jul 2016 (1 commit)
  21. 05 Jul 2016 (3 commits)
    • vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2) · 2e4109de
      Authored by Alexey Kardashevskiy
      The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
      management. This adds the ability for VFIO common code to dynamically
      allocate/remove DMA windows in the host kernel when a new VFIO
      container is added/removed.
      
      This adds a helper to vfio_listener_region_add which issues the
      VFIO_IOMMU_SPAPR_TCE_CREATE ioctl and adds the just-created window
      to the host IOMMU window list; the opposite action is taken in
      vfio_listener_region_del.
      
      When creating a new window, a heuristic is used to decide on the
      number of TCE table levels.
      
      This should cause no guest visible change in behavior.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      [dwg: Added some casts to prevent printf() warnings on certain targets
       where the kernel headers' __u64 doesn't match uint64_t or PRIx64]
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      2e4109de
    • vfio: Add host side DMA window capabilities · f4ec5e26
      Authored by Alexey Kardashevskiy
      There are going to be multiple IOMMUs per container. This moves
      the single host IOMMU parameter set to a list of VFIOHostDMAWindow.
      
      This should cause no behavioral change and will be used later by
      the SPAPR TCE IOMMU v2, which will also add a vfio_host_win_del() helper.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      f4ec5e26
    • vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2) · 318f67ce
      Authored by Alexey Kardashevskiy
      This makes use of the new "memory registering" feature. The idea is
      to give userspace the ability to notify the host kernel about pages
      which are going to be used for DMA. Having this information, the host
      kernel can pin them all once per user process, do locked-pages
      accounting once, and not spend time doing that in real time with
      possible failures which cannot be handled nicely in some cases.
      
      This adds a prereg memory listener which listens on address_space_memory
      and notifies a VFIO container about memory which needs to be
      pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
      
      The feature is only enabled for SPAPR IOMMU v2. Host kernel changes
      are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this
      does not call it when v2 is detected and enabled.
      
      This requires guest RAM blocks to be host-page-size aligned; however
      this is not new, as KVM already requires memory slots to be host page
      size aligned.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [dwg: Fix compile error on 32-bit host]
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      318f67ce
  22. 01 Jul 2016 (1 commit)
  23. 22 Jun 2016 (1 commit)
  24. 17 Jun 2016 (1 commit)