Commits · 17182bb47fe62ac6a005b325a7007488056f3a2d · openeuler / qemu

24 Aug, 2018 · 2 commits

vfio/pci: Fix failure to close file descriptor on error · 8709b395

Committed by Alex Williamson on Aug 23, 2018

A new error path fails to close the device file descriptor when
triggered by a ballooning incompatibility within the group. Fix it.

Fixes: 238e9172 ("vfio/ccw/pci: Allow devices to opt-in for ballooning")
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
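
The class of bug fixed here can be illustrated with a minimal sketch. This is not the QEMU code: `open_device_checked` and its `compatible` flag are hypothetical stand-ins for the device open and the ballooning compatibility check. The point is only that every error path taken after open() has succeeded must close the descriptor.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a QEMU function): open a device node and
 * validate it. Returns the descriptor on success, -1 on failure.
 * On every failure after open() succeeded, the descriptor is closed
 * before returning, so nothing leaks.
 */
static int open_device_checked(const char *path, int compatible)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        return -1;              /* nothing to clean up yet */
    }

    if (!compatible) {
        /* The error path that must not leak the descriptor. */
        close(fd);
        return -1;
    }

    return fd;
}
```

A caller that gets -1 back owns nothing; a caller that gets a valid descriptor is responsible for closing it.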

vfio/pci: Handle subsystem realpath() returning NULL · a1c0f886

Committed by Alex Williamson on Aug 23, 2018

Fix error reported by Coverity where realpath can return NULL,
resulting in a segfault in strcmp().  This should never happen given
that we're working through regularly structured sysfs paths, but
trivial enough to easily avoid.

Fixes: 238e9172 ("vfio/ccw/pci: Allow devices to opt-in for ballooning")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

21 Aug, 2018 · 1 commit

vfio/spapr: Allow backing bigger guest IOMMU pages with smaller physical pages · c26bc185

Committed by Alexey Kardashevskiy on Jun 20, 2018

At the moment the PPC64/pseries guest only supports 4K/64K/16M IOMMU
pages, and POWER8 CPUs support the exact same set of page sizes, so
things have worked fine so far.

However, POWER9 supports a different set of sizes, 4K/64K/2M/1G, and
the last two (2M and 1G) are not even allowed in the paravirt interface
(RTAS DDW), so we always end up using 64K IOMMU pages, although we could
back the guest's 16MB IOMMU pages with 2MB pages on the host.

This stores the supported host IOMMU page sizes in VFIOContainer and uses
this later when creating a new DMA window. This uses the system page size
(64k normally, 2M/16M/1G if hugepages used) as the upper limit of
the IOMMU pagesize.

This changes the type of @pagesize to uint64_t as this is what
memory_region_iommu_get_min_page_size() returns and clz64() takes.

There should be no behavioral changes on platforms other than pseries.
The guest will keep using the IOMMU page size selected by the PHB pagesize
property as this only changes the underlying hardware TCE table
granularity.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
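
The page size selection described above can be sketched as follows. This is an illustration under stated assumptions, not the QEMU implementation: `pick_host_iommu_pagesize` is a hypothetical name, and the host's supported sizes are assumed to be given as a bitmask in which bit N means "2^N bytes is supported". The helper picks the largest supported size that exceeds neither the guest IOMMU page size nor the system page size.

```c
#include <stdint.h>

/*
 * Hypothetical helper. Returns the largest host IOMMU page size that
 * is <= both the guest IOMMU page size and the system (RAM backing)
 * page size, or 0 if the host supports no such size.
 * guest_pagesize and system_pagesize must be non-zero powers of two.
 */
static uint64_t pick_host_iommu_pagesize(uint64_t host_pgsizes,
                                         uint64_t guest_pagesize,
                                         uint64_t system_pagesize)
{
    uint64_t limit = guest_pagesize;
    uint64_t candidates;
    uint64_t best = 0;

    if (limit > system_pagesize) {
        limit = system_pagesize;    /* system page size is the upper bound */
    }

    /* Keep only the supported sizes that do not exceed the limit. */
    candidates = host_pgsizes & (limit | (limit - 1));

    /* Walk the set bits from lowest to highest; the last one wins. */
    while (candidates) {
        best = candidates & (~candidates + 1);   /* lowest set bit */
        candidates &= candidates - 1;            /* clear that bit */
    }
    return best;
}
```

With a POWER9-like mask of 4K/64K/2M/1G, a 16M guest IOMMU page and 16M backing pages, the helper returns 2M, which matches the case the commit message describes.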

17 Aug, 2018 · 2 commits

vfio/ccw/pci: Allow devices to opt-in for ballooning · 238e9172

Committed by Alex Williamson on Aug 17, 2018

If a vfio assigned device makes use of a physical IOMMU, then memory
ballooning is necessarily inhibited due to the page pinning, the lack of
page level granularity at the IOMMU, and the lack of sufficient notifiers
to both remove the page on balloon inflation and add it back on deflation.
However, not all devices are backed by a physical IOMMU. In the case
of mediated devices, if a vendor driver is well synchronized with the
guest driver, such that only pages actively used by the guest driver
are pinned by the host mdev vendor driver, then there should be no
overlap between pages available for the balloon driver and pages
actively in use by the device. Under these conditions, ballooning
should be safe.

vfio-ccw devices are always mediated devices and always operate under
the constraints above. Therefore we can consider all vfio-ccw devices
as balloon compatible.

The situation is far from straightforward with vfio-pci. These
devices can be physical devices with physical IOMMU backing or
mediated devices where it is unknown whether a physical IOMMU is in
use or whether the vendor driver is well synchronized to the working
set of the guest driver. The safest approach is therefore to assume
all vfio-pci devices are incompatible with ballooning, but allow user
opt-in should they have further insight into mediated devices.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio: Inhibit ballooning based on group attachment to a container · c65ee433

Committed by Alex Williamson on Aug 17, 2018

We use a VFIOContainer to associate an AddressSpace to one or more
VFIOGroups. The VFIOContainer represents the DMA context for that
AddressSpace for those VFIOGroups and is synchronized to changes in
that AddressSpace via a MemoryListener. For IOMMU backed devices,
maintaining the DMA context for a VFIOGroup generally involves
pinning a host virtual address in order to create a stable host
physical address and then mapping a translation from the associated
guest physical address to that host physical address into the IOMMU.

While the above maintains the VFIOContainer synchronized to the QEMU
memory API of the VM, memory ballooning occurs outside of that API.
Inflating the memory balloon (i.e. cooperatively capturing pages from
the guest for use by the host) simply uses MADV_DONTNEED to "zap"
pages from QEMU's host virtual address space. The page pinning and
IOMMU mapping above remains in place, negating the host's ability to
reuse the page, but the host virtual to host physical mapping of the
page is invalidated outside of QEMU's memory API.

When the balloon is later deflated, attempting to cooperatively
return pages to the guest, the page is simply freed by the guest
balloon driver, allowing it to be used in the guest and incurring a
page fault when that occurs. The page fault maps a new host physical
page backing the existing host virtual address, meanwhile the
VFIOContainer still maintains the translation to the original host
physical address. At this point the guest vCPU and any assigned
devices will map different host physical addresses to the same guest
physical address. Badness.

The IOMMU typically does not have page level granularity with which
it can track this mapping without also incurring inefficiencies in
using page size mappings throughout. MMU notifiers in the host
kernel also provide indicators for invalidating the mapping on
balloon inflation, not for updating the mapping when the balloon is
deflated. For these reasons we assume a default behavior that the
mapping of each VFIOGroup into the VFIOContainer is incompatible
with memory ballooning and increment the balloon inhibitor to match
the attached VFIOGroups.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
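
The idea of "incrementing the balloon inhibitor to match the attached VFIOGroups" can be sketched with a simple reference count. All names below are hypothetical and chosen for illustration; QEMU's actual inhibitor interface may look different.

```c
#include <stdbool.h>

/* Hypothetical counter: non-zero means ballooning must not be used. */
static int balloon_inhibit_count;

/* state == true takes one inhibitor reference, false drops one. */
static void balloon_inhibit(bool state)
{
    balloon_inhibit_count += state ? 1 : -1;
}

static bool balloon_is_inhibited(void)
{
    return balloon_inhibit_count > 0;
}

/* Each group attached to a container holds one reference. */
static void group_attach(void)
{
    balloon_inhibit(true);
}

static void group_detach(void)
{
    balloon_inhibit(false);
}
```

Because the inhibitor is a count and not a flag, ballooning only becomes possible again once the last attached group has been detached.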

12 Jul, 2018 · 1 commit

vfio/pci: do not set the PCIDevice 'has_rom' attribute · 26c0ae56

Committed by Cédric Le Goater on Jul 11, 2018

PCI devices needing a ROM allocate an optional MemoryRegion with
pci_add_option_rom(). pci_del_option_rom() does the cleanup when the
device is destroyed. The only action taken by this routine is to call
vmstate_unregister_ram() which clears the id string of the optional
ROM RAMBlock and now also flags the RAMBlock as non-migratable. This
was recently added by commit b895de50 ("migration: discard
non-migratable RAMBlocks").

VFIO devices do their own loading of the PCI option ROM in
vfio_pci_size_rom(). The memory region is switched to an I/O region
and the PCI attribute 'has_rom' is set but the RAMBlock of the ROM
region is not allocated. When the associated PCI device is deleted,
pci_del_option_rom() calls vmstate_unregister_ram() which tries to
flag a NULL RAMBlock, leading to a SEGV.

It seems that 'has_rom' was set to have memory_region_destroy()
called, but since commit 469b046e ("memory: remove
memory_region_destroy") this is not necessary anymore as the
MemoryRegion is freed automagically.

Remove the PCIDevice 'has_rom' attribute setting in vfio.

Fixes: b895de50 ("migration: discard non-migratable RAMBlocks")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

02 Jul, 2018 · 1 commit

hw/vfio: Use the IEC binary prefix definitions · e0255bb1

Committed by Philippe Mathieu-Daudé on Jun 25, 2018

It eases code review, unit is explicit.

Patch generated using:

  $ git grep -E '(1024|2048|4096|8192|(<<|>>).?(10|20|30))' hw/ include/hw/

and modified manually.

Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20180625124238.25339-38-f4bug@amsat.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
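
A small before/after sketch of the kind of change this commit makes. The macros below are local stand-ins defined for the example only, and `rom_size_before` / `rom_size_after` are hypothetical names; the commit itself uses QEMU's own IEC prefix definitions.

```c
#include <stdint.h>

/* Local stand-ins for IEC binary prefixes (powers of 1024). */
#define KiB (INT64_C(1) << 10)
#define MiB (INT64_C(1) << 20)
#define GiB (INT64_C(1) << 30)

/* Before: a bare literal whose unit the reader has to work out. */
static const int64_t rom_size_before = 2048 * 1024;

/* After: the same value with the unit spelled out. */
static const int64_t rom_size_after = 2 * MiB;
```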

18 Jun, 2018 · 1 commit

vfio-ccw: add force unlimited prefetch property · 9a51c9ee

Committed by Halil Pasic on May 24, 2018

There is at least one guest (OS) such that although it does not rely on
the guarantees provided by ORB 1 word 9 bit (aka unlimited prefetch, aka
P bit) not being set, it fails to tell this to the machine.

Usually this ain't a big deal, as the original purpose of the P bit is to
allow for performance optimizations. vfio-ccw however can not provide the
guarantees required if the bit is not set.

It is not possible to implement support for the P bit not set without
transitioning to lower level protocols for vfio-ccw.  So let's give the
user the opportunity to force setting the P bit, if the user knows this
is safe.  For self modifying channel programs forcing the P bit is not
safe.  If the P bit is forced for a self modifying channel program things
are expected to break in strange ways.

Let's also avoid warning multiple times about the P bit not being set in
the ORB when the P bit is not forced, and name the affected vfio-ccw
device in the warning.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Suggested-by: Dong Jia Shi <bjsdjshi@linux.ibm.com>
Acked-by: Jason J. Herne <jjherne@linux.ibm.com>
Tested-by: Jason J. Herne <jjherne@linux.ibm.com>
Message-Id: <20180524175828.3143-2-pasic@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
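
The behavior described above (force the P bit on request, otherwise warn once per device) can be sketched roughly as follows. The structures and field names are simplified, hypothetical stand-ins and not the real ORB or vfio-ccw device layout; rejecting the request when the bit is neither set nor forced is also an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for illustration only. */
struct orb_sketch {
    bool pfch;                  /* the P (unlimited prefetch) bit */
};

struct ccw_dev_sketch {
    const char *name;
    bool force_orb_pfch;        /* user opted in to forcing the P bit */
    bool warned_orb_pfch;       /* warning already issued for this device */
};

/*
 * Returns true if the request may proceed. A request without the P bit
 * is fixed up when forcing is enabled; otherwise it is rejected and the
 * problem is reported only once per device, naming the device.
 */
static bool check_prefetch(struct ccw_dev_sketch *dev, struct orb_sketch *orb)
{
    if (orb->pfch) {
        return true;
    }
    if (dev->force_orb_pfch) {
        orb->pfch = true;
        return true;
    }
    if (!dev->warned_orb_pfch) {
        dev->warned_orb_pfch = true;
        fprintf(stderr, "%s: ORB without unlimited prefetch is not supported\n",
                dev->name);
    }
    return false;
}
```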

15 Jun, 2018 · 1 commit

iommu: Add IOMMU index argument to notifier APIs · cb1efcf4

Committed by Peter Maydell on Jun 15, 2018

Add support for multiple IOMMU indexes to the IOMMU notifier APIs.
When initializing a notifier with iommu_notifier_init(), the caller
must pass the IOMMU index that it is interested in. When a change
happens, the IOMMU implementation must pass
memory_region_notify_iommu() the IOMMU index that has changed and
that notifiers must be called for.

IOMMUs which support only a single index don't need to change.
Callers which only really support working with IOMMUs with a single
index can use the result of passing MEMTXATTRS_UNSPECIFIED to
memory_region_iommu_attrs_to_index().

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20180604152941.20374-3-peter.maydell@linaro.org

05 Jun, 2018 · 5 commits

vfio/pci: Default display option to "off" · 8151a9c5

Committed by Alex Williamson on Jun 05, 2018

Commit a9994687 ("vfio/display: core & wireup") added display
support to vfio-pci with the default being "auto", which breaks
existing VMs when the vGPU requires GL support but had no previous
requirement for a GL compatible configuration.  "Off" is the safer
default as we impose no new requirements to VM configurations.

Fixes: a9994687 ("vfio/display: core & wireup")
Cc: qemu-stable@nongnu.org
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly · 2b1dbd0d

Committed by Alex Williamson on Jun 05, 2018

With vfio ioeventfd support, we can program vfio-pci to perform a
specified BAR write when an eventfd is triggered.  This allows the
KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
userspace handling for these events.  On the same micro-benchmark
where the ioeventfd got us to almost 90% of performance versus
disabling the GeForce quirks, this gets us to within 95%.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio/quirks: ioeventfd quirk acceleration · c958c51d

Committed by Alex Williamson on Jun 05, 2018

The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
in device MMIO space. Normally PCI config space is considered a slow
path and further optimization is unnecessary, however NVIDIA uses a
register here to enable the MSI interrupt to re-trigger. Exiting to
QEMU for this MSI-ACK handling can therefore rate limit our interrupt
handling. Fortunately the MSI-ACK write is easily detected since the
quirk MemoryRegion otherwise has very few accesses, so simply looking
for consecutive writes with the same data is sufficient, in this case
10 consecutive writes with the same data and size is arbitrarily
chosen. We configure the KVM ioeventfd with data match, so there's
no risk of triggering for the wrong data or size, but we do risk that
pathological driver behavior might consume all of QEMU's file
descriptors, so we cap ourselves to 10 ioeventfds for this purpose.

In support of the above, generic ioeventfd infrastructure is added
for vfio quirks. This automatically initializes an ioeventfd list
per quirk, disables and frees ioeventfds on exit, and allows
ioeventfds marked as dynamic to be dropped on device reset. The
rationale for this latter feature is that useful ioeventfds may
depend on specific driver behavior and since we necessarily place a
cap on our use of ioeventfds, a machine reset is a reasonable point
at which to assume a new driver and re-profile.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
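
The detection heuristic described above (10 consecutive identical writes, capped at 10 ioeventfds) can be sketched as a small state machine. The type and function names are hypothetical, and matching on the address in addition to data and size is an assumption of this sketch; actually creating the ioeventfd is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define HITS_FOR_IOEVENTFD 10   /* consecutive identical writes required */
#define MAX_IOEVENTFDS     10   /* cap on ioeventfds for this purpose */

/* Hypothetical per-quirk tracking state. */
struct write_tracker {
    uint64_t addr;
    uint64_t data;
    unsigned size;
    unsigned hits;          /* length of the current run of identical writes */
    unsigned ioeventfds;    /* ioeventfds set up so far */
};

/* Returns true when this write should be promoted to an ioeventfd. */
static bool track_write(struct write_tracker *t, uint64_t addr,
                        uint64_t data, unsigned size)
{
    if (t->hits && t->addr == addr && t->data == data && t->size == size) {
        t->hits++;
    } else {
        /* A different write starts a new run. */
        t->addr = addr;
        t->data = data;
        t->size = size;
        t->hits = 1;
    }

    if (t->hits < HITS_FOR_IOEVENTFD || t->ioeventfds >= MAX_IOEVENTFDS) {
        return false;
    }

    t->ioeventfds++;
    t->hits = 0;
    return true;
}
```

Resetting the tracker on device reset, as the commit message suggests for dynamic ioeventfds, would amount to zeroing this structure.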

vfio/quirks: Add quirk reset callback · 469d02de

Committed by Alex Williamson on Jun 05, 2018

Quirks can be self-modifying; provide a hook to allow them to clean up
on device reset if desired.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio/quirks: Add common quirk alloc helper · bcf3c3d0

Committed by Alex Williamson on Jun 05, 2018

This will later be used to include list initialization.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

01 Jun, 2018 · 1 commit

vfio: Include "exec/address-spaces.h" directly in the source file · d791937f

Committed by Philippe Mathieu-Daudé on May 28, 2018

No declaration in "hw/vfio/vfio-common.h" directly requires including
the "exec/address-spaces.h" header.  To simplify dependencies and
ease the upcoming cleanup of "exec/address-spaces.h", directly include
it in the source file where the declarations are used.

Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20180528232719.4721-2-f4bug@amsat.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

31 May, 2018 · 1 commit

Make address_space_translate{, _cached}() take a MemTxAttrs argument · bc6b1cec

Committed by Peter Maydell on May 31, 2018

As part of plumbing MemTxAttrs down to the IOMMU translate method,
add MemTxAttrs as an argument to address_space_translate()
and address_space_translate_cached(). Callers either have an
attrs value to hand, or don't care and can use MEMTXATTRS_UNSPECIFIED.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20180521140402.23318-4-peter.maydell@linaro.org

30 Apr, 2018 · 1 commit

vfio-ccw: introduce vfio_ccw_get_device() · c96f2c2a

Committed by Greg Kurz on Apr 09, 2018

A recent patch fixed leaks of the dynamically allocated vcdev->vdev.name
field in vfio_ccw_realize(), but we now have three freeing sites for it.
This is unfortunate and seems to indicate something is wrong with its
life cycle.

The root issue is that vcdev->vdev.name is set before vfio_get_device()
is called, which theoretically prevents calling vfio_put_device() to
do the freeing. Actually, we could call it anyway because
vfio_put_base_device() is a nop if the device isn't attached, but this
would be confusing.

This patch hence moves all the logic of attaching the device, including
the "already attached" check, to a separate vfio_ccw_get_device() function,
counterpart of vfio_put_device(). While here, vfio_put_device() is renamed
to vfio_ccw_put_device() for consistency.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <152326891065.266543.9487977590811413472.stgit@bahia.lan>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>

27 Apr, 2018 · 1 commit

ui: introduce vfio_display_reset · 8983e3e3

Committed by Tina Zhang on Apr 27, 2018

During guest OS reboot, the guest framebuffer is invalid. It will cause
bugs if the invalid guest framebuffer is still used by the host.

This patch introduces vfio_display_reset, which is invoked
during vfio display reset. The vfio_display_reset function is used
to release the invalid display resource, disable scanout mode and
replace the invalid surface with QemuConsole's DisplaySurface.

This patch can fix the GPU hang issue caused by gd_egl_draw during
guest OS reboot.

Changes v3->v4:
 - Move dma-buf based display check into the vfio_display_reset().
   (Gerd)

Changes v2->v3:
 - Limit vfio_display_reset to dma-buf based vfio display. (Gerd)

Changes v1->v2:
 - Use dpy_gfx_update_full() update screen after reset. (Gerd)
 - Remove dpy_gfx_switch_surface(). (Gerd)

Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Message-id: 1524820266-27079-3-git-send-email-tina.zhang@intel.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>

09 Apr, 2018 · 1 commit

vfio-ccw: fix memory leaks in vfio_ccw_realize() · be4d026f

Committed by Greg Kurz on Apr 07, 2018

If the subchannel is already attached or if vfio_get_device() fails, the
code jumps to the 'out_device_err' label and doesn't free the string it
has just allocated.

The code should be reworked so that vcdev->vdev.name only gets set when
the device has been attached, and freed when it is about to be detached.
This could be achieved with the addition of a vfio_ccw_get_device()
function that would be the counterpart of vfio_put_device(). But this is
a more elaborate cleanup that should be done in a follow-up. For now,
let's just add calls to g_free() on the buggy error paths.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <152311222681.203086.8874800175539040298.stgit@bahia>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>

06 Apr, 2018 · 1 commit

vfio: Use a trace point when a RAM section cannot be DMA mapped · 5c086005

Committed by Eric Auger on Apr 04, 2018

Commit 567b5b30 ("vfio/pci: Relax DMA map errors for MMIO regions")
added an error message if a passed memory section address or size
is not aligned to the page size and thus cannot be DMA mapped.

This patch fixes the trace by printing the region name and the
memory region section offset within the address space (instead of
offset_within_region).

We also turn the error_report into a trace event. Indeed, in some
cases the traces can be confusing to non-expert end-users and
lead them to think the use case does not work (whereas it works as before).

This is the case where a BAR is successively mapped at different
GPAs and its sections are not compatible with DMA mapping. The listener
is called several times and traces are issued for each intermediate
mapping.  The end-user cannot easily match those GPAs against the
final GPA output by lspci. So let's keep this information for
informed users. In the mid term, the plan is to advise the user about
BAR relocation relevance.

Fixes: 567b5b30 ("vfio/pci: Relax DMA map errors for MMIO regions")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

14 Mar, 2018 · 7 commits

ppc/spapr, vfio: Turn off MSIX emulation for VFIO devices · fcad0d21

Committed by Alexey Kardashevskiy on Mar 13, 2018

This adds a possibility for the platform to tell VFIO not to emulate MSIX
so MMIO memory regions do not get split into chunks in flatview and
the entire page can be registered as a KVM memory slot and make direct
MMIO access possible for the guest.

This enables the entire MSIX BAR mapping to the guest for the pseries
platform in order to achieve the maximum MMIO performance for certain
devices.

Tested on:
LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio-pci: Allow mmap of MSIX BAR · ae0215b2

Committed by Alexey Kardashevskiy on Mar 13, 2018

At the moment we unconditionally avoid mapping MSIX data of a BAR and
emulate MSIX table in QEMU. However it is 1) not always necessary as
a platform may provide a paravirt interface for MSIX configuration;
2) can affect the speed of MMIO access by emulating them in QEMU when
frequently accessed registers share same system page with MSIX data,
this is particularly a problem for systems with the page size bigger
than 4KB.

A new capability - VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - has been added
to the kernel [1] which tells the userspace that mapping of the MSIX data
is possible now. This makes use of it so from now on QEMU tries mapping
the entire BAR as a whole and emulate MSIX on top of that.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Relax DMA map errors for MMIO regions · 567b5b30

Committed by Alexey Kardashevskiy on Mar 13, 2018

At the moment if vfio_memory_listener is registered in the system memory
address space, it maps/unmaps every RAM memory region for DMA.
It expects system page size aligned memory sections so vfio_dma_map
would not fail and so far this has been the case. A mapping failure
would be fatal. A side effect of such behavior is that some MMIO pages
would silently not be mapped.

However, we are going to change MSIX BAR handling so we will end up having
non-aligned sections in vfio_memory_listener (more details are in
the next patch) and vfio_dma_map will exit QEMU.

In order to avoid fatal failures on what previously was not a failure and
was just silently ignored, this checks the section alignment to
the smallest supported IOMMU page size and prints an error if not aligned;
it also prints an error if vfio_dma_map failed despite the page size check.
Both errors are not fatal; only MMIO RAM regions are checked
(aka "RAM device" regions).

If the amount of errors printed is overwhelming, the MSIX relocation
could be used to avoid excessive error output.

This is unlikely to cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: Fix Int128 bit ops]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
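
The alignment check against the smallest supported IOMMU page size can be sketched as below. `section_is_mappable` is a hypothetical name, and the supported page sizes are assumed to arrive as a non-zero bitmask in which bit N means "2^N bytes is supported".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper. The smallest supported page size is the lowest
 * set bit of `pgsizes`. A section can be DMA mapped only if both its
 * start address and its size are multiples of that page size.
 */
static bool section_is_mappable(uint64_t pgsizes, uint64_t addr, uint64_t size)
{
    uint64_t min_pgsize = pgsizes & (~pgsizes + 1);  /* lowest set bit */
    uint64_t pgmask = min_pgsize - 1;

    return ((addr | size) & pgmask) == 0;
}
```

In the spirit of the commit, a caller would report a non-fatal error and skip the mapping when this returns false, instead of aborting.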

vfio/display: adding dmabuf support · 8b818e05

Committed by Gerd Hoffmann on Mar 13, 2018

Wire up dmabuf-based display.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio/display: adding region support · 00195ba7

Committed by Gerd Hoffmann on Mar 13, 2018

Wire up region-based display.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed By: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio/display: core & wireup · a9994687

Committed by Gerd Hoffmann on Mar 13, 2018

Infrastructure for display support.  Must be enabled
using 'display' property.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed By: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio/common: cleanup in vfio_region_finalize · 92f86bff

Committed by Gerd Hoffmann on Mar 13, 2018

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

08 Mar, 2018 · 1 commit

vfio-ccw: license text should indicate GPL v2 or later · 08b824aa

Committed by Cornelia Huck on Feb 27, 2018

The license text currently specifies "any version" of the GPL. It
is unlikely that GPL v1 was ever intended; change this to the
standard "or any later version" text.

Cc: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Cc: Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
Cc: Pierre Morel <pmorel@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Acked-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>

06 Mar, 2018 · 1 commit

use g_path_get_basename instead of basename · 3e015d81

Committed by Julia Suvorova on Mar 01, 2018

basename(3) and dirname(3) modify their argument and may return
pointers to statically allocated memory which may be overwritten by
subsequent calls.
g_path_get_basename and g_path_get_dirname have no such issues and
are therefore preferable.

Signed-off-by: Julia Suvorova <jusual@mail.ru>
Message-Id: <1519888086-4207-1-git-send-email-jusual@mail.ru>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
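
To illustrate why a non-mutating, allocating interface is safer, here is a plain C sketch of the same idea. `basename_copy` is a hypothetical helper written for this example; it is not GLib's implementation and is simpler than g_path_get_basename (for instance, it does not strip trailing slashes).

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated copy of the last path
 * component. The input string is never modified and the result does
 * not live in static storage, so later calls cannot overwrite it.
 * The caller frees the result. Returns NULL if allocation fails.
 */
static char *basename_copy(const char *path)
{
    const char *slash = strrchr(path, '/');
    const char *base = slash ? slash + 1 : path;
    size_t len = strlen(base) + 1;
    char *copy = malloc(len);

    if (copy) {
        memcpy(copy, base, len);
    }
    return copy;
}
```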

09 Feb, 2018 · 2 commits

Move include qemu/option.h from qemu-common.h to actual users · 922a01a0

Committed by Markus Armbruster on Feb 01, 2018

qemu-common.h includes qemu/option.h, but most places that include the
former don't actually need the latter.  Drop the include, and add it
to the places that actually need it.

While there, drop superfluous includes of both headers, and
separate #include from file comment with a blank line.

This cleanup makes the number of objects depending on qemu/option.h
drop from 4545 (out of 4743) to 284 in my "build everything" tree.

Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20180201111846.21846-20-armbru@redhat.com>
[Semantic conflict with commit bdd6a90a in block/nvme.c resolved]

pci: removed the is_express field since a uniform interface was inserted · d61a363d

Committed by Yoni Bettan on Jan 16, 2018

According to Eduardo Habkost's commit fd3b02c8, all PCIe devices now implement
INTERFACE_PCIE_DEVICE, so we don't need the is_express field anymore.

Devices that implement only INTERFACE_PCIE_DEVICE (is_express == 1)
or
devices that implement only INTERFACE_CONVENTIONAL_PCI_DEVICE (is_express == 0)
were not affected by the change.

The only devices that were affected are those that are hybrid and also
had (is_express == 1), therefore only:
  - hw/vfio/pci.c
  - hw/usb/hcd-xhci.c
  - hw/xen/xen_pt.c

For those 3 I made sure that QEMU_PCI_CAP_EXPRESS is on in instance_init().

Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Yoni Bettan <ybettan@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

07 Feb, 2018 · 9 commits

vfio: listener unregister before unset container · 36968626

Committed by Peter Xu on Jan 22, 2018

After the next patch, listener unregister will need the container to be
alive.  Let's move this unregister phase before the unset container step,
since that operation will free the backend container in the kernel;
otherwise we'll get these errors after the next patch:

qemu-system-x86_64: VFIO_UNMAP_DMA: -22
qemu-system-x86_64: vfio_dma_unmap(0x559bf53a4590, 0x0, 0xa0000) = -22 (Invalid argument)

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20180122060244.29368-4-peterx@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

vfio/pci: Add option to disable GeForce quirks · db32d0f4

Committed by Alex Williamson on Feb 06, 2018

These quirks are necessary for GeForce, but not for Quadro/GRID/Tesla
assignment.  Leaving them enabled is fully functional and provides the
most compatibility, but due to the unique NVIDIA MSI ACK behavior[1],
it also introduces latency in re-triggering the MSI interrupt.  This
overhead is typically negligible, but has been shown to adversely
affect some (very) high interrupt rate applications.  This adds the
vfio-pci device option "x-no-geforce-quirks=" which can be set to
"on" to disable this additional overhead.

A follow-on optimization for GeForce might be to make use of an
ioeventfd to allow KVM to trigger an irqfd in the kernel vfio-pci
driver, avoiding the bounce through userspace to handle this device
write.

[1] Background: the NVIDIA driver has been observed to issue a write
to the MMIO mirror of PCI config space in BAR0 in order to allow the
MSI interrupt for the device to retrigger.  Older reports indicated a
write of 0xff to the (read-only) MSI capability ID register, while
more recently a write of 0x0 is observed at config space offset 0x704,
non-architected, extended config space of the device (BAR0 offset
0x88704).  Virtualization of this range is only required for GeForce.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio/common: Remove redundant copy of local variable · a5b04f7c

Committed by Alexey Kardashevskiy on Feb 06, 2018

There is already @hostwin in vfio_listener_region_add() so there is no
point in having the other one.

Fixes: 2e4109de ("vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

hw/vfio/platform: Init the interrupt mutex · 89202c6f

Committed by Eric Auger on Feb 06, 2018

Add the initialization of the mutex protecting the interrupt list.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

vfio/pci: Allow relocating MSI-X MMIO · 89d5202e

Committed by Alex Williamson on Feb 06, 2018

Recently proposed vfio-pci kernel changes (v4.16) remove the
restriction preventing userspace from mmap'ing PCI BARs in areas
overlapping the MSI-X vector table.  This change is primarily intended
to benefit host platforms which make use of system page sizes larger
than the PCI spec recommendation for alignment of MSI-X data
structures (ie. not x86_64).  In the case of POWER systems, the SPAPR
spec requires the VM to program MSI-X using hypercalls, rendering the
MSI-X vector table unused in the VM view of the device.  However,
ARM64 platforms also support 64KB pages and rely on QEMU emulation of
MSI-X.  Regardless of the kernel driver allowing mmaps overlapping
the MSI-X vector table, emulation of the MSI-X vector table also
prevents direct mapping of device MMIO spaces overlapping this page.
Thanks to the fact that PCI devices have a standard self discovery
mechanism, we can try to resolve this by relocating the MSI-X data
structures, either by creating a new PCI BAR or extending an existing
BAR and updating the MSI-X capability for the new location.  There's
even a very slim chance that this could benefit devices which do not
adhere to the PCI spec alignment guidelines on x86_64 systems.

This new x-msix-relocation option accepts the following choices:

  off: Disable MSI-X relocation, use native device config (default)
  auto: Use a known good combination for the platform/device (none yet)
  bar0..bar5: Specify the target BAR for MSI-X data structures

If compatible, the target BAR will either be created or extended and
the new portion will be used for MSI-X emulation.

The first obvious user question with this option is how to determine
whether a given platform and device might benefit from this option.
In most cases, the answer is that it won't, especially on x86_64.
Devices often dedicate an entire BAR to MSI-X and therefore no
performance sensitive registers overlap the MSI-X area.  Take for
example:

# lspci -vvvs 0a:00.0
0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
	...
	Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
	Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
	...
	Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000

This device uses the 16K bar3 for MSI-X with the vector table at
offset zero and the pending bits array at offset 8K, fully honoring
the PCI spec alignment guidance.  The data sheet specifically refers
to this as an MSI-X BAR.  This device would not see a benefit from
MSI-X relocation regardless of the platform, regardless of the page
size.

However, here's another example:

# lspci -vvvs 02:00.0
02:00.0 Serial Attached SCSI controller: xxxxxxxx
	...
	Region 0: I/O ports at c000 [size=256]
	Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
	Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
	...
	Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=1 offset=0000e000
		PBA: BAR=1 offset=0000f000

Here the MSI-X data structures are placed on separate 4K pages at the
end of a 64KB BAR.  If our host page size is 4K, we're likely fine,
but at 64KB page size, MSI-X emulation at that location prevents the
entire BAR from being directly mapped into the VM address space.
Overlapping performance sensitive registers then starts to be a very
likely scenario on such a platform.  At this point, the user could
enable tracing on vfio_region_read and vfio_region_write to determine
more conclusively if device accesses are being trapped through QEMU.
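
The page-size effect described above can be sketched as a small
calculation.  This is an illustrative sketch only, not the QEMU or
kernel implementation: the helper names (page_floor, page_ceil,
mappable_bytes) are hypothetical, only the vector table is modelled
(the PBA is ignored for simplicity), and it relies on each MSI-X
vector table entry being 16 bytes.

```c
#include <stdint.h>

/* Hypothetical helpers: round to a power-of-two page size. */
static uint64_t page_floor(uint64_t x, uint64_t page)
{
    return x & ~(page - 1);
}

static uint64_t page_ceil(uint64_t x, uint64_t page)
{
    return (x + page - 1) & ~(page - 1);
}

/*
 * Bytes of a BAR that could still be directly mapped once every host
 * page touching the MSI-X vector table is trapped for emulation.
 * Simplification: the PBA is not considered here.
 */
static uint64_t mappable_bytes(uint64_t bar_size, uint64_t table_off,
                               unsigned entries, uint64_t page)
{
    uint64_t start = page_floor(table_off, page);
    uint64_t end = page_ceil(table_off + entries * 16ULL, page);

    if (end > bar_size) {
        end = bar_size;   /* a host page may be larger than the BAR */
    }
    return bar_size - (end - start);
}
```

For the second example device (64KB bar1, vector table at offset
0xe000, Count=16), mappable_bytes() yields 0xf000 with 4K host pages
and 0 with 64KB host pages, matching the reasoning above.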

Upon finding a device and platform in need of MSI-X relocation, the
next problem is how to choose the target PCI BAR to host the MSI-X data
structures.  A few key rules to keep in mind for this selection
include:

 * There are only 6 BAR slots, bar0..bar5
 * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
 * PCI BARs are always a power of 2 in size, extending == doubling
 * The maximum size of a 32-bit BAR is 2GB
 * MSI-X data structures must reside in an MMIO BAR

Using these rules, we can evaluate each BAR of the second example
device above as follows:

 bar0: I/O port BAR, incompatible with MSI-X tables
 bar1: BAR could be extended, incurring another 64KB of MMIO
 bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
 bar3: BAR could be extended, incurring another 256KB of MMIO
 bar4: Unavailable, bar3 is 64-bit, this register is used by bar3
 bar5: Available, empty BAR, minimum additional MMIO
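
The per-BAR evaluation above can be sketched in C as follows.  This is
a hedged illustration of the listed rules, not the code added by this
patch: the types (bar_kind, bar_desc), the function msix_reloc_cost()
and the min_new_bar parameter (the size assumed for a newly created
BAR) are all hypothetical names introduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one BAR slot, as read from lspci. */
enum bar_kind { BAR_EMPTY, BAR_IO, BAR_MEM32, BAR_MEM64 };

struct bar_desc {
    enum bar_kind kind;
    uint64_t size;      /* current size in bytes, 0 if empty */
};

/*
 * Returns true and sets *cost to the additional MMIO (in bytes) if
 * MSI-X could be relocated to 'slot'.  The upper half of a 64-bit BAR
 * is represented as BAR_EMPTY in the slot after a BAR_MEM64 entry.
 */
static bool msix_reloc_cost(const struct bar_desc bars[6], int slot,
                            uint64_t min_new_bar, uint64_t *cost)
{
    if (slot < 0 || slot > 5) {
        return false;                 /* only bar0..bar5 exist */
    }
    if (slot > 0 && bars[slot - 1].kind == BAR_MEM64) {
        return false;                 /* register used by previous BAR */
    }

    switch (bars[slot].kind) {
    case BAR_IO:
        return false;                 /* MSI-X needs an MMIO BAR */
    case BAR_EMPTY:
        *cost = min_new_bar;          /* create a new, minimal BAR */
        return true;
    case BAR_MEM32:
        if (bars[slot].size > (1ULL << 30)) {
            return false;             /* doubling would exceed 2GB */
        }
        *cost = bars[slot].size;      /* extending == doubling */
        return true;
    case BAR_MEM64:
        *cost = bars[slot].size;      /* extending == doubling */
        return true;
    }
    return false;
}
```

Applied to the second example device, this rejects bar0, bar2 and
bar4, and reports costs of 64KB for bar1, 256KB for bar3 and
min_new_bar for bar5.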

A secondary optimization we might wish to make in relocating MSI-X
is to minimize the additional MMIO required for the device, therefore
we might test the available choices in order of preference as bar5,
bar1, and finally bar3.  The original proposal for this feature
included an 'auto' option which would choose bar5 in this case, but
various drivers have been found that make assumptions about the
properties of the "first" BAR or the size of BARs such that there
appears to be no foolproof automatic selection available, requiring
known good combinations to be sourced from users.  This patch is
pre-enabled for an 'auto' selection making use of a validated lookup
table, but no entries are yet identified.

Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

89d5202e

vfio/pci: Emulate BARs · 04f336b0

Committed by Alex Williamson on Feb 06, 2018

The kernel provides similar emulation of PCI BAR register access to
QEMU, so up until now we've used that for things like BAR sizing and
storing the BAR address.  However, if we intend to resize BARs or add
BARs that don't exist on the physical device, we need to switch to the
pure QEMU emulation of the BAR.

Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

04f336b0

vfio/pci: Add base BAR MemoryRegion · 3a286732

Committed by Alex Williamson on Feb 06, 2018

Add one more layer to our stack of MemoryRegions, this base region
allows us to register BARs independently of the vfio region or to
extend the size of BARs which do map to a region. This will be
useful when we want hypervisor defined BARs or sections of BARs,
for purposes such as relocating MSI-X emulation. We therefore call
msix_init() based on this new base MemoryRegion, while the quirks,
which only modify regions, still operate on those sub-MemoryRegions.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

3a286732

vfio/pci: Fixup VFIOMSIXInfo comment · edd09278

Committed by Alex Williamson on Feb 06, 2018

The fields were removed in the referenced commit, but the comment
still mentions them.

Fixes: 2fb9636e ("vfio-pci: Remove unused fields from VFIOMSIXInfo")
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

edd09278

vfio/spapr: Use iommu memory region's get_attr() · 07bc681a

Committed by Alexey Kardashevskiy on Feb 06, 2018

In order to enable TCE operations support in KVM, we have to inform
the KVM about VFIO groups being attached to specific LIOBNs. The KVM
already knows about VFIO groups, the only bit missing is which
in-kernel TCE table (the one with user visible TCEs) should update
the attached groups. There is a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE
attribute of the VFIO KVM device which receives a groupfd/tablefd couple.

This uses a new memory_region_iommu_get_attr() helper to get the IOMMU fd
and calls KVM to establish the link.

As get_attr() is not implemented yet, this should cause no behavioural
change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

07bc681a