提交 · c4600d5d417ea13e0f1cc047b227a2b5b0e694f5 · openeuler / qemu

22 5月, 2019 2 次提交

pci: msix: move 'MSIX_CAP_LENGTH' to header file · 2d9574bd

由 Li Qiang 提交于 5月 21, 2019

'MSIX_CAP_LENGTH' is defined in two .c file. Move it
to hw/pci/msix.h file to reduce duplicated code.

CC: qemu-trivial@nongnu.org
Signed-off-by: NLi Qiang <liq3ea@163.com>
Message-Id: <20190521151543.92274-5-liq3ea@163.com>
Acked-by: NAlex Williamson <alex.williamson@redhat.com>
Reviewed-by: NPhilippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: NEric Auger <eric.auger@redhat.com>
Signed-off-by: NLaurent Vivier <laurent@vivier.eu>

2d9574bd

vfio: pci: make "vfio-pci-nohotplug" as MACRO · 0c0c8f8a

由 Li Qiang 提交于 5月 21, 2019

The QOMConventions recommends we should use TYPE_FOO
for a TypeInfo's name. Though "vfio-pci-nohotplug" is not
used in other parts, for consistency we should make this change.

CC: qemu-trivial@nongnu.org
Signed-off-by: NLi Qiang <liq3ea@163.com>
Message-Id: <20190521151543.92274-2-liq3ea@163.com>
Acked-by: NAlex Williamson <alex.williamson@redhat.com>
Reviewed-by: NPhilippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: NEric Auger <eric.auger@redhat.com>
Signed-off-by: NLaurent Vivier <laurent@vivier.eu>

0c0c8f8a

26 4月, 2019 1 次提交

spapr: Support NVIDIA V100 GPU with NVLink2 · ec132efa

由 Alexey Kardashevskiy 提交于 3月 12, 2019

NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
implements special regions for such GPUs and emulates an NVLink bridge.
NVLink2-enabled POWER9 CPUs also provide address translation services
which includes an ATS shootdown (ATSD) register exported via the NVLink
bridge device.

This adds a quirk to VFIO to map the GPU memory and create an MR;
the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
this to get the MR and map it to the system address space.
Another quirk does the same for ATSD.

This adds additional steps to sPAPR PHB setup:

1. Search for specific GPUs and NPUs, collect findings in
sPAPRPHBState::nvgpus, manage system address space mappings;

2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
"memory-block", "link-speed" to advertise the NVLink2 function to
the guest;

3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;

4. Add new memory blocks (with extra "linux,memory-usable" to prevent
the guest OS from accessing the new memory until it is onlined) and
npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
uses it for link discovery.

This allocates space for GPU RAM and ATSD like we do for MMIOs by
adding 2 new parameters to the phb_placement() hook. Older machine types
set these to zero.

This puts new memory nodes in a separate NUMA node to as the GPU RAM
needs to be configured equally distant from any other node in the system.
Unlike the host setup which assigns numa ids from 255 downwards, this
adds new NUMA nodes after the user configures nodes or from 1 if none
were configured.

This adds requirement similar to EEH - one IOMMU group per vPHB.
The reason for this is that ATSD registers belong to a physical NPU
so they cannot invalidate translations on GPUs attached to another NPU.
It is guaranteed by the host platform as it does not mix NVLink bridges
or GPUs from different NPU in the same IOMMU group. If more than one
IOMMU group is detected on a vPHB, this disables ATSD support for that
vPHB and prints a warning.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
[aw: for vfio portions]
Acked-by: NAlex Williamson <alex.williamson@redhat.com>
Message-Id: <20190312082103.130561-1-aik@ozlabs.ru>
Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>

ec132efa

19 4月, 2019 1 次提交

vfio: Report warnings with warn_report(), not error_printf() · 8f8f5885

由 Markus Armbruster 提交于 4月 17, 2019

Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: NMarkus Armbruster <armbru@redhat.com>
Message-Id: <20190417190641.26814-8-armbru@redhat.com>
Acked-by: NAlex Williamson <alex.williamson@redhat.com>

8f8f5885

12 3月, 2019 1 次提交

vfio/display: add xres + yres properties · c62a0c7c

由 Gerd Hoffmann 提交于 3月 11, 2019

This allows configure the display resolution which the vgpu should use.
The information will be passed to the guest using EDID, so the mdev
driver must support the vfio edid region for this to work.
Signed-off-by: NGerd Hoffmann <kraxel@redhat.com>
Reviewed-by: NLiam Merwick <liam.merwick@oracle.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

c62a0c7c

24 1月, 2019 1 次提交

trace: forbid use of %m in trace event format strings · 772f1b37

由 Daniel P. Berrangé 提交于 1月 23, 2019

The '%m' format instructs glibc's printf()/syslog() implementation to
insert the contents of strerror(errno). Since this is a glibc extension
it should generally be avoided in QEMU due to need for portability to a
variety of platforms.

Even though vfio is Linux-only code that could otherwise use "%m", it
must still be avoided in trace-events files because several of the
backends do not use the format string and so this error information is
invisible to them.

The errno string value should be given as an explicit trace argument
instead, making it accessible to all backends. This also allows it to
work correctly with future patches that use the format string with
systemtap's simple printf code.
Reviewed-by: NEric Blake <eblake@redhat.com>
Signed-off-by: NDaniel P. Berrangé <berrange@redhat.com>
Message-id: 20190123120016.4538-4-berrange@redhat.com
Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>

772f1b37

20 12月, 2018 2 次提交

vfio/pci: Remove PCIe Link Status emulation · d26e5438

由 Alex Williamson 提交于 12月 12, 2018

Now that the downstream port will virtually negotiate itself to the
link status of the downstream device, we can remove this emulation.
It's not clear that it was every terribly useful anyway.
Tested-by: NGeoffrey McRae <geoff@hostfission.com>
Reviewed-by: NEric Auger <eric.auger@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
Reviewed-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>

d26e5438

pcie: Create enums for link speed and width · d96a0ac7

由 Alex Williamson 提交于 12月 12, 2018

In preparation for reporting higher virtual link speeds and widths,
create enums and macros to help us manage them.

Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Tested-by: NGeoffrey McRae <geoff@hostfission.com>
Reviewed-by: NPhilippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: NEric Auger <eric.auger@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
Reviewed-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>

d96a0ac7

19 10月, 2018 3 次提交

vfio: Clean up error reporting after previous commit · c3b8e3e0

由 Markus Armbruster 提交于 10月 17, 2018

The previous commit changed vfio's warning messages from

    vfio warning: DEV-NAME: Could not frobnicate

to

    warning: vfio DEV-NAME: Could not frobnicate

To match this change, change error messages from

    vfio error: DEV-NAME: On fire

to

    vfio DEV-NAME: On fire

Note the loss of "error".  If we think marking error messages that way
is a good idea, we should mark *all* error messages, i.e. make
error_report() print it.

Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: NMarkus Armbruster <armbru@redhat.com>
Acked-by: NAlex Williamson <alex.williamson@redhat.com>
Message-Id: <20181017082702.5581-7-armbru@redhat.com>

c3b8e3e0

vfio: Use warn_report() & friends to report warnings · e1eb292a

由 Markus Armbruster 提交于 10月 17, 2018

The vfio code reports warnings like

    error_report(WARN_PREFIX "Could not frobnicate", DEV-NAME);

where WARN_PREFIX is defined so the message comes out as

    vfio warning: DEV-NAME: Could not frobnicate

This usage predates the introduction of warn_report() & friends in
commit 97f40301.  It's time to convert to that interface.  Since
these functions already prefix the message with "warning: ", replace
WARN_PREFIX by VFIO_MSG_PREFIX, so the messages come out like

    warning: vfio DEV-NAME: Could not frobnicate

The next commit will replace ERR_PREFIX.

Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: NMarkus Armbruster <armbru@redhat.com>
Acked-by: NAlex Williamson <alex.williamson@redhat.com>
Reviewed-by: NEric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-6-armbru@redhat.com>

e1eb292a

error: Fix use of error_prepend() with &error_fatal, &error_abort · 4b576648

由 Markus Armbruster 提交于 10月 17, 2018

From include/qapi/error.h:

  * Pass an existing error to the caller with the message modified:
  *     error_propagate(errp, err);
  *     error_prepend(errp, "Could not frobnicate '%s': ", name);

Fei Li pointed out that doing error_propagate() first doesn't work
well when @errp is &error_fatal or &error_abort: the error_prepend()
is never reached.

Since I doubt fixing the documentation will stop people from getting
it wrong, introduce error_propagate_prepend(), in the hope that it
lures people away from using its constituents in the wrong order.
Update the instructions in error.h accordingly.

Convert existing error_prepend() next to error_propagate to
error_propagate_prepend().  If any of these get reached with
&error_fatal or &error_abort, the error messages improve.  I didn't
check whether that's the case anywhere.

Cc: Fei Li <fli@suse.com>
Signed-off-by: NMarkus Armbruster <armbru@redhat.com>
Reviewed-by: NPhilippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: NEric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-2-armbru@redhat.com>

4b576648

16 10月, 2018 2 次提交

vfio-pci: make vfio-pci device more QOM conventional · 2683ccd5

由 Li Qiang 提交于 10月 15, 2018

Define a TYPE_VFIO_PCI and drop DO_UPCAST.
Signed-off-by: NLi Qiang <liq3ea@gmail.com>
Reviewed-by: NPhilippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

2683ccd5

hw/vfio/display: add ramfb support · b290659f

由 Gerd Hoffmann 提交于 10月 15, 2018

So we have a boot display when using a vgpu as primary display.

ramfb depends on a fw_cfg file.  fw_cfg files can not be added and
removed at runtime, therefore a ramfb-enabled vfio device can't be
hotplugged.

Add a nohotplug variant of the vfio-pci device (as child class).  Add
the ramfb property to the nohotplug variant only.  So to enable the vgpu
display with boot support use this:

  -device vfio-pci-nohotplug,display=on,ramfb=on,sysfsdev=...
Signed-off-by: NGerd Hoffmann <kraxel@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

b290659f

24 8月, 2018 1 次提交

vfio/pci: Handle subsystem realpath() returning NULL · a1c0f886

由 Alex Williamson 提交于 8月 23, 2018

Fix error reported by Coverity where realpath can return NULL,
resulting in a segfault in strcmp().  This should never happen given
that we're working through regularly structured sysfs paths, but
trivial enough to easily avoid.

Fixes: 238e9172 ("vfio/ccw/pci: Allow devices to opt-in for ballooning")
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

a1c0f886

17 8月, 2018 1 次提交

vfio/ccw/pci: Allow devices to opt-in for ballooning · 238e9172

由 Alex Williamson 提交于 8月 17, 2018

If a vfio assigned device makes use of a physical IOMMU, then memory
ballooning is necessarily inhibited due to the page pinning, lack of
page level granularity at the IOMMU, and sufficient notifiers to both
remove the page on balloon inflation and add it back on deflation.
However, not all devices are backed by a physical IOMMU. In the case
of mediated devices, if a vendor driver is well synchronized with the
guest driver, such that only pages actively used by the guest driver
are pinned by the host mdev vendor driver, then there should be no
overlap between pages available for the balloon driver and pages
actively in use by the device. Under these conditions, ballooning
should be safe.

vfio-ccw devices are always mediated devices and always operate under
the constraints above. Therefore we can consider all vfio-ccw devices
as balloon compatible.

The situation is far from straightforward with vfio-pci. These
devices can be physical devices with physical IOMMU backing or
mediated devices where it is unknown whether a physical IOMMU is in
use or whether the vendor driver is well synchronized to the working
set of the guest driver. The safest approach is therefore to assume
all vfio-pci devices are incompatible with ballooning, but allow user
opt-in should they have further insight into mediated devices.
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

238e9172

12 7月, 2018 1 次提交

vfio/pci: do not set the PCIDevice 'has_rom' attribute · 26c0ae56

由 Cédric Le Goater 提交于 7月 11, 2018

PCI devices needing a ROM allocate an optional MemoryRegion with
pci_add_option_rom(). pci_del_option_rom() does the cleanup when the
device is destroyed. The only action taken by this routine is to call
vmstate_unregister_ram() which clears the id string of the optional
ROM RAMBlock and now, also flags the RAMBlock as non-migratable. This
was recently added by commit b895de50 ("migration: discard
non-migratable RAMBlocks"), .

VFIO devices do their own loading of the PCI option ROM in
vfio_pci_size_rom(). The memory region is switched to an I/O region
and the PCI attribute 'has_rom' is set but the RAMBlock of the ROM
region is not allocated. When the associated PCI device is deleted,
pci_del_option_rom() calls vmstate_unregister_ram() which tries to
flag a NULL RAMBlock, leading to a SEGV.

It seems that 'has_rom' was set to have memory_region_destroy()
called, but since commit 469b046e ("memory: remove
memory_region_destroy") this is not necessary anymore as the
MemoryRegion is freed automagically.

Remove the PCIDevice 'has_rom' attribute setting in vfio.

Fixes: b895de50 ("migration: discard non-migratable RAMBlocks")
Signed-off-by: NCédric Le Goater <clg@kaod.org>
Reviewed-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

26c0ae56

02 7月, 2018 1 次提交

hw/vfio: Use the IEC binary prefix definitions · e0255bb1

由 Philippe Mathieu-Daudé 提交于 6月 25, 2018

It eases code review, unit is explicit.

Patch generated using:

  $ git grep -E '(1024|2048|4096|8192|(<<|>>).?(10|20|30))' hw/ include/hw/

and modified manually.
Signed-off-by: NPhilippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20180625124238.25339-38-f4bug@amsat.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e0255bb1

05 6月, 2018 4 次提交

vfio/pci: Default display option to "off" · 8151a9c5

由 Alex Williamson 提交于 6月 05, 2018

Commit a9994687 ("vfio/display: core & wireup") added display
support to vfio-pci with the default being "auto", which breaks
existing VMs when the vGPU requires GL support but had no previous
requirement for a GL compatible configuration.  "Off" is the safer
default as we impose no new requirements to VM configurations.

Fixes: a9994687 ("vfio/display: core & wireup")
Cc: qemu-stable@nongnu.org
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

8151a9c5

vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly · 2b1dbd0d

由 Alex Williamson 提交于 6月 05, 2018

With vfio ioeventfd support, we can program vfio-pci to perform a
specified BAR write when an eventfd is triggered.  This allows the
KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
userspace handling for these events.  On the same micro-benchmark
where the ioeventfd got us to almost 90% of performance versus
disabling the GeForce quirks, this gets us to within 95%.
Reviewed-by: NPeter Xu <peterx@redhat.com>
Reviewed-by: NEric Auger <eric.auger@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

2b1dbd0d

vfio/quirks: ioeventfd quirk acceleration · c958c51d

由 Alex Williamson 提交于 6月 05, 2018

The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
in device MMIO space. Normally PCI config space is considered a slow
path and further optimization is unnecessary, however NVIDIA uses a
register here to enable the MSI interrupt to re-trigger. Exiting to
QEMU for this MSI-ACK handling can therefore rate limit our interrupt
handling. Fortunately the MSI-ACK write is easily detected since the
quirk MemoryRegion otherwise has very few accesses, so simply looking
for consecutive writes with the same data is sufficient, in this case
10 consecutive writes with the same data and size is arbitrarily
chosen. We configure the KVM ioeventfd with data match, so there's
no risk of triggering for the wrong data or size, but we do risk that
pathological driver behavior might consume all of QEMU's file
descriptors, so we cap ourselves to 10 ioeventfds for this purpose.

In support of the above, generic ioeventfd infrastructure is added
for vfio quirks. This automatically initializes an ioeventfd list
per quirk, disables and frees ioeventfds on exit, and allows
ioeventfds marked as dynamic to be dropped on device reset. The
rationale for this latter feature is that useful ioeventfds may
depend on specific driver behavior and since we necessarily place a
cap on our use of ioeventfds, a machine reset is a reasonable point
at which to assume a new driver and re-profile.
Reviewed-by: NPeter Xu <peterx@redhat.com>
Reviewed-by: NEric Auger <eric.auger@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

c958c51d

vfio/quirks: Add quirk reset callback · 469d02de

由 Alex Williamson 提交于 6月 05, 2018

Quirks can be self modifying, provide a hook to allow them to cleanup
on device reset if desired.
Reviewed-by: NEric Auger <eric.auger@redhat.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

469d02de

27 4月, 2018 1 次提交

ui: introduce vfio_display_reset · 8983e3e3

由 Tina Zhang 提交于 4月 27, 2018

During guest OS reboot, guest framebuffer is invalid. It will cause
bugs, if the invalid guest framebuffer is still used by host.

This patch is to introduce vfio_display_reset which is invoked
during vfio display reset. This vfio_display_reset function is used
to release the invalid display resource, disable scanout mode and
replace the invalid surface with QemuConsole's DisplaySurafce.

This patch can fix the GPU hang issue caused by gd_egl_draw during
guest OS reboot.

Changes v3->v4:
 - Move dma-buf based display check into the vfio_display_reset().
   (Gerd)

Changes v2->v3:
 - Limit vfio_display_reset to dma-buf based vfio display. (Gerd)

Changes v1->v2:
 - Use dpy_gfx_update_full() update screen after reset. (Gerd)
 - Remove dpy_gfx_switch_surface(). (Gerd)
Signed-off-by: NTina Zhang <tina.zhang@intel.com>
Message-id: 1524820266-27079-3-git-send-email-tina.zhang@intel.com
Signed-off-by: NGerd Hoffmann <kraxel@redhat.com>

8983e3e3

14 3月, 2018 3 次提交

ppc/spapr, vfio: Turn off MSIX emulation for VFIO devices · fcad0d21

由 Alexey Kardashevskiy 提交于 3月 13, 2018

This adds a possibility for the platform to tell VFIO not to emulate MSIX
so MMIO memory regions do not get split into chunks in flatview and
the entire page can be registered as a KVM memory slot and make direct
MMIO access possible for the guest.

This enables the entire MSIX BAR mapping to the guest for the pseries
platform in order to achieve the maximum MMIO preformance for certain
devices.

Tested on:
LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

fcad0d21

vfio-pci: Allow mmap of MSIX BAR · ae0215b2

由 Alexey Kardashevskiy 提交于 3月 13, 2018

At the moment we unconditionally avoid mapping MSIX data of a BAR and
emulate MSIX table in QEMU. However it is 1) not always necessary as
a platform may provide a paravirt interface for MSIX configuration;
2) can affect the speed of MMIO access by emulating them in QEMU when
frequently accessed registers share same system page with MSIX data,
this is particularly a problem for systems with the page size bigger
than 4KB.

A new capability - VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - has been added
to the kernel [1] which tells the userspace that mapping of the MSIX data
is possible now. This makes use of it so from now on QEMU tries mapping
the entire BAR as a whole and emulate MSIX on top of that.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

ae0215b2

vfio/display: core & wireup · a9994687

由 Gerd Hoffmann 提交于 3月 13, 2018

Infrastructure for display support.  Must be enabled
using 'display' property.
Signed-off-by: NGerd Hoffmann <kraxel@redhat.com>
Reviewed By: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

a9994687

06 3月, 2018 1 次提交

use g_path_get_basename instead of basename · 3e015d81

由 Julia Suvorova 提交于 3月 01, 2018

basename(3) and dirname(3) modify their argument and may return
pointers to statically allocated memory which may be overwritten by
subsequent calls.
g_path_get_basename and g_path_get_dirname have no such issues, and
therefore more preferable.
Signed-off-by: NJulia Suvorova <jusual@mail.ru>
Message-Id: <1519888086-4207-1-git-send-email-jusual@mail.ru>
Reviewed-by: NMarc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3e015d81

09 2月, 2018 2 次提交

Move include qemu/option.h from qemu-common.h to actual users · 922a01a0

由 Markus Armbruster 提交于 2月 01, 2018

qemu-common.h includes qemu/option.h, but most places that include the
former don't actually need the latter.  Drop the include, and add it
to the places that actually need it.

While there, drop superfluous includes of both headers, and
separate #include from file comment with a blank line.

This cleanup makes the number of objects depending on qemu/option.h
drop from 4545 (out of 4743) to 284 in my "build everything" tree.
Reviewed-by: NEric Blake <eblake@redhat.com>
Reviewed-by: NPhilippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: NMarkus Armbruster <armbru@redhat.com>
Message-Id: <20180201111846.21846-20-armbru@redhat.com>
[Semantic conflict with commit bdd6a90a in block/nvme.c resolved]

922a01a0

pci: removed the is_express field since a uniform interface was inserted · d61a363d

由 Yoni Bettan 提交于 1月 16, 2018

according to Eduardo Habkost's commit fd3b02c8 all PCIEs now implement
INTERFACE_PCIE_DEVICE so we don't need is_express field anymore.

Devices that implements only INTERFACE_PCIE_DEVICE (is_express == 1)
or
devices that implements only INTERFACE_CONVENTIONAL_PCI_DEVICE (is_express == 0)
where not affected by the change.

The only devices that were affected are those that are hybrid and also
had (is_express == 1) - therefor only:
  - hw/vfio/pci.c
  - hw/usb/hcd-xhci.c
  - hw/xen/xen_pt.c

For those 3 I made sure that QEMU_PCI_CAP_EXPRESS is on in instance_init()
Reviewed-by: NMarcel Apfelbaum <marcel@redhat.com>
Reviewed-by: NEduardo Habkost <ehabkost@redhat.com>
Signed-off-by: NYoni Bettan <ybettan@redhat.com>
Reviewed-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>

d61a363d

07 2月, 2018 4 次提交

vfio/pci: Add option to disable GeForce quirks · db32d0f4

由 Alex Williamson 提交于 2月 06, 2018

These quirks are necessary for GeForce, but not for Quadro/GRID/Tesla
assignment.  Leaving them enabled is fully functional and provides the
most compatibility, but due to the unique NVIDIA MSI ACK behavior[1],
it also introduces latency in re-triggering the MSI interrupt.  This
overhead is typically negligible, but has been shown to adversely
affect some (very) high interrupt rate applications.  This adds the
vfio-pci device option "x-no-geforce-quirks=" which can be set to
"on" to disable this additional overhead.

A follow-on optimization for GeForce might be to make use of an
ioeventfd to allow KVM to trigger an irqfd in the kernel vfio-pci
driver, avoiding the bounce through userspace to handle this device
write.

[1] Background: the NVIDIA driver has been observed to issue a write
to the MMIO mirror of PCI config space in BAR0 in order to allow the
MSI interrupt for the device to retrigger.  Older reports indicated a
write of 0xff to the (read-only) MSI capability ID register, while
more recently a write of 0x0 is observed at config space offset 0x704,
non-architected, extended config space of the device (BAR0 offset
0x88704).  Virtualization of this range is only required for GeForce.
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

db32d0f4

vfio/pci: Allow relocating MSI-X MMIO · 89d5202e

由 Alex Williamson 提交于 2月 06, 2018

Recently proposed vfio-pci kernel changes (v4.16) remove the
restriction preventing userspace from mmap'ing PCI BARs in areas
overlapping the MSI-X vector table.  This change is primarily intended
to benefit host platforms which make use of system page sizes larger
than the PCI spec recommendation for alignment of MSI-X data
structures (ie. not x86_64).  In the case of POWER systems, the SPAPR
spec requires the VM to program MSI-X using hypercalls, rendering the
MSI-X vector table unused in the VM view of the device.  However,
ARM64 platforms also support 64KB pages and rely on QEMU emulation of
MSI-X.  Regardless of the kernel driver allowing mmaps overlapping
the MSI-X vector table, emulation of the MSI-X vector table also
prevents direct mapping of device MMIO spaces overlapping this page.
Thanks to the fact that PCI devices have a standard self discovery
mechanism, we can try to resolve this by relocating the MSI-X data
structures, either by creating a new PCI BAR or extending an existing
BAR and updating the MSI-X capability for the new location.  There's
even a very slim chance that this could benefit devices which do not
adhere to the PCI spec alignment guidelines on x86_64 systems.

This new x-msix-relocation option accepts the following choices:

  off: Disable MSI-X relocation, use native device config (default)
  auto: Use a known good combination for the platform/device (none yet)
  bar0..bar5: Specify the target BAR for MSI-X data structures

If compatible, the target BAR will either be created or extended and
the new portion will be used for MSI-X emulation.

The first obvious user question with this option is how to determine
whether a given platform and device might benefit from this option.
In most cases, the answer is that it won't, especially on x86_64.
Devices often dedicate an entire BAR to MSI-X and therefore no
performance sensitive registers overlap the MSI-X area.  Take for
example:

# lspci -vvvs 0a:00.0
0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
	...
	Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
	Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
	...
	Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000

This device uses the 16K bar3 for MSI-X with the vector table at
offset zero and the pending bits arrary at offset 8K, fully honoring
the PCI spec alignment guidance.  The data sheet specifically refers
to this as an MSI-X BAR.  This device would not see a benefit from
MSI-X relocation regardless of the platform, regardless of the page
size.

However, here's another example:

# lspci -vvvs 02:00.0
02:00.0 Serial Attached SCSI controller: xxxxxxxx
	...
	Region 0: I/O ports at c000 [size=256]
	Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
	Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
	...
	Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=1 offset=0000e000
		PBA: BAR=1 offset=0000f000

Here the MSI-X data structures are placed on separate 4K pages at the
end of a 64KB BAR.  If our host page size is 4K, we're likely fine,
but at 64KB page size, MSI-X emulation at that location prevents the
entire BAR from being directly mapped into the VM address space.
Overlapping performance sensitive registers then starts to be a very
likely scenario on such a platform.  At this point, the user could
enable tracing on vfio_region_read and vfio_region_write to determine
more conclusively if device accesses are being trapped through QEMU.

Upon finding a device and platform in need of MSI-X relocation, the
next problem is how to choose target PCI BAR to host the MSI-X data
structures.  A few key rules to keep in mind for this selection
include:

 * There are only 6 BAR slots, bar0..bar5
 * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
 * PCI BARs are always a power of 2 in size, extending == doubling
 * The maximum size of a 32-bit BAR is 2GB
 * MSI-X data structures must reside in an MMIO BAR

Using these rules, we can evaluate each BAR of the second example
device above as follows:

 bar0: I/O port BAR, incompatible with MSI-X tables
 bar1: BAR could be extended, incurring another 64KB of MMIO
 bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
 bar3: BAR could be extended, incurring another 256KB of MMIO
 bar4: Unavailable, bar3 is 64bit, this register is used by bar3
 bar5: Available, empty BAR, minimum additional MMIO

A secondary optimization we might wish to make in relocating MSI-X
is to minimize the additional MMIO required for the device, therefore
we might test the available choices in order of preference as bar5,
bar1, and finally bar3.  The original proposal for this feature
included an 'auto' option which would choose bar5 in this case, but
various drivers have been found that make assumptions about the
properties of the "first" BAR or the size of BARs such that there
appears to be no foolproof automatic selection available, requiring
known good combinations to be sourced from users.  This patch is
pre-enabled for an 'auto' selection making use of a validated lookup
table, but no entries are yet identified.
Tested-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NEric Auger <eric.auger@redhat.com>
Tested-by: NEric Auger <eric.auger@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

89d5202e

vfio/pci: Emulate BARs · 04f336b0

由 Alex Williamson 提交于 2月 06, 2018

The kernel provides similar emulation of PCI BAR register access to
QEMU, so up until now we've used that for things like BAR sizing and
storing the BAR address.  However, if we intend to resize BARs or add
BARs that don't exist on the physical device, we need to switch to the
pure QEMU emulation of the BAR.
Tested-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: NEric Auger <eric.auger@redhat.com>
Tested-by: NEric Auger <eric.auger@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

04f336b0

vfio/pci: Add base BAR MemoryRegion · 3a286732

由 Alex Williamson 提交于 2月 06, 2018

Add one more layer to our stack of MemoryRegions, this base region
allows us to register BARs independently of the vfio region or to
extend the size of BARs which do map to a region. This will be
useful when we want hypervisor defined BARs or sections of BARs,
for purposes such as relocating MSI-X emulation. We therefore call
msix_init() based on this new base MemoryRegion, while the quirks,
which only modify regions still operate on those sub-MemoryRegions.
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

3a286732

06 12月, 2017 1 次提交

pci: Eliminate redundant PCIDevice::bus pointer · fd56e061

由 David Gibson 提交于 11月 29, 2017

The bus pointer in PCIDevice is basically redundant with QOM information.
It's always initialized to the qdev_get_parent_bus(), the only difference
is the type.

Therefore this patch eliminates the field, instead creating a pci_get_bus()
helper to do the type mangling to derive it conveniently from the QOM
Device object underneath.
Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
Reviewed-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
Reviewed-by: NEduardo Habkost <ehabkost@redhat.com>
Reviewed-by: NMarcel Apfelbaum <marcel@redhat.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>

fd56e061

15 10月, 2017 1 次提交

pci: Add interface names to hybrid PCI devices · a5fa336f

由 Eduardo Habkost 提交于 9月 27, 2017

The following devices support both PCI Express and Conventional
PCI, by including special code to handle the QEMU_PCI_CAP_EXPRESS
flag and/or conditional pcie_endpoint_cap_init() calls:

* vfio-pci (is_express=1, but legacy PCI handled by
  vfio_populate_device())
* vmxnet3 (is_express=0, but PCIe handled by vmxnet3_realize())
* pvscsi (is_express=0, but PCIe handled by pvscsi_realize())
* virtio-pci (is_express=0, but PCIe handled by
  virtio_pci_dc_realize(), and additional legacy PCI code at
  virtio_pci_realize())
* base-xhci (is_express=1, but pcie_endpoint_cap_init() call
  is conditional on pci_bus_is_express(dev->bus)
  * Note that xhci does not clear QEMU_PCI_CAP_EXPRESS like the
    other hybrid devices

Cc: Dmitry Fleytman <dmitry@daynix.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: NEduardo Habkost <ehabkost@redhat.com>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Reviewed-by: NMarcel Apfelbaum <marcel@redhat.com>
Reviewed-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>

a5fa336f

04 10月, 2017 3 次提交

vfio/pci: Add NVIDIA GPUDirect Cliques support · dfbee78d

由 Alex Williamson 提交于 8月 29, 2017

NVIDIA has defined a specification for creating GPUDirect "cliques",
where devices with the same clique ID support direct peer-to-peer DMA.
When running on bare-metal, tools like NVIDIA's p2pBandwidthLatencyTest
(part of cuda-samples) determine which GPUs can support peer-to-peer
based on chipset and topology.  When running in a VM, these tools have
no visibility to the physical hardware support or topology.  This
option allows the user to specify hints via a vendor defined
capability.  For instance:

  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.hostdev0.x-nv-gpudirect-clique=0'/>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.hostdev1.x-nv-gpudirect-clique=1'/>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.hostdev2.x-nv-gpudirect-clique=1'/>
  </qemu:commandline>

This enables two cliques.  The first is a singleton clique with ID 0,
for the first hostdev defined in the XML (note that since cliques
define peer-to-peer sets, singleton clique offer no benefit).  The
subsequent two hostdevs are both added to clique ID 1, indicating
peer-to-peer is possible between these devices.

QEMU only provides validation that the clique ID is valid and applied
to an NVIDIA graphics device, any validation that the resulting
cliques are functional and valid is the user's responsibility.  The
NVIDIA specification allows a 4-bit clique ID, thus valid values are
0-15.
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

dfbee78d

vfio/pci: Add virtual capabilities quirk infrastructure · e3f79f3b

由 Alex Williamson 提交于 8月 29, 2017

If the hypervisor needs to add purely virtual capabilties, give us a
hook through quirks to do that. Note that we determine the maximum
size for a capability based on the physical device, if we insert a
virtual capability, that can change. Therefore if maximum size is
smaller after added virt capabilities, use that.
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

e3f79f3b

vfio/pci: Do not unwind on error · 5b31c822

由 Alex Williamson 提交于 8月 29, 2017

If vfio_add_std_cap() errors then going to out prepends irrelevant
errors for capabilities we haven't attempted to add as we unwind our
recursive stack. Just return error.

Fixes: 7ef165b9 ("vfio/pci: Pass an error object to vfio_add_capabilities")
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

5b31c822

27 7月, 2017 1 次提交

vfio/pci: fix use of freed memory · 96d2c2c5

由 Philippe Mathieu-Daudé 提交于 7月 26, 2017

hw/vfio/pci.c:308:29: warning: Use of memory after it is freed
        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
                            ^~~~

Reported-by: Clang Static Analyzer
Signed-off-by: NPhilippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

96d2c2c5

11 7月, 2017 2 次提交

vfio/pci: Fixup v0 PCIe capabilities · 47985727

由 Alex Williamson 提交于 7月 10, 2017

Intel 82599 VFs report a PCIe capability version of 0, which is
invalid.  The earliest version of the PCIe spec used version 1.  This
causes Windows to fail startup on the device and it will be disabled
with error code 10.  Our choices are either to drop the PCIe cap on
such devices, which has the side effect of likely preventing the guest
from discovering any extended capabilities, or performing a fixup to
update the capability to the earliest valid version.  This implements
the latter.
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

47985727

vfio: Test realized when using VFIOGroup.device_list iterator · 7da624e2

由 Alex Williamson 提交于 7月 10, 2017

VFIOGroup.device_list is effectively our reference tracking mechanism
such that we can teardown a group when all of the device references
are removed. However, we also use this list from our machine reset
handler for processing resets that affect multiple devices. Generally
device removals are fully processed (exitfn + finalize) when this
reset handler is invoked, however if the removal is triggered via
another reset handler (piix4_reset->acpi_pcihp_reset) then the device
exitfn may run, but not finalize. In this case we hit asserts when
we start trying to access PCI helpers since much of the PCI state of
the device is released. To resolve this, add a pointer to the Object
DeviceState in our common base-device and skip non-realized devices
as we iterate.
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

7da624e2