Commit aa2added, authored by Eric Auger, committed by Zheng Zengkai

vfio: Document nested stage control

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I401IF
CVE: NA

------------------------------

The VFIO API was enhanced to support nested stage control: several
new ioctls, one DMA FAULT region and an associated specific IRQ.

Let's document the process to follow to set up nested mode.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Parent b6f29e4d
@@ -239,6 +239,83 @@ group and can access them as follows::
	/* Gratuitous device reset and go... */
	ioctl(device, VFIO_DEVICE_RESET);

IOMMU Dual Stage Control
------------------------

Some IOMMUs support two stages (or levels) of translation. "Stage" is the
ARM term, while "level" is the Intel VT-d term. The two are used
interchangeably in the following text.

This is useful when a virtual IOMMU is exposed to the guest and some
devices are assigned to the guest through VFIO. The guest OS can then use
stage 1 (IOVA -> GPA), while the hypervisor uses stage 2 for VM isolation
(GPA -> HPA).
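
To make the IOVA -> GPA -> HPA chaining concrete, here is a toy model of the two stages in C. Everything in it (`toy_stage`, `toy_translate`, `toy_nested_translate`, the flat one-level table and the 4 KiB page size) is a hypothetical simplification for illustration; real IOMMUs walk multi-level page tables in hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)
#define TOY_INVALID    UINT64_MAX

/* Toy flat table: index = input page number, value = output page number. */
struct toy_stage {
	const uint64_t *pfn;	/* TOY_INVALID marks an unmapped page */
	size_t nr_pages;
};

/* Translate one address through a single stage; TOY_INVALID means a fault. */
uint64_t toy_translate(const struct toy_stage *s, uint64_t addr)
{
	uint64_t page = addr >> TOY_PAGE_SHIFT;

	if (page >= s->nr_pages || s->pfn[page] == TOY_INVALID)
		return TOY_INVALID;
	return (s->pfn[page] << TOY_PAGE_SHIFT) | (addr & TOY_PAGE_MASK);
}

/* Nested translation: stage 1 (guest-owned), then stage 2 (hypervisor-owned). */
uint64_t toy_nested_translate(const struct toy_stage *s1,
			      const struct toy_stage *s2, uint64_t iova)
{
	uint64_t gpa = toy_translate(s1, iova);

	if (gpa == TOY_INVALID)
		return TOY_INVALID;	/* stage 1 fault */
	return toy_translate(s2, gpa);	/* may still fault at stage 2 */
}
```

A fault can come from either stage, which is why stage 1 faults raised by the hardware later have to be reported back to the guest that owns those tables.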

The guest owns the stage 1 page tables and the stage 1 configuration
structures. The hypervisor owns the root configuration structure (for
security reasons), including the stage 2 configuration. This works as long
as the configuration structures and page table formats are compatible between
the virtual IOMMU and the physical IOMMU.

Assuming the HW supports it, this nested mode is selected by choosing the
VFIO_TYPE1_NESTING_IOMMU type through::

	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);

This forces the hypervisor to use stage 2, leaving stage 1 available for
guest usage.
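
A minimal sketch of that selection step, with a support check first, might look as follows. The helper name `select_nested_iommu` is hypothetical; `VFIO_CHECK_EXTENSION`, `VFIO_SET_IOMMU` and `VFIO_TYPE1_NESTING_IOMMU` come from `<linux/vfio.h>`. It assumes `container` is an open `/dev/vfio/vfio` file descriptor to which at least one group has already been added, since VFIO does not accept `VFIO_SET_IOMMU` on an empty container.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Select the nested IOMMU type on an open VFIO container fd.
 * Returns 0 on success or a negative errno value on failure.
 */
int select_nested_iommu(int container)
{
	/* Returns 0 if the type is unsupported, a positive value if it is. */
	int ret = ioctl(container, VFIO_CHECK_EXTENSION,
			VFIO_TYPE1_NESTING_IOMMU);

	if (ret < 0)
		return -errno;
	if (ret == 0)
		return -ENODEV;	/* no nested translation support */

	if (ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU) < 0)
		return -errno;
	return 0;
}
```

Checking the extension first lets the caller fall back to a non-nested IOMMU type instead of failing outright.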

Once groups are attached to the container, the guest stage 1 translation
configuration data can be passed to VFIO by using::

	ioctl(container, VFIO_IOMMU_SET_PASID_TABLE, &pasid_table_info);

This combines the guest stage 1 configuration structure with the hypervisor
stage 2 configuration structure. Stage 1 configuration structures depend on
the IOMMU type.

As the stage 1 translation is fully delegated to the HW, translation faults
encountered during the translation process need to be propagated up to
the virtualizer and re-injected into the guest.

Userspace must be prepared to receive faults. The VFIO-PCI device
exposes one dedicated DMA FAULT region: it contains a ring buffer and
a header that is used to manage the head/tail indices. The region is
identified by the following type/subtype:

- VFIO_REGION_TYPE_NESTED/VFIO_REGION_SUBTYPE_NESTED_DMA_FAULT

The DMA FAULT region exposes a VFIO_REGION_INFO_CAP_DMA_FAULT
region capability that allows userspace to retrieve the ABI version
of the fault records filled by the host.

In addition to that region, userspace can be notified whenever a fault
occurs at the physical level. It can use the VFIO_IRQ_TYPE_NESTED/
VFIO_IRQ_SUBTYPE_DMA_FAULT specific IRQ to attach the eventfd to be
signalled.
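
Attaching the eventfd could be sketched as below, using the generic `VFIO_DEVICE_SET_IRQS` ioctl and `struct vfio_irq_set` from `<linux/vfio.h>`. The helper name `attach_irq_eventfd` is hypothetical, and `irq_index` is assumed to have been resolved by the caller beforehand; how the index of the nested DMA fault IRQ is discovered is not shown here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Create an eventfd and register it as the trigger of one device IRQ.
 * Returns the eventfd on success, -1 on failure.
 */
int attach_irq_eventfd(int device, uint32_t irq_index)
{
	size_t sz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
	struct vfio_irq_set *set;
	int32_t efd;
	int ret;

	efd = eventfd(0, EFD_CLOEXEC);
	if (efd < 0)
		return -1;

	set = calloc(1, sz);
	if (!set) {
		close(efd);
		return -1;
	}

	set->argsz = (uint32_t)sz;
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = irq_index;		/* assumed: looked up by the caller */
	set->start = 0;
	set->count = 1;
	memcpy(set->data, &efd, sizeof(efd));	/* payload: one eventfd */

	ret = ioctl(device, VFIO_DEVICE_SET_IRQS, set);
	free(set);
	if (ret < 0) {
		close(efd);
		return -1;
	}
	return efd;
}
```

The returned descriptor can then be added to a `poll()` or `epoll` loop; a successful `read()` on it indicates that at least one fault was signalled since the previous read.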

The ring buffer containing the fault records can be mmapped. When
userspace consumes a fault from the queue, it should increment
the consumer index to allow new fault records to replace the used ones.

The queue size and the entry size can be retrieved from the header.
The tail (consumer) index should never overshoot the producer index, as in
any other circular buffer scheme. It must also be less than the queue size,
otherwise the change fails.
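
The consumer side of such a ring can be sketched as follows. The `struct fault_ring_hdr` layout, its field names and the helpers `fault_ring_drain` and `count_fault` are assumptions made for illustration and are not the kernel ABI; the real header layout should be taken from the uapi header. Memory barriers and atomic accesses, which a real consumer of a shared mapping would need, are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header layout, for illustration only (not the kernel ABI). */
struct fault_ring_hdr {
	uint32_t entry_size;	/* size of one fault record, in bytes */
	uint32_t nb_entries;	/* queue size, in records */
	uint32_t head;		/* producer index, advanced by the host */
	uint32_t tail;		/* consumer index, advanced by userspace */
};

typedef void (*fault_handler_t)(const void *record, void *opaque);

/* Example handler: only counts the records it is given. */
void count_fault(const void *record, void *opaque)
{
	(void)record;
	++*(unsigned int *)opaque;
}

/*
 * Consume every pending record; 'ring' points at record 0.
 * Returns the number of records consumed.
 */
uint32_t fault_ring_drain(struct fault_ring_hdr *hdr, const uint8_t *ring,
			  fault_handler_t handle, void *opaque)
{
	uint32_t head = hdr->head;	/* snapshot of the producer index */
	uint32_t tail = hdr->tail;
	uint32_t consumed = 0;

	/* Reject inconsistent indices rather than looping forever. */
	if (hdr->nb_entries == 0 || head >= hdr->nb_entries ||
	    tail >= hdr->nb_entries)
		return 0;

	while (tail != head) {		/* never overshoot the producer */
		handle(ring + (size_t)tail * hdr->entry_size, opaque);
		tail = (tail + 1) % hdr->nb_entries;	/* stays below queue size */
		hdr->tail = tail;	/* hand the slot back to the producer */
		consumed++;
	}
	return consumed;
}
```

Publishing the new tail after each record, rather than once at the end, frees slots for the producer as early as possible at the cost of more writes to the shared header.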

When the guest invalidates stage 1 related caches, the invalidations must be
forwarded to the host through::

	ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv_data);

Those invalidations can happen at various granularity levels (page, context, ...).

The ARM SMMU specification introduces another challenge: MSIs are translated by
both the virtual SMMU and the physical SMMU. To build a nested mapping for the
IOVA programmed into the assigned device, the guest needs to pass its IOVA/MSI
doorbell GPA binding to the host. The hypervisor can then build a nested stage 2
binding that eventually translates into the physical MSI doorbell.

This is achieved by calling::

	ioctl(container, VFIO_IOMMU_SET_MSI_BINDING, &guest_binding);

VFIO User API
-------------------------------------------------------------------------------
......