- 24 May 2022, 2 commits
-
-
By Jason Gunthorpe

When the iommu series adding driver_managed_dma was rebased it missed that new VFIO drivers had been added and did not update them too. Without this, vfio will claim the groups are not viable. Add driver_managed_dma to mlx5 and hisi.

Fixes: 70693f47 ("vfio: Set DMA ownership for VFIO devices")
Reported-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-f9dfa642fab0+2b3-vfio_managed_dma_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
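A minimal sketch of what the one-line fix amounts to, assuming a hypothetical variant driver (the real patch touches the mlx5 and hisi pci_driver definitions; all demo_* names here are illustrative):

#include <linux/module.h>
#include <linux/pci.h>

/* Illustrative ID table and stubs; only .driver_managed_dma matters here. */
static const struct pci_device_id demo_vfio_pci_table[] = { {} };

static int demo_vfio_pci_probe(struct pci_dev *pdev,
			       const struct pci_device_id *id)
{
	return -ENODEV;		/* stub */
}

static void demo_vfio_pci_remove(struct pci_dev *pdev)
{
}

static struct pci_driver demo_vfio_pci_driver = {
	.name			= "demo-vfio-pci",
	.id_table		= demo_vfio_pci_table,
	.probe			= demo_vfio_pci_probe,
	.remove			= demo_vfio_pci_remove,
	/* Tell the driver core this driver claims DMA ownership itself (via
	 * VFIO), so the IOMMU layer does not treat the group as unviable. */
	.driver_managed_dma	= true,
};
module_pci_driver(demo_vfio_pci_driver);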
-
By Jason Gunthorpe

Since asserting DMA ownership now causes the group to have its DMA blocked, the iommu layer requires a working iommu. This means the dma_owner APIs cannot be used on the fake groups that VFIO creates. Test for this and avoid calling them. Otherwise asserting DMA ownership will fail for VFIO mdev devices, as a BLOCKING iommu_domain cannot be allocated due to the NULL iommu ops.

Fixes: 0286300e ("iommu: iommu_group_claim_dma_owner() must always assign a domain")
Reported-by: Eric Farman <farman@linux.ibm.com>
Tested-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/0-v1-9cfc47edbcd4+13546-vfio_dma_owner_fix_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
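A hedged sketch of the guard this describes. The iommu_group_claim_dma_owner() prototype is the real one from <linux/iommu.h>; the demo_* struct only mirrors the private vfio_group members this logic touches:

#include <linux/iommu.h>
#include <linux/fs.h>

/* Illustrative mirror of the relevant (vfio.c-private) vfio_group members. */
enum demo_group_type { DEMO_IOMMU, DEMO_EMULATED_IOMMU, DEMO_NO_IOMMU };

struct demo_group {
	enum demo_group_type type;
	struct iommu_group *iommu_group;
};

/*
 * Only a group backed by a real IOMMU asserts DMA ownership; the fake
 * groups vfio creates for mdev/no-iommu have NULL iommu ops, so a
 * BLOCKING domain cannot be allocated and the claim must be skipped.
 */
static int demo_claim_dma_ownership(struct demo_group *group, struct file *f)
{
	if (group->type != DEMO_IOMMU)
		return 0;

	return iommu_group_claim_dma_owner(group->iommu_group, f);
}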
-
- 19 May 2022, 3 commits
-
-
By Abhishek Sahu

Currently, there is very limited power management support in the upstream vfio_pci_core based drivers. If there are no users of the device, then the PCI device is moved into the D3hot state by writing directly into the PCI PM registers. This D3hot state helps save power, but zero power consumption is only possible in the D3cold state. D3cold is not reachable with native PCI PM; it requires interaction with platform firmware, which is system-specific. To reach low power states (including D3cold), the runtime PM framework can be used; it internally interacts with the PCI and platform firmware and puts the device into the lowest possible D-state.

This patch registers vfio_pci_core based drivers with the runtime PM framework.

1. The PCI core framework takes care of most of the runtime PM related work. To enable runtime PM, the PCI driver needs to decrement the usage count and provide at least a 'struct dev_pm_ops'. The runtime suspend/resume callbacks are optional and are needed only if extra handling is required. Since there are now multiple vfio_pci_core based drivers, instead of assigning 'struct dev_pm_ops' in each individual parent driver, vfio_pci_core itself assigns it. There are other drivers where 'struct dev_pm_ops' is assigned inside a core layer (for example, wlcore_probe() and some sound drivers).

2. This patch provides the stub implementation of 'struct dev_pm_ops'. A subsequent patch will provide the runtime suspend/resume callbacks. All the config state saving and PCI power management related work is done by the PCI core framework itself inside its runtime suspend/resume callbacks (pci_pm_runtime_suspend() and pci_pm_runtime_resume()). (A hedged sketch of this wiring follows this entry.)

3. Inside pci_reset_bus(), all the devices in the dev_set need to be runtime resumed. vfio_pci_dev_set_pm_runtime_get() takes care of the runtime resume and its error handling.

4. Inside vfio_pci_core_disable(), the device usage count always needs to be decremented, since it was incremented in vfio_pci_core_enable().

5. Since the runtime PM framework provides the same functionality, writing directly into the PCI PM config register can be replaced with the runtime PM routines. The use of runtime PM also provides additional power savings.

On systems which do not support D3cold, with the existing implementation:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D0

With runtime PM:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D3hot

So, with runtime PM, the upstream bridge or root port will also go into a lower power state, which is not possible with the existing implementation.

On systems which support D3cold, with the existing implementation:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3hot
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D0

With runtime PM:

// PCI device
# cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
D3cold
// upstream bridge
# cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
D3cold

So, with runtime PM, both the PCI device and the upstream bridge will go into the D3cold state.

6. If the 'disable_idle_d3' module parameter is set, runtime PM will still be enabled, but in this case the usage count should not be decremented.

7. The vfio_pci_dev_set_try_reset() return value is unused now, so its return type can be changed to void.

8. Use the runtime PM APIs in vfio_pci_core_sriov_configure(). The device can be in a low power state either through runtime power management (when there is no user) or through a PCI_PM_CTRL register write by the user. In both cases, the PF should be moved to the D0 state. To prevent any runtime usage mismatch, pci_num_vf() is called explicitly during disable.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-5-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
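A hedged sketch of the wiring described in points 1 and 2, using the standard dev_pm_ops / pm_runtime_*() APIs; all demo_* names are illustrative and the real patch differs in detail:

#include <linux/device.h>
#include <linux/pm.h>
#include <linux/pm_runtime.h>

/* Stub callbacks: the PCI core's pci_pm_runtime_suspend()/resume() do the
 * config-space save/restore and D-state selection on our behalf. */
static int demo_vfio_pci_runtime_suspend(struct device *dev)
{
	return 0;
}

static int demo_vfio_pci_runtime_resume(struct device *dev)
{
	return 0;
}

static const struct dev_pm_ops demo_vfio_pci_pm_ops = {
	SET_RUNTIME_PM_OPS(demo_vfio_pci_runtime_suspend,
			   demo_vfio_pci_runtime_resume, NULL)
};

/* Called once at register time: dropping the usage count hands the idle
 * device over to runtime PM so it (and its upstream bridge) can reach the
 * lowest D-state the platform supports. */
static void demo_vfio_pci_enable_runtime_pm(struct device *dev)
{
	pm_runtime_put(dev);
}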
-
By Abhishek Sahu

If any PME event is generated by the PCI device, it will mostly be handled in the host by the root port PME code. For example, in the case of PCIe, the PME event is sent to the root port and then the PME interrupt is generated. This is handled in drivers/pci/pcie/pme.c on the host side, where pci_check_pme_status() is called and the PME_Status and PME_En bits are cleared. So the guest OS using the vfio-pci device never learns about the PME event. To handle these PME events inside guests, we would need a framework to forward them to the virtual machine monitor. Until then, we can virtualize the PME related register bits and initialize them to zero, so the vfio-pci device user assumes the device is not capable of asserting the PME# signal from any power state.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-4-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
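A hedged sketch of the virtualization described above, assuming a vfio-pci style virtual config buffer; the helper name is illustrative and only the standard PCI PM register masks from <linux/pci_regs.h> are relied on:

#include <linux/types.h>
#include <linux/pci_regs.h>
#include <asm/byteorder.h>

/*
 * Clear the PME_Support bits in the virtualized PMC register so the device
 * user sees a function that cannot assert PME# from any power state; the
 * PME_Status/PME_En bits in PCI_PM_CTRL are likewise kept at zero.
 */
static void demo_virtualize_no_pme(u8 *vconfig, int pm_cap_offset)
{
	__le16 *vpmc = (__le16 *)&vconfig[pm_cap_offset + PCI_PM_PMC];
	u16 pmc = le16_to_cpu(*vpmc);

	pmc &= ~PCI_PM_CAP_PME_MASK;
	*vpmc = cpu_to_le16(pmc);
}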
-
By Abhishek Sahu

According to [PCIe v5 9.6.2] for PF Device Power Management States: "The PF's power management state (D-state) has global impact on its associated VFs. If a VF does not implement the Power Management Capability, then it behaves as if it is in an equivalent power state of its associated PF. If a VF implements the Power Management Capability, the Device behavior is undefined if the PF is placed in a lower power state than the VF. Software should avoid this situation by placing all VFs in lower power state before lowering their associated PF's power state."

From the vfio driver side, the user can enable SR-IOV when the PF is in the D3hot state. If a VF does not implement the Power Management Capability, then the VF will actually be in D3hot state and VF BAR access will fail. If the VF implements the Power Management Capability, then the VF will assume that its current power state is D0 while the PF is in D3hot, and in this case the behavior is undefined.

To support PF power management, we need to create a power management dependency between the PF and its VFs. Runtime power management support may help with this, since power management dependencies can be expressed through device links. But until we have such support in place, we can disallow the PF from entering a low power state while it has VFs enabled.

There can be a case where the user first enables the VFs and then disables them. If there is no user of the PF, then it could be put into D3hot state again. But with this patch, the PF will still be in D0 state after disabling the VFs, since detecting this case inside vfio_pci_core_sriov_configure() requires access to struct vfio_device::open_count along with its locks. The subsequent patches related to runtime PM will handle this case, since runtime PM maintains its own usage count.

Also, vfio_pci_core_sriov_configure() can be called at any time (with and without a vfio pci device user), so the power state change and SR-IOV enablement need to be protected with the required locks.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-3-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
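A hedged sketch of the constraint; pci_num_vf() is the real PCI core query, the helper name is illustrative and the exact placement of the check in the real patch may differ:

#include <linux/pci.h>

/*
 * Refuse to lower the PF's power state while any VFs are enabled: VFs
 * without a PM capability follow the PF into D3hot (BAR access breaks),
 * and VFs with one report D0 while the PF is lower, which is undefined
 * behavior per PCIe r5 9.6.2.
 */
static int demo_pf_may_lower_power(struct pci_dev *pdev, pci_power_t state)
{
	if (state > PCI_D0 && pci_num_vf(pdev))
		return -EBUSY;

	return 0;
}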
-
- 18 May 2022, 8 commits
-
-
By Abhishek Sahu

According to [PCIe v5 5.3.1.4.1] for the D3hot state: "Configuration and Message requests are the only TLPs accepted by a Function in the D3Hot state. All other received Requests must be handled as Unsupported Requests, and all received Completions may optionally be handled as Unexpected Completions."

Currently, if the vfio PCI device has been put into D3hot state and the user makes a non-config read/write request, the request is forwarded to the host and this access may cause issues on a few systems. This patch leverages the memory-disable support added in commit abafbc55 ("vfio-pci: Invalidate mmaps and block MMIO access on disabled memory") to generate a page fault on mmap access and return an error for direct read/write. If the device is in D3hot state, an error is returned for MMIO access. IO access generally does not make the system unresponsive, so IO access can still happen in D3hot state; the default value is returned in this case without bringing down the whole system.

Also, the power related structure fields need to be protected, so the same 'memory_lock' is used to protect these fields as well. This protection is mainly needed when the user changes the PCI power state by writing into the PCI_PM_CTRL register. The vfio_lock_and_set_power_state() wrapper function takes the required locks and then invokes vfio_pci_set_power_state().

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
Link: https://lore.kernel.org/r/20220518111612.16985-2-abhsahu@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
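A hedged sketch of the gating, assuming vfio_pci_core field names (memory_lock, pdev) and an illustrative helper; the real check also involves the virtualized command register and may differ in detail:

#include <linux/pci.h>
#include <linux/vfio_pci_core.h>

/*
 * MMIO is only usable when the device is in D0 *and* memory decoding is
 * enabled; below D0 only config (and, pragmatically, IO) accesses are
 * let through, everything else faults or returns an error.
 */
static bool demo_mmio_accessible(struct vfio_pci_core_device *vdev)
{
	lockdep_assert_held(&vdev->memory_lock);

	return vdev->pdev->current_state == PCI_D0 &&
	       __vfio_pci_memory_enabled(vdev);
}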
-
By Jason Gunthorpe

Now that everything is fully locked there is no need for container_users to remain an atomic; change it to an unsigned int. Use 'if (group->container)' as the test to determine whether the container is present, instead of using container_users.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/6-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

Once userspace opens a group FD it is prevented from opening another instance of that same group FD until all the prior group FDs and users of the container are done. The first is done trivially by checking group->opened during group FD open. However, things get a little weird if userspace creates a device FD and then closes the group FD. The group FD still cannot be re-opened, but this time it is because group->container is still set and container_users is elevated by the device FD.

Due to this mismatched lifecycle we have vfio_group_try_dissolve_container(), which tries to auto-free a container after the group FD is closed but the device FD remains open. Instead, have the device FD hold onto a reference to the single group FD. This directly prevents vfio_group_fops_release() from being called while any device FD exists and makes the lifecycle model more understandable. vfio_group_try_dissolve_container() is removed, as the only place a container is auto-deleted is during vfio_group_fops_release(). At this point container_users is either 1 or 0, since all device FDs must be closed.

Change group->opened to group->opened_file, which points to the single struct file * that is open for the group. If group->opened_file is NULL then group->container == NULL. If all device FDs have closed then the group's notifier list must be empty.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/5-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

This is necessary to avoid various user triggerable races, for instance racing SET_CONTAINER/UNSET_CONTAINER:

  ioctl(VFIO_GROUP_SET_CONTAINER)
                                   ioctl(VFIO_GROUP_UNSET_CONTAINER)
                                   vfio_group_unset_container
                                      int users = atomic_cmpxchg(
                                              &group->container_users, 1, 0);
                                      // users == 1 container_users == 0
                                      __vfio_group_unset_container(group);
                                         container = group->container;
  vfio_group_set_container()
    if (!atomic_read(&group->container_users))
      down_write(&container->group_lock);
      group->container = container;
      up_write(&container->group_lock);
                                         down_write(&container->group_lock);
                                         group->container = NULL;
                                         up_write(&container->group_lock);
                                         vfio_container_put(container);
      /* woops we lost/leaked the new container */

This can then go on to a NULL pointer deref, since container == 0 and container_users == 1.

Wrap all touches of container, except those on a performance path with a known open device, with the group_rwsem.

The only user of vfio_group_add_container_user() holds the user count for a simple operation; change it to just hold the group_lock over the operation and delete vfio_group_add_container_user(). Containers now only gain a user when a device FD is opened.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/4-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

The split follows the pairing with the destroy functions:

 - vfio_group_get_device_fd() destroyed by close()
 - vfio_device_open() destroyed by vfio_device_fops_release()
 - vfio_device_assign_container() destroyed by vfio_group_try_dissolve_container()

The next patch will put a lock around vfio_device_assign_container().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/3-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

This is not a performance path; just use the group_rwsem to protect the value.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/2-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

Without locking, userspace can trigger a UAF by racing KVM_DEV_VFIO_GROUP_DEL with VFIO_GROUP_GET_DEVICE_FD:

  CPU1                              CPU2
  ioctl(KVM_DEV_VFIO_GROUP_DEL)
                                    ioctl(VFIO_GROUP_GET_DEVICE_FD)
                                      vfio_group_get_device_fd
                                        open_device()
                                          intel_vgpu_open_device()
                                            vfio_register_notifier()
                                              vfio_register_group_notifier()
                                                blocking_notifier_call_chain(
                                                    &group->notifier,
                                                    VFIO_GROUP_NOTIFY_SET_KVM,
                                                    group->kvm);
  set_kvm()
    group->kvm = NULL
  close()
    kfree(kvm)
                                                intel_vgpu_group_notifier()
                                                  vdev->kvm = data
                                    [..]
                                                  kvm_get_kvm(vgpu->kvm); // UAF!

Add a simple rwsem in the group to protect the kvm while the notifier is using it.

Note this doesn't fix the race internal to i915 where userspace can trigger two VFIO_GROUP_NOTIFY_SET_KVM's before we reach a consumer of vgpu->kvm and trigger this same UAF; it just makes the notifier self-consistent.

Fixes: ccd46dba ("vfio: support notifier chain in vfio_group")
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/1-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
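A hedged sketch of the fix. The demo_* struct only mirrors the relevant (vfio.c-private) vfio_group members; group_rwsem is the rwsem this patch introduces, and VFIO_GROUP_NOTIFY_SET_KVM is the real notifier action from <linux/vfio.h>:

#include <linux/rwsem.h>
#include <linux/notifier.h>
#include <linux/vfio.h>

struct kvm;

/* Illustrative mirror of the relevant vfio_group members. */
struct demo_group {
	struct rw_semaphore group_rwsem;
	struct kvm *kvm;
	struct blocking_notifier_head notifier;
};

/* Writer side: KVM association changes take the rwsem exclusively. */
static void demo_group_set_kvm(struct demo_group *group, struct kvm *kvm)
{
	down_write(&group->group_rwsem);
	group->kvm = kvm;
	up_write(&group->group_rwsem);
}

/* Reader side: a newly registered notifier replays SET_KVM under the same
 * rwsem, so the pointer cannot be freed out from under the chain. */
static void demo_group_replay_set_kvm(struct demo_group *group)
{
	down_read(&group->group_rwsem);
	if (group->kvm)
		blocking_notifier_call_chain(&group->notifier,
					     VFIO_GROUP_NOTIFY_SET_KVM,
					     group->kvm);
	up_read(&group->group_rwsem);
}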
-
By Wan Jiabing

Fix the following coccicheck warning:

./virt/kvm/vfio.c:258:1-7: preceding lock on line 236

If kvm_vfio_file_iommu_group() fails, the code would goto err_fdput with mutex_lock acquired and then return ret, which could cause a potential deadlock. Move mutex_unlock below the err_fdput tag to fix it.

Fixes: d55d9e7a ("kvm/vfio: Store the struct file in the kvm_vfio_group")
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20220517023441.4258-1-wanjiabing@vivo.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
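A minimal sketch of the corrected control flow, with hypothetical demo_* names standing in for the kvm_vfio_group_add() internals (the lookup stub stands in for the kvm_vfio_file_iommu_group() call that can fail):

#include <linux/mutex.h>
#include <linux/file.h>
#include <linux/iommu.h>

/* Hypothetical stand-in for the lookup that can fail. */
static struct iommu_group *demo_lookup_iommu_group(struct file *filp)
{
	return NULL;	/* always fails in this sketch */
}

static int demo_group_add(struct mutex *lock, struct fd f)
{
	int ret = 0;

	mutex_lock(lock);

	if (!demo_lookup_iommu_group(f.file)) {
		ret = -EINVAL;
		goto err_fdput;
	}

	/* success-path work under the lock */

err_fdput:
	mutex_unlock(lock);	/* now below the label: no error path leaks the lock */
	fdput(f);
	return ret;
}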
-
- 17 May 2022, 1 commit
-
-
By Thomas Huth

There is no macro called _IORW, so use _IOWR in the comment instead.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/r/20220516101202.88373-1-thuth@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
- 14 May 2022, 10 commits
-
-
By Jason Gunthorpe

VFIO PCI does a security check as part of hot reset to prove that the user has permission to manipulate all the devices that will be impacted by the reset. Use a new API, vfio_file_has_dev(), to perform this security check against the struct file directly and remove the vfio_group from VFIO PCI. Since VFIO PCI was the last user of vfio_group_get_external_user() and vfio_group_put_external_user(), remove them as well.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
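A hedged usage sketch of the new interface; the vfio_file_has_dev() prototype is the one this series exports, while the surrounding loop and names are only illustrative of the hot-reset check:

#include <linux/vfio.h>
#include <linux/fs.h>

/*
 * The user must supply a group file covering every device affected by the
 * bus/slot reset; check a given affected vfio_device against the set of
 * files handed in through the ioctl.
 */
static int demo_validate_reset_ownership(struct file **group_files, int nfiles,
					 struct vfio_device *affected)
{
	int i;

	for (i = 0; i < nfiles; i++)
		if (vfio_file_has_dev(group_files[i], affected))
			return 0;

	return -EINVAL;	/* user does not own this affected device */
}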
-
By Jason Gunthorpe

None of the VFIO APIs take in the vfio_group anymore, so we can remove it completely. This has a subtle side effect on the enforced coherency tracking: vfio_group_get_external_user() was holding on to the container_users, which would prevent the iommu_domain, and thus the enforced coherency value, from changing while the group is registered with kvm.

It changes the security proof slightly into 'user must hold a group FD that has a device that cannot enforce DMA coherence'. As opening the group FD, not attaching the container, is the privileged operation, this doesn't change the security properties much. On the flip side it paves the way to changing the iommu_domain/container attached to a group at runtime, which is something that will be required to support nested translation.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

Just change the argument from struct vfio_group to struct file *.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

Instead of a general extension check, change the function into a limited test of whether the iommu_domain has enforced coherency, which is the only thing kvm needs to query. Make the new op self-contained by properly refcounting the container before touching it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

vfio_group_fops_open() ensures there is only ever one struct file open for any struct vfio_group at any time:

	/* Do we need multiple instances of the group open?  Seems not. */
	opened = atomic_cmpxchg(&group->opened, 0, 1);
	if (opened) {
		vfio_group_put(group);
		return -EBUSY;

Therefore the struct file * can be used directly to search the list of VFIO groups that KVM keeps, instead of using the vfio_external_group_match_file() callback to try to figure out whether the passed-in FD matches the list or not. Delete vfio_external_group_match_file().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

The only caller wants to get a pointer to the struct iommu_group associated with the VFIO group file. Instead of returning the group ID and then searching sysfs for that string to get the struct iommu_group, just directly return the iommu_group pointer already held by the vfio_group struct. It already has a safe lifetime due to the struct file kref: the vfio_group, and thus the iommu_group, cannot be destroyed while the group file is open.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
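A hedged usage sketch from a caller's point of view; the vfio_file_iommu_group() prototype is the one this series introduces, the helper name is illustrative, and no extra refcount is needed while the file references are held:

#include <linux/vfio.h>
#include <linux/iommu.h>
#include <linux/fs.h>

/* Check whether two group files refer to the same iommu_group; NULL means
 * the file is not a VFIO group file at all. */
static bool demo_same_iommu_group(struct file *a, struct file *b)
{
	struct iommu_group *ga = vfio_file_iommu_group(a);
	struct iommu_group *gb = vfio_file_iommu_group(b);

	return ga && ga == gb;
}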
-
By Jason Gunthorpe

Following patches will change the APIs to use the struct file as the handle instead of the vfio_group, so hang on to a reference to it with the same duration as the vfio_group.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

To make it easier to read and change in following patches.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v3-f7729924a7ea+25e33-vfio_kvm_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

Now that the iommu core takes care of isolation, there is no race between driver attach and container unset. Once iommu_group_release_dma_owner() returns, the device can immediately be re-used. Remove this mechanism.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-a1e8791d795b+6b-vfio_container_q_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Alex Williamson

Merge IOMMU dependencies for vfio.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
- 13 May 2022, 1 commit
-
-
Once the group enters 'owned' mode it can never be assigned back to the default_domain or to a NULL domain. It must always be actively assigned to a current domain. If the caller hasn't provided a domain, then the core must provide an explicit DMA blocking domain that has no DMA map.

Lazily create a group-global blocking DMA domain when iommu_group_claim_dma_owner() is first called and immediately assign the group to it. This ensures that DMA is immediately fully isolated on all IOMMU drivers. If the user attaches/detaches while owned, then detach will set the group back to the blocking domain.

Slightly reorganize the call chains so that __iommu_group_set_core_domain() is the function that removes any caller configured domain and sets the domain back to a core owned domain with an appropriate lifetime. __iommu_group_set_domain() is the worker function that can change the domain assigned to a group to any target domain, including NULL.

Add comments clarifying how the NULL vs detach_dev vs default_domain cases work, based on Robin's remarks.

This fixes an oops with VFIO and SMMUv3 because VFIO will call iommu_detach_group() and then immediately iommu_domain_free(), but SMMUv3 has no way to know that the domain it is holding a pointer to has been freed. Now iommu_detach_group() will assign the blocking domain and SMMUv3 will no longer hold a stale domain reference.

Fixes: 1ea2a07a ("iommu: Add DMA ownership management interfaces")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Tested-by: Baolu Lu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
--
Just minor polishing as discussed

v3:
 - Change names to __iommu_group_set_domain() / __iommu_group_set_core_domain()
 - Clarify comments
 - Call __iommu_group_set_domain() directly in iommu_group_release_dma_owner() since we know it is always selecting the default_domain
 - Remove redundant detach_dev ops check in __iommu_detach_device and make the added WARN_ON fail instead
 - Check for blocking_domain in __iommu_attach_group() so VFIO can actually attach a new group
 - Update comments and spelling
 - Fix missed change to new_domain in iommu_group_do_detach_device()
v2: https://lore.kernel.org/r/0-v2-f62259511ac0+6-iommu_dma_block_jgg@nvidia.com
v1: https://lore.kernel.org/r/0-v1-6e9d2d0a759d+11b-iommu_dma_block_jgg@nvidia.com

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v3-db7f0785022b+149-iommu_dma_block_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
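A hedged consumer-side sketch of the guarantee this change provides, using only the public ownership and attach APIs from <linux/iommu.h>; the helper name and ordering are illustrative:

#include <linux/iommu.h>

/*
 * Between claim and release the group is never left on the default_domain
 * or on a freed caller domain: any gap is covered by the core's blocking
 * domain, so DMA stays cut off and the driver never holds a stale domain.
 */
static int demo_use_owned_group(struct iommu_group *group,
				struct iommu_domain *my_domain, void *owner)
{
	int ret;

	ret = iommu_group_claim_dma_owner(group, owner);  /* group -> blocking domain */
	if (ret)
		return ret;

	ret = iommu_attach_group(my_domain, group);       /* blocking -> my_domain */
	if (!ret) {
		/* ... use my_domain for DMA mappings ... */

		iommu_detach_group(my_domain, group);      /* my_domain -> blocking */
		iommu_domain_free(my_domain);              /* safe: core no longer references it */
	}

	iommu_group_release_dma_owner(group);              /* blocking -> default_domain */
	return ret;
}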
-
- 12 May 2022, 12 commits
-
-
By Jason Gunthorpe

The last user of this function is in PCI callbacks that want to convert their struct pci_dev to a vfio_device. Instead of searching, use the vfio_device available trivially through the drvdata.

When a callback in the device_driver is called, the caller must hold the device_lock() on dev. The purpose of the device_lock is to prevent remove() from being called (see __device_release_driver), and to allow the driver to safely interact with its drvdata without races. The PCI core correctly follows this and holds the device_lock() when calling error_detected (see report_error_detected) and sriov_configure (see sriov_numvfs_store).

Further, since the drvdata holds a positive refcount on the vfio_device, any access of the drvdata, under the device_lock(), from a driver callback needs no further protection or refcounting. Thus the remark in the vfio_device_get_from_dev() comment does not apply here; VFIO PCI drivers all call vfio_unregister_group_dev() from their remove callbacks under the device_lock() and cannot race with the remaining callers.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v4-c841817a0349+8f-vfio_get_from_dev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
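A hedged sketch of the pattern; device_lock_assert() and pci_get_drvdata() are the standard helpers, while the callback body and name are illustrative:

#include <linux/device.h>
#include <linux/pci.h>
#include <linux/vfio_pci_core.h>

/*
 * PCI driver callbacks such as sriov_configure() and error_detected() run
 * with device_lock() held, so the drvdata set at probe time can be used
 * directly, with no extra search or refcount.
 */
static int demo_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
{
	struct vfio_pci_core_device *vdev;

	device_lock_assert(&pdev->dev);
	vdev = pci_get_drvdata(pdev);

	return vdev ? 0 : -ENODEV;	/* illustrative use of vdev */
}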
-
By Jason Gunthorpe

Having a consistent pointer in the drvdata will allow the next patch to make use of the drvdata from some of the core code helpers. Use a WARN_ON inside vfio_pci_core_register_device() to detect drivers that miss this.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-c841817a0349+8f-vfio_get_from_dev_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

When the open_device() op is called, container_users is incremented and held incremented until close_device(). Thus, as long as drivers call these functions within their open_device()/close_device() region, they do not need to worry about the container_users. These functions can all only be called between open_device() and close_device():

 vfio_pin_pages()
 vfio_unpin_pages()
 vfio_dma_rw()
 vfio_register_notifier()
 vfio_unregister_notifier()

Eliminate the calls to vfio_group_add_container_user() and add vfio_assert_device_open() to detect driver mis-use. This causes the close_device() op to check device->open_count, so always leave it elevated while calling the op.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
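A hedged sketch of the helper mentioned above, assuming the obvious implementation over vfio_device::open_count (the demo_ name is illustrative; the real function may differ):

#include <linux/vfio.h>
#include <linux/bug.h>
#include <linux/compiler.h>

/*
 * Drivers may only call the vfio access helpers between open_device() and
 * close_device(), i.e. while open_count is elevated; warn loudly if a
 * driver gets this wrong.
 */
static bool demo_assert_device_open(struct vfio_device *device)
{
	return !WARN_ON(!READ_ONCE(device->open_count));
}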
-
By Jason Gunthorpe

Now that callers have been updated to use the vfio_device APIs, the driver-facing group interface is no longer used; delete it:

 - vfio_group_get_external_user_from_dev()
 - vfio_group_pin_pages()
 - vfio_group_unpin_pages()
 - vfio_group_iommu_domain()

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

Use the existing vfio_device versions of vfio_(un)pin_pages(). There is no reason to use a group interface here; kvmgt has easy access to a vfio_device. Delete kvmgt_vdev::vfio_group since these calls were the last users.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Zhi Wang <zhi.a.wang@intel.com>
Link: https://lore.kernel.org/r/5-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

Every caller has a readily available vfio_device pointer, so use that instead of passing in a generic struct device. Change vfio_dma_rw() to take in the struct vfio_device, and move the container users that would have been held by vfio_group_get_external_user_from_dev() into vfio_dma_rw() directly, like vfio_pin/unpin_pages().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
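A hedged usage sketch of the device-based call; the vfio_dma_rw() signature is the one this patch converts to, and the surrounding helper is illustrative:

#include <linux/types.h>
#include <linux/vfio.h>

/*
 * Read guest memory through the container's IOMMU mappings; valid only
 * between open_device() and close_device(), where the container user
 * count is already held.
 */
static int demo_read_guest(struct vfio_device *vdev, dma_addr_t iova,
			   void *buf, size_t len)
{
	return vfio_dma_rw(vdev, iova, buf, len, false /* read */);
}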
-
By Jason Gunthorpe

Every caller has a readily available vfio_device pointer, so use that instead of passing in a generic struct device. The struct vfio_device already contains the group we need, so this avoids complexity, extra refcounting, and a confusing lifecycle model.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Jason J. Herne <jjherne@linux.ibm.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

The next patch wants the vfio_device instead. There is no reason to store a pointer here since we can container_of back to the vfio_device.

Reviewed-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Jason Gunthorpe

All callers have a struct vfio_device trivially available, so pass it in directly and avoid calling the expensive vfio_group_get_from_dev().

Acked-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Jason J. Herne <jjherne@linux.ibm.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-8045e76bf00b+13d-vfio_mdev_no_group_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Robin Murphy

IOMMU groups have been mandatory for some time now, so a device without one is necessarily a device without any usable IOMMU; therefore the iommu_present() check is redundant (or at best unhelpful).

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/537103bbd7246574f37f2c88704d7824a3a889f2.1649160714.git.robin.murphy@arm.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Alex Williamson

Merge GVT-g dependencies for vfio.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-
By Alex Williamson

Merge tag 'mlx5-lm-parallel' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into v5.19/vfio/next

Improve mlx5 live migration driver

From Yishai:

This series improves the mlx5 live migration driver in a few aspects, as described below.

Refactor to enable running migration commands in parallel over the PF command interface. To achieve that, mlx5_core now exposes an API to let the VF be notified before the PF command interface goes down/up (e.g. PF reload upon health recovery).

With the above functionality in place, mlx5 vfio no longer needs to take the global PF lock when using the command interface, but can rely on this mechanism to stay in sync with the PF. This enables parallel VF migration over the PF command interface from the kernel driver's point of view.

In addition, move to the PF async command mode for the SAVE state command. This enables returning to user space earlier upon successfully issuing the command and improves latency by letting things run in parallel.

Alex, as this series touches mlx5_core we may need to send this as a pull request to VFIO to avoid conflicts before acceptance.

Link: https://lore.kernel.org/all/20220510090206.90374-1-yishaih@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
-
- 11 May 2022, 3 commits
-
-
By Yishai Hadas

Use the PF asynchronous command mode for the SAVE state command. This enables returning to user space earlier upon successfully issuing the command and improves latency by letting things run in parallel.

Link: https://lore.kernel.org/r/20220510090206.90374-5-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
-
By Yishai Hadas

Refactor to enable different VFs to run their commands over the PF command interface in parallel without blocking each other. This is done by not using the global PF lock that was used before, and relying instead on the VF attach/detach mechanism to sync.

Link: https://lore.kernel.org/r/20220510090206.90374-4-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
-
By Yishai Hadas

Manage the VF attach/detach callback from the PF. This lets the driver enable parallel VF migration, as will be introduced in the next patch. As part of this, reorganize the "VF is migratable" code into a separate function and rename it to set_migratable() to match its functionality.

Link: https://lore.kernel.org/r/20220510090206.90374-3-yishaih@nvidia.com
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
-