提交 · 34886e796c413273908b036df57cc9ae731badae · openeuler / Kernel

14 4月, 2022 3 次提交

KVM: x86: Move .pmu_ops to kvm_x86_init_ops and tag as __initdata · 34886e79

由 Like Xu 提交于 3月 29, 2022

The pmu_ops should be moved to kvm_x86_init_ops and tagged as __initdata.
That'll save those precious few bytes, and more importantly make
the original ops unreachable, i.e. make it harder to sneak in post-init
modification bugs.
Suggested-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NLike Xu <likexu@tencent.com>
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220329235054.3534728-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

34886e79

KVM: x86: Copy kvm_pmu_ops by value to eliminate layer of indirection · 8f969c0c

由 Like Xu 提交于 3月 29, 2022

Replace the kvm_pmu_ops pointer in common x86 with an instance of the
struct to save one pointer dereference when invoking functions. Copy the
struct by value to set the ops during kvm_init().
Signed-off-by: NLike Xu <likexu@tencent.com>
[sean: Move pmc_is_enabled(), make kvm_pmu_ops static]
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220329235054.3534728-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8f969c0c

KVM: x86: Move kvm_ops_static_call_update() to x86.c · fdc298da

由 Like Xu 提交于 3月 29, 2022

The kvm_ops_static_call_update() is defined in kvm_host.h. That's
completely unnecessary, it should have exactly one caller,
kvm_arch_hardware_setup().  Move the helper to x86.c and have it do the
actual memcpy() of the ops in addition to the static call updates.  This
will also allow for cleanly giving kvm_pmu_ops static_call treatment.
Suggested-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NLike Xu <likexu@tencent.com>
[sean: Move memcpy() into the helper and rename accordingly]
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220329235054.3534728-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fdc298da

12 4月, 2022 1 次提交

KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU · 42dcbe7d

由 Vitaly Kuznetsov 提交于 4月 07, 2022

The following WARN is triggered from kvm_vm_ioctl_set_clock():
 WARNING: CPU: 10 PID: 579353 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 mark_page_dirty_in_slot+0x6c/0x80 [kvm]
 ...
 CPU: 10 PID: 579353 Comm: qemu-system-x86 Tainted: G        W  O      5.16.0.stable #20
 Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
 RIP: 0010:mark_page_dirty_in_slot+0x6c/0x80 [kvm]
 ...
 Call Trace:
  <TASK>
  ? kvm_write_guest+0x114/0x120 [kvm]
  kvm_hv_invalidate_tsc_page+0x9e/0xf0 [kvm]
  kvm_arch_vm_ioctl+0xa26/0xc50 [kvm]
  ? schedule+0x4e/0xc0
  ? __cond_resched+0x1a/0x50
  ? futex_wait+0x166/0x250
  ? __send_signal+0x1f1/0x3d0
  kvm_vm_ioctl+0x747/0xda0 [kvm]
  ...

The WARN was introduced by commit 03c0304a86bc ("KVM: Warn if
mark_page_dirty() is called without an active vCPU") but the change seems
to be correct (unlike Hyper-V TSC page update mechanism). In fact, there's
no real need to actually write to guest memory to invalidate TSC page, this
can be done by the first vCPU which goes through kvm_guest_time_update().
Reported-by: NMaxim Levitsky <mlevitsk@redhat.com>
Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
Suggested-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220407201013.963226-1-vkuznets@redhat.com>

42dcbe7d

05 4月, 2022 1 次提交

KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 1d0e8480

由 Sean Christopherson 提交于 3月 31, 2022

Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as
-1 is technically undefined behavior when its value is read out by
param_get_bool(), as boolean values are supposed to be '0' or '1'.

Alternatively, KVM could define a custom getter for the param, but the
auto value doesn't depend on the vendor module in any way, and printing
"auto" would be unnecessarily unfriendly to the user.

In addition to fixing the undefined behavior, resolving the auto value
also fixes the scenario where the auto value resolves to N and no vendor
module is loaded.  Previously, -1 would result in Y being printed even
though KVM would ultimately disable the mitigation.

Rename the existing MMU module init/exit helpers to clarify that they're
invoked with respect to the vendor module, and add comments to document
why KVM has two separate "module init" flows.

  =========================================================================
  UBSAN: invalid-load in kernel/params.c:320:33
  load of value 255 is not a valid value for type '_Bool'
  CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   ubsan_epilogue+0x5/0x40
   __ubsan_handle_load_invalid_value.cold+0x43/0x48
   param_get_bool.cold+0xf/0x14
   param_attr_show+0x55/0x80
   module_attr_show+0x1c/0x30
   sysfs_kf_seq_show+0x93/0xc0
   seq_read_iter+0x11c/0x450
   new_sync_read+0x11b/0x1a0
   vfs_read+0xf0/0x190
   ksys_read+0x5f/0xe0
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  =========================================================================

Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Cc: stable@vger.kernel.org
Reported-by: NBruno Goncalves <bgoncalv@redhat.com>
Reported-by: NJan Stancek <jstancek@redhat.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220331221359.3912754-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1d0e8480

02 4月, 2022 21 次提交

KVM: x86: optimize PKU branching in kvm_load_{guest|host}_xsave_state · 945024d7

由 Jon Kohler 提交于 3月 23, 2022

kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
part of which is managing memory protection key state. The latest
arch.pkru is updated with a rdpkru, and if that doesn't match the base
host_pkru (which about 70% of the time), we issue a __write_pkru.

To improve performance, implement the following optimizations:
 1. Reorder if conditions prior to wrpkru in both
    kvm_load_{guest|host}_xsave_state.

    Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
    checked first, which when instrumented in our environment appeared
    to be always true and less overall work than kvm_read_cr4_bits.

    For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
    one position. When instrumented, I saw this be true roughly ~70% of
    the time vs the other conditions which were almost always true.
    With this change, we will avoid 3rd condition check ~30% of the time.

 2. Wrap PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS,
    as if the user compiles out this feature, we should not have
    these branches at all.
Signed-off-by: NJon Kohler <jon@nutanix.com>
Message-Id: <20220324004439.6709-1-jon@nutanix.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

945024d7

KVM: x86: allow per cpu apicv inhibit reasons · d5fa597e

由 Maxim Levitsky 提交于 3月 22, 2022

Add optional callback .vcpu_get_apicv_inhibit_reasons returning
extra inhibit reasons that prevent APICv from working on this vCPU.
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322174050.241850-6-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d5fa597e

KVM: x86: Don't snapshot "max" TSC if host TSC is constant · 741e511b

由 Sean Christopherson 提交于 2月 25, 2022

Don't snapshot tsc_khz into max_tsc_khz during KVM initialization if the
host TSC is constant, in which case the actual TSC frequency will never
change and thus capturing the "max" TSC during initialization is
unnecessary, KVM can simply use tsc_khz during VM creation.

On CPUs with constant TSC, but not a hardware-specified TSC frequency,
snapshotting max_tsc_khz and using that to set a VM's default TSC
frequency can lead to KVM thinking it needs to manually scale the guest's
TSC if refining the TSC completes after KVM snapshots tsc_khz. The
actual frequency never changes, only the kernel's calculation of what
that frequency is changes. On systems without hardware TSC scaling, this
either puts KVM into "always catchup" mode (extremely inefficient), or
prevents creating VMs altogether.

Ideally, KVM would not be able to race with TSC refinement, or would have
a hook into tsc_refine_calibration_work() to get an alert when refinement
is complete. Avoiding the race altogether isn't practical as refinement
takes a relative eternity; it's deliberately put on a work queue outside
of the normal boot sequence to avoid unnecessarily delaying boot.

Adding a hook is doable, but somewhat gross due to KVM's ability to be
built as a module. And if the TSC is constant, which is likely the case
for every VMX/SVM-capable CPU produced in the last decade, the race can
be hit if and only if userspace is able to create a VM before TSC
refinement completes; refinement is slow, but not that slow.

For now, punt on a proper fix, as not taking a snapshot can help some
uses cases and not taking a snapshot is arguably correct irrespective of
the race with refinement.

[ dwmw2: Rebase on top of KVM-wide default_tsc_khz to ensure that all
vCPUs get the same frequency even if we hit the race. ]

Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Anton Romanov <romanton@google.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220225145304.36166-3-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

741e511b

KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl. · ffbb61d0

由 David Woodhouse 提交于 2月 25, 2022

This sets the default TSC frequency for subsequently created vCPUs.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220225145304.36166-2-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ffbb61d0

KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND · 661a20fa

由 David Woodhouse 提交于 3月 03, 2022

At the end of the patch series adding this batch of event channel
acceleration features, finally add the feature bit which advertises
them and document it all.

For SCHEDOP_poll we need to wake a polling vCPU when a given port
is triggered, even when it's masked — and we want to implement that
in the kernel, for efficiency. So we want the kernel to know that it
has sole ownership of event channel delivery. Thus, we allow
userspace to make the 'promise' by setting the corresponding feature
bit in its KVM_XEN_HVM_CONFIG call. As we implement SCHEDOP_poll
bypass later, we will do so only if that promise has been made by
userspace.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-16-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

661a20fa

KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID · 942c2490

由 David Woodhouse 提交于 3月 03, 2022

In order to intercept hypercalls such as VCPUOP_set_singleshot_timer, we
need to be aware of the Xen CPU numbering.

This looks a lot like the Hyper-V handling of vpidx, for obvious reasons.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-12-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

942c2490

KVM: x86/xen: Support direct injection of event channel events · 35025735

由 David Woodhouse 提交于 3月 03, 2022

This adds a KVM_XEN_HVM_EVTCHN_SEND ioctl which allows direct injection
of events given an explicit { vcpu, port, priority } in precisely the
same form that those fields are given in the IRQ routing table.

Userspace is currently able to inject 2-level events purely by setting
the bits in the shared_info and vcpu_info, but FIFO event channels are
harder to deal with; we will need the kernel to take sole ownership of
delivery when we support those.

A patch advertising this feature with a new bit in the KVM_CAP_XEN_HVM
ioctl will be added in a subsequent patch.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-9-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

35025735

KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_time_info · 69d413cf

由 David Woodhouse 提交于 3月 03, 2022

This switches the final pvclock to kvm_setup_pvclock_pfncache() and now
the old kvm_setup_pvclock_page() can be removed.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-7-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

69d413cf

KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_info · 7caf9571

由 David Woodhouse 提交于 3月 03, 2022

Currently, the fast path of kvm_xen_set_evtchn_fast() doesn't set the
index bits in the target vCPU's evtchn_pending_sel, because it only has
a userspace virtual address with which to do so. It just sets them in
the kernel, and kvm_xen_has_interrupt() then completes the delivery to
the actual vcpu_info structure when the vCPU runs.

Using a gfn_to_pfn_cache allows kvm_xen_set_evtchn_fast() to do the full
delivery in the common case.

Clean up the fallback case too, by moving the deferred delivery out into
a separate kvm_xen_inject_pending_events() function which isn't ever
called in atomic contexts as __kvm_xen_has_interrupt() is.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-6-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7caf9571

KVM: x86: Use gfn_to_pfn_cache for pv_time · 916d3608

由 David Woodhouse 提交于 3月 03, 2022

Add a new kvm_setup_guest_pvclock() which parallels the existing
kvm_setup_pvclock_page(). The latter will be removed once we convert
all users to the gfn_to_pfn_cache version.

Using the new cache, we can potentially let kvm_set_guest_paused() set
the PVCLOCK_GUEST_STOPPED bit directly rather than having to delegate
to the vCPU via KVM_REQ_CLOCK_UPDATE. But not yet.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-5-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

916d3608

KVM: x86/xen: Use gfn_to_pfn_cache for runstate area · a795cd43

由 David Woodhouse 提交于 3月 03, 2022

Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-4-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a795cd43

KVM: x86: Allow userspace to opt out of hypercall patching · f1a9761f

由 Oliver Upton 提交于 3月 16, 2022

KVM handles the VMCALL/VMMCALL instructions very strangely. Even though
both of these instructions really should #UD when executed on the wrong
vendor's hardware (i.e. VMCALL on SVM, VMMCALL on VMX), KVM replaces the
guest's instruction with the appropriate instruction for the vendor.
Nonetheless, older guest kernels without commit c1118b36 ("x86: kvm:
use alternatives for VMCALL vs. VMMCALL if kernel text is read-only")
do not patch in the appropriate instruction using alternatives, likely
motivating KVM's intervention.

Add a quirk allowing userspace to opt out of hypercall patching. If the
quirk is disabled, KVM synthesizes a #UD in the guest.
Signed-off-by: NOliver Upton <oupton@google.com>
Message-Id: <20220316005538.2282772-2-oupton@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f1a9761f

KVM: x86: SVM: fix tsc scaling when the host doesn't support it · 88099313

由 Maxim Levitsky 提交于 3月 22, 2022

It was decided that when TSC scaling is not supported,
the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0'
value.

However in this case kvm_max_tsc_scaling_ratio is not set,
which breaks various assumptions.

Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of
host support.  For consistency, do the same for VMX.
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-8-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

88099313

KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr · ac8d6cad

由 Hou Wenlong 提交于 3月 07, 2022

If MSR access is rejected by MSR filtering,
kvm_set_msr()/kvm_get_msr() would return KVM_MSR_RET_FILTERED,
and the return value is only handled well for rdmsr/wrmsr.
However, some instruction emulation and state transition also
use kvm_set_msr()/kvm_get_msr() to do msr access but may trigger
some unexpected results if MSR access is rejected, E.g. RDPID
emulation would inject a #UD but RDPID wouldn't cause a exit
when RDPID is supported in hardware and ENABLE_RDTSCP is set.
And it would also cause failure when load MSR at nested entry/exit.
Since msr filtering is based on MSR bitmap, it is better to only
do MSR filtering for rdmsr/wrmsr.
Signed-off-by: NHou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860.git.houwenlong.hwl@antgroup.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ac8d6cad

KVM: x86/emulator: Emulate RDPID only if it is enabled in guest · a836839c

由 Hou Wenlong 提交于 3月 02, 2022

When RDTSCP is supported but RDPID is not supported in host,
RDPID emulation is available. However, __kvm_get_msr() would
only fail when RDTSCP/RDPID both are disabled in guest, so
the emulator wouldn't inject a #UD when RDPID is disabled but
RDTSCP is enabled in guest.

Fixes: fb6d4d34 ("KVM: x86: emulate RDPID")
Signed-off-by: NHou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788.git.houwenlong.hwl@antgroup.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a836839c

KVM: x86: Trace all APICv inhibit changes and capture overall status · 4f4c4a3e

由 Sean Christopherson 提交于 3月 11, 2022

Trace all APICv inhibit changes instead of just those that result in
APICv being (un)inhibited, and log the current state.  Debugging why
APICv isn't working is frustrating as it's hard to see why APICv is still
inhibited, and logging only the first inhibition means unnecessary onion
peeling.

Opportunistically drop the export of the tracepoint, it is not and should
not be used by vendor code due to the need to serialize toggling via
apicv_update_lock.

Note, using the common flow means kvm_apicv_init() switched from atomic
to non-atomic bitwise operations.  The VM is unreachable at init, so
non-atomic is perfectly ok.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220311043517.17027-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4f4c4a3e

KVM: x86: Add wrappers for setting/clearing APICv inhibits · 320af55a

由 Sean Christopherson 提交于 3月 11, 2022

Add set/clear wrappers for toggling APICv inhibits to make the call sites
more readable, and opportunistically rename the inner helpers to align
with the new wrappers and to make them more readable as well. Invert the
flag from "activate" to "set"; activate is painfully ambiguous as it's
not obvious if the inhibit is being activated, or if APICv is being
activated, in which case the inhibit is being deactivated.

For the functions that take @set, swap the order of the inhibit reason
and @set so that the call sites are visually similar to those that bounce
through the wrapper.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220311043517.17027-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

320af55a

KVM: x86: Make APICv inhibit reasons an enum and cleanup naming · 7491b7b2

由 Sean Christopherson 提交于 3月 11, 2022

Use an enum for the APICv inhibit reasons, there is no meaning behind
their values and they most definitely are not "unsigned longs".  Rename
the various params to "reason" for consistency and clarity (inhibit may
be confused as a command, i.e. inhibit APICv, instead of the reason that
is getting toggled/checked).

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220311043517.17027-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7491b7b2

KVM: X86: Handle implicit supervisor access with SMAP · 4f4aa80e

由 Lai Jiangshan 提交于 3月 11, 2022

There are two kinds of implicit supervisor access
implicit supervisor access when CPL = 3
implicit supervisor access when CPL < 3

Current permission_fault() handles only the first kind for SMAP.

But if the access is implicit when SMAP is on, data may not be read
nor write from any user-mode address regardless the current CPL.

So the second kind should be also supported.

The first kind can be detect via CPL and access mode: if it is
supervisor access and CPL = 3, it must be implicit supervisor access.

But it is not possible to detect the second kind without extra
information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
into @access. This extra information also works for the first kind, so
the logic is changed to use this information for both cases.

The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48
which is in the most significant 16 bits of u64 and less likely to be
forced to change due to future hardware uses it.

This patch removes the call to ->get_cpl() for access mode is determined
by @access. Not only does it reduce a function call, but also remove
confusions when the permission is checked for nested TDP. The nested
TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing
on it. The original code works just because it is always user walk for
NPT and SMAP fault is not set for EPT in update_permission_bitmask.
Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4f4aa80e

KVM: X86: Change the type of access u32 to u64 · 5b22bbe7

由 Lai Jiangshan 提交于 3月 11, 2022

Change the type of access u32 to u64 for FNAME(walk_addr) and
->gva_to_gpa().

The kinds of accesses are usually combinations of UWX, and VMX/SVM's
nested paging adds a new factor of access: is it an access for a guest
page table or for a final guest physical address.

And SMAP relies a factor for supervisor access: explicit or implicit.

So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include
all these information to do the walk.

Although @access(u32) has enough bits to encode all the kinds, this
patch extends it to u64:
	o Extra bits will be in the higher 32 bits, so that we can
	  easily obtain the traditional access mode (UWX) by converting
	  it to u32.
	o Reuse the value for the access kind defined by SVM's nested
	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
	  @error_code in kvm_handle_page_fault().
Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5b22bbe7

KVM: MMU: propagate alloc_workqueue failure · a1a39128

由 Paolo Bonzini 提交于 3月 25, 2022

If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
kvm_arch_init_vm also has to undo all the initialization, so
group all the MMU initialization code at the beginning and
handle cleaning up of kvm_page_track_init.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a1a39128

21 3月, 2022 2 次提交

KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2 · 6d849191

由 Oliver Upton 提交于 3月 01, 2022

KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not
advertise the set of quirks which may be disabled to userspace, so it is
impossible to predict the behavior of KVM. Worse yet,
KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning
it fails to reject attempts to set invalid quirk bits.

The only valid workaround for the quirky quirks API is to add a new CAP.
Actually advertise the set of quirks that can be disabled to userspace
so it can predict KVM's behavior. Reject values for cap->args[0] that
contain invalid bits.

Finally, add documentation for the new capability and describe the
existing quirks.
Signed-off-by: NOliver Upton <oupton@google.com>
Message-Id: <20220301060351.442881-5-oupton@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6d849191

kvm: x86: Require const tsc for RT · 5e17b2ee

由 Thomas Gleixner 提交于 11月 06, 2011

Non constant TSC is a nightmare on bare metal already, but with
virtualization it becomes a complete disaster because the workarounds
are horrible latency wise. That's also a preliminary for running RT in
a guest on top of a RT host.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Message-Id: <Yh5eJSG19S2sjZfy@linutronix.de>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5e17b2ee

08 3月, 2022 1 次提交

KVM: use __vcalloc for very large allocations · 37b2a651

由 Paolo Bonzini 提交于 3月 08, 2022

Allocations whose size is related to the memslot size can be arbitrarily
large.  Do not use kvzalloc/kvcalloc, as those are limited to "not crazy"
sizes that fit in 32 bits.

Cc: stable@vger.kernel.org
Fixes: 7661809d ("mm: don't allow oversized kvmalloc() calls")
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

37b2a651

02 3月, 2022 1 次提交

KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run · 8d25b7be

由 Paolo Bonzini 提交于 2月 19, 2022

kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two
places, namely vcpu_run and post_kvm_run_save, and a third is actually
needed around the call to vcpu->arch.complete_userspace_io to avoid
the following splat:

  WARNING: suspicious RCU usage
  arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage!
  other info that might help us debug this:
  rcu_scheduler_active = 2, debug_locks = 1
  1 lock held by CPU 28/KVM/370841:
  #0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm]
  Call Trace:
   <TASK>
   dump_stack_lvl+0x59/0x73
   reprogram_fixed_counter+0x15d/0x1a0 [kvm]
   kvm_pmu_trigger_event+0x1a3/0x260 [kvm]
   ? free_moved_vector+0x1b4/0x1e0
   complete_fast_pio_in+0x8a/0xd0 [kvm]

This splat is not at all unexpected, since complete_userspace_io callbacks
can execute similar code to vmexits.  For example, SVM with nrips=false
will call into the emulator from svm_skip_emulated_instruction().

While it's tempting to never acquire kvm->srcu for an uninitialized vCPU,
practically speaking there's no penalty to acquiring kvm->srcu "early"
as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU.  On
the other hand, seemingly innocuous helpers like kvm_apic_accept_events()
and sync_regs() can theoretically reach code that might access
SRCU-protected data structures, e.g. sync_regs() can trigger forced
existing of nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events().
Reported-by: NLike Xu <likexu@tencent.com>
Co-developed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8d25b7be

01 3月, 2022 5 次提交

KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped · 527d5cd7

由 Sean Christopherson 提交于 2月 25, 2022

Zap only obsolete roots when responding to zapping a single root shadow
page.  Because KVM keeps root_count elevated when stuffing a previous
root into its PGD cache, shadowing a 64-bit guest means that zapping any
root causes all vCPUs to reload all roots, even if their current root is
not affected by the zap.

For many kernels, zapping a single root is a frequent operation, e.g. in
Linux it happens whenever an mm is dropped, e.g. process exits, etc...
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NBen Gardon <bgardon@google.com>
Message-Id: <20220225182248.3812651-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

527d5cd7

KVM: x86: Invoke kvm_mmu_unload() directly on CR4.PCIDE change · f6d0a252

由 Sean Christopherson 提交于 2月 25, 2022

Replace a KVM_REQ_MMU_RELOAD request with a direct kvm_mmu_unload() call
when the guest's CR4.PCIDE changes. This will allow tweaking the logic
of KVM_REQ_MMU_RELOAD to free only obsolete/invalid roots, which is the
historical intent of KVM_REQ_MMU_RELOAD. The recent PCIDE behavior is
the only user of KVM_REQ_MMU_RELOAD that doesn't mark affected roots as
obsolete, needs to unconditionally unload the entire MMU, _and_ affects
only the current vCPU.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220225182248.3812651-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f6d0a252

KVM: x86: Make kvm_lapic_set_reg() a "private" xAPIC helper · b9964ee3

由 Sean Christopherson 提交于 2月 04, 2022

Hide the lapic's "raw" write helper inside lapic.c to force non-APIC code
to go through proper helpers when modification the vAPIC state.  Keep the
read helper visible to outsiders for now, refactoring KVM to hide it too
is possible, it will just take more work to do so.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-11-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b9964ee3

KVM: x86: WARN if KVM emulates an IPI without clearing the BUSY flag · bd17f417

由 Sean Christopherson 提交于 2月 04, 2022

WARN if KVM emulates an IPI without clearing the BUSY flag, failure to do
so could hang the guest if it waits for the IPI be sent.

Opportunistically use APIC_ICR_BUSY macro instead of open coding the
magic number, and add a comment to clarify why kvm_recalculate_apic_map()
is unconditionally invoked (it's really, really confusing for IPIs due to
the existence of fast paths that don't trigger a potential recalc).
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-7-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bd17f417

KVM: x86: flush TLB separately from MMU reset · b5f61c03

由 Paolo Bonzini 提交于 2月 14, 2022

For both CR0 and CR4, disassociate the TLB flush logic from the
MMU role logic.  Instead  of relying on kvm_mmu_reset_context() being
a superset of various TLB flushes (which is not necessarily going to
be the case in the future), always call it if the role changes
but also set the various TLB flush requests according to what is
in the manual.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b5f61c03

25 2月, 2022 5 次提交

KVM: x86/mmu: do not pass vcpu to root freeing functions · 0c1c92f1

由 Paolo Bonzini 提交于 2月 21, 2022

These functions only operate on a given MMU, of which there is more
than one in a vCPU (we care about two, because the third does not have
any roots and is only used to walk guest page tables). They do need a
struct kvm in order to lock the mmu_lock, but they do not needed anything
else in the struct kvm_vcpu. So, pass the vcpu->kvm directly to them.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0c1c92f1

KVM: x86: use struct kvm_mmu_root_info for mmu->root · b9e5603c

由 Paolo Bonzini 提交于 2月 21, 2022

The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
Use the struct to have more consistency between mmu->root and
mmu->prev_roots.

The patch is entirely search and replace except for cached_root_available,
which does not need a temporary struct kvm_mmu_root_info anymore.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b9e5603c

KVM: x86: do not deliver asynchronous page faults if CR0.PG=0 · 57cb3bb0

由 Paolo Bonzini 提交于 2月 09, 2022

Enabling async page faults is nonsensical if paging is disabled, but
it is allowed because CR0.PG=0 does not clear the async page fault
MSR.  Just ignore them and only use the artificial halt state,
similar to what happens in guest mode if async #PF vmexits are disabled.

Given the increasingly complex logic, and the nicer code if the new
"if" is placed last, opportunistically change the "||" into a chain
of "if (...) return false" statements.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

57cb3bb0

KVM: x86: Reinitialize context if host userspace toggles EFER.LME · d6174299

由 Paolo Bonzini 提交于 2月 09, 2022

While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and
therefore EFER.NX is the only bit that can affect the MMU role.  However,
set_efer accepts a host-initiated change to EFER.LME even with CR0.PG=1.
In that case, the MMU has to be reset.

Fixes: 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
Cc: stable@vger.kernel.org
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d6174299

KVM: x86: Provide per VM capability for disabling PMU virtualization · ba7bb663

由 David Dunn 提交于 2月 23, 2022

Add a new capability, KVM_CAP_PMU_CAPABILITY, that takes a bitmask of
settings/features to allow userspace to configure PMU virtualization on
a per-VM basis.  For now, support a single flag, KVM_PMU_CAP_DISABLE,
to allow disabling PMU virtualization for a VM even when KVM is configured
with enable_pmu=true a module level.

To keep KVM simple, disallow changing VM's PMU configuration after vCPUs
have been created.
Signed-off-by: NDavid Dunn <daviddunn@google.com>
Message-Id: <20220223225743.2703915-2-daviddunn@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ba7bb663

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功