提交 · 527d5cd7eece9f9f5e9c5b6692cd6814a46df6fe · openeuler / Kernel

01 3月, 2022 5 次提交

KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped · 527d5cd7

由 Sean Christopherson 提交于 2月 25, 2022

Zap only obsolete roots when responding to zapping a single root shadow
page.  Because KVM keeps root_count elevated when stuffing a previous
root into its PGD cache, shadowing a 64-bit guest means that zapping any
root causes all vCPUs to reload all roots, even if their current root is
not affected by the zap.

For many kernels, zapping a single root is a frequent operation, e.g. in
Linux it happens whenever an mm is dropped, e.g. process exits, etc...
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NBen Gardon <bgardon@google.com>
Message-Id: <20220225182248.3812651-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

527d5cd7

KVM: x86: Invoke kvm_mmu_unload() directly on CR4.PCIDE change · f6d0a252

由 Sean Christopherson 提交于 2月 25, 2022

Replace a KVM_REQ_MMU_RELOAD request with a direct kvm_mmu_unload() call
when the guest's CR4.PCIDE changes. This will allow tweaking the logic
of KVM_REQ_MMU_RELOAD to free only obsolete/invalid roots, which is the
historical intent of KVM_REQ_MMU_RELOAD. The recent PCIDE behavior is
the only user of KVM_REQ_MMU_RELOAD that doesn't mark affected roots as
obsolete, needs to unconditionally unload the entire MMU, _and_ affects
only the current vCPU.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220225182248.3812651-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f6d0a252

KVM: x86: Make kvm_lapic_set_reg() a "private" xAPIC helper · b9964ee3

由 Sean Christopherson 提交于 2月 04, 2022

Hide the lapic's "raw" write helper inside lapic.c to force non-APIC code
to go through proper helpers when modification the vAPIC state.  Keep the
read helper visible to outsiders for now, refactoring KVM to hide it too
is possible, it will just take more work to do so.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-11-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b9964ee3

KVM: x86: WARN if KVM emulates an IPI without clearing the BUSY flag · bd17f417

由 Sean Christopherson 提交于 2月 04, 2022

WARN if KVM emulates an IPI without clearing the BUSY flag, failure to do
so could hang the guest if it waits for the IPI be sent.

Opportunistically use APIC_ICR_BUSY macro instead of open coding the
magic number, and add a comment to clarify why kvm_recalculate_apic_map()
is unconditionally invoked (it's really, really confusing for IPIs due to
the existence of fast paths that don't trigger a potential recalc).
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220204214205.3306634-7-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bd17f417

KVM: x86: flush TLB separately from MMU reset · b5f61c03

由 Paolo Bonzini 提交于 2月 14, 2022

For both CR0 and CR4, disassociate the TLB flush logic from the
MMU role logic.  Instead  of relying on kvm_mmu_reset_context() being
a superset of various TLB flushes (which is not necessarily going to
be the case in the future), always call it if the role changes
but also set the various TLB flush requests according to what is
in the manual.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b5f61c03

25 2月, 2022 5 次提交

KVM: x86/mmu: do not pass vcpu to root freeing functions · 0c1c92f1

由 Paolo Bonzini 提交于 2月 21, 2022

These functions only operate on a given MMU, of which there is more
than one in a vCPU (we care about two, because the third does not have
any roots and is only used to walk guest page tables). They do need a
struct kvm in order to lock the mmu_lock, but they do not needed anything
else in the struct kvm_vcpu. So, pass the vcpu->kvm directly to them.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0c1c92f1

KVM: x86: use struct kvm_mmu_root_info for mmu->root · b9e5603c

由 Paolo Bonzini 提交于 2月 21, 2022

The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
Use the struct to have more consistency between mmu->root and
mmu->prev_roots.

The patch is entirely search and replace except for cached_root_available,
which does not need a temporary struct kvm_mmu_root_info anymore.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b9e5603c

KVM: x86: do not deliver asynchronous page faults if CR0.PG=0 · 57cb3bb0

由 Paolo Bonzini 提交于 2月 09, 2022

Enabling async page faults is nonsensical if paging is disabled, but
it is allowed because CR0.PG=0 does not clear the async page fault
MSR.  Just ignore them and only use the artificial halt state,
similar to what happens in guest mode if async #PF vmexits are disabled.

Given the increasingly complex logic, and the nicer code if the new
"if" is placed last, opportunistically change the "||" into a chain
of "if (...) return false" statements.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

57cb3bb0

KVM: x86: Reinitialize context if host userspace toggles EFER.LME · d6174299

由 Paolo Bonzini 提交于 2月 09, 2022

While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and
therefore EFER.NX is the only bit that can affect the MMU role.  However,
set_efer accepts a host-initiated change to EFER.LME even with CR0.PG=1.
In that case, the MMU has to be reset.

Fixes: 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
Cc: stable@vger.kernel.org
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d6174299

KVM: x86: Provide per VM capability for disabling PMU virtualization · ba7bb663

由 David Dunn 提交于 2月 23, 2022

Add a new capability, KVM_CAP_PMU_CAPABILITY, that takes a bitmask of
settings/features to allow userspace to configure PMU virtualization on
a per-VM basis.  For now, support a single flag, KVM_PMU_CAP_DISABLE,
to allow disabling PMU virtualization for a VM even when KVM is configured
with enable_pmu=true a module level.

To keep KVM simple, disallow changing VM's PMU configuration after vCPUs
have been created.
Signed-off-by: NDavid Dunn <daviddunn@google.com>
Message-Id: <20220223225743.2703915-2-daviddunn@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ba7bb663

19 2月, 2022 5 次提交

KVM: x86: allow defining return-0 static calls · 5be2226f

由 Paolo Bonzini 提交于 2月 15, 2022

A few vendor callbacks are only used by VMX, but they return an integer
or bool value.  Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is
NULL in struct kvm_x86_ops, it will be changed to __static_call_return0
when updating static calls.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5be2226f

KVM: x86: make several APIC virtualization callbacks optional · abb6d479

由 Paolo Bonzini 提交于 2月 08, 2022

All their invocations are conditional on vcpu->arch.apicv_active,
meaning that they need not be implemented by vendor code: even
though at the moment both vendors implement APIC virtualization,
all of them can be optional.  In fact SVM does not need many of
them, and their implementation can be deleted now.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

abb6d479

KVM: x86: remove KVM_X86_OP_NULL and mark optional kvm_x86_ops · e4fc23ba

由 Paolo Bonzini 提交于 12月 09, 2021

The original use of KVM_X86_OP_NULL, which was to mark calls
that do not follow a specific naming convention, is not in use
anymore.  Instead, let's mark calls that are optional because
they are always invoked within conditionals or with static_call_cond.
Those that are _not_, i.e. those that are defined with KVM_X86_OP,
must be defined by both vendor modules or some kind of NULL pointer
dereference is bound to happen at runtime.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e4fc23ba

KVM: x86: use static_call_cond for optional callbacks · 2a890614

由 Paolo Bonzini 提交于 2月 01, 2022

SVM implements neither update_emulated_instruction nor
set_apic_access_page_addr.  Remove an "if" by calling them
with static_call_cond().
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2a890614

KVM: x86: return 1 unconditionally for availability of KVM_CAP_VAPIC · 8a289785

由 Paolo Bonzini 提交于 2月 15, 2022

The two ioctls used to implement userspace-accelerated TPR,
KVM_TPR_ACCESS_REPORTING and KVM_SET_VAPIC_ADDR, are available
even if hardware-accelerated TPR can be used.  So there is
no reason not to report KVM_CAP_VAPIC.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8a289785

12 2月, 2022 1 次提交

KVM: SVM: fix race between interrupt delivery and AVIC inhibition · 66fa226c

由 Maxim Levitsky 提交于 2月 08, 2022

If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
inhibited, it might read a stale value of vcpu->arch.apicv_active
which can lead to the target vCPU not noticing the interrupt.

To fix this use load-acquire/store-release so that, if the target vCPU
is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
AVIC.  If AVIC has been disabled in the meanwhile, proceed with the
KVM_REQ_EVENT-based delivery.

Incomplete IPI vmexit has the same races as svm_deliver_avic_intr, and
in fact it can be handled in exactly the same way; the only difference
lies in who has set IRR, whether svm_deliver_interrupt or the processor.
Therefore, svm_complete_interrupt_delivery can be used to fix incomplete
IPI vmexits as well.
Co-developed-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

66fa226c

11 2月, 2022 13 次提交

KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG · cb00a70b

由 David Matlack 提交于 1月 19, 2022

When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
write-protected when dirty logging is enabled on the memslot. Instead
they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
the first time and only for the specific sub-region being cleared.

Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
write-protecting to avoid causing write-protection faults on vCPU
threads. This also allows userspace to smear the cost of huge page
splitting across multiple ioctls, rather than splitting the entire
memslot as is the case when initially-all-set is not used.
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cb00a70b

KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled · a3fe5dbd

由 David Matlack 提交于 1月 19, 2022

When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.

Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failures to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.

Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:

 1. Splitting on the vCPU thread interrupts vCPUs execution and is
    disruptive to customers whereas splitting on VM ioctl threads can
    run in parallel with vCPU execution.

 2. Splitting all huge pages at once is more efficient because it does
    not require performing VM-exit handling or walking the page table for
    every 4KiB page in the memslot, and greatly reduces the amount of
    contention on the mmu_lock.

For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.

Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.

Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a3fe5dbd

KVM: x86: Use more verbose names for mem encrypt kvm_x86_ops hooks · 03d004cd

由 Sean Christopherson 提交于 1月 28, 2022

Use slightly more verbose names for the so called "memory encrypt",
a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
super short kvm_x86_ops names and SVM's more verbose, but non-conforming
names.  This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP()
to fill svm_x86_ops.

Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
reflect its true nature, as it really is a full fledged ioctl() of its
own.  Ideally, the hook would be named confidential_vm_ioctl() or so, as
the ioctl() is a gateway to more than just memory encryption, and because
its underlying purpose to support Confidential VMs, which can be provided
without memory encryption, e.g. if the TCB of the guest includes the host
kernel but not host userspace, or by isolation in hardware without
encrypting memory.  But, diverging from KVM_MEMORY_ENCRYPT_OP even
further is undeseriable, and short of creating alises for all related
ioctl()s, which introduces a different flavor of divergence, KVM is stuck
with the nomenclature.

Defer renaming SVM's functions to a future commit as there are additional
changes needed to make SVM fully conforming and to match reality (looking
at you, svm_vm_copy_asid_from()).

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-20-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

03d004cd

KVM: x86: Move get_cs_db_l_bits() helper to SVM · 872e0c53

由 Sean Christopherson 提交于 1月 28, 2022

Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that
its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a
superfluous export from KVM x86.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-16-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

872e0c53

KVM: x86: Use static_call() for copy/move encryption context ioctls() · 7ad02ef0

由 Sean Christopherson 提交于 1月 28, 2022

Define and use static_call()s for .vm_{copy,move}_enc_context_from(),
mostly so that the op is defined in kvm-x86-ops.h.  This will allow using
KVM_X86_OP in vendor code to wire up the implementation.  Any performance
gains eeked out by using static_call() is a happy bonus and not the
primary motiviation.

Opportunistically refactor the code to reduce indentation and keep line
lengths reasonable, and to be consistent when wrapping versus running
a bit over the 80 char soft limit.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-12-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7ad02ef0

KVM: x86: Unexport kvm_x86_ops · dfc4e6ca

由 Sean Christopherson 提交于 1月 28, 2022

Drop the export of kvm_x86_ops now it is no longer referenced by SVM or
VMX.  Disallowing access to kvm_x86_ops is very desirable as it prevents
vendor code from incorrectly modifying hooks after they have been set by
kvm_arch_hardware_setup(), and more importantly after each function's
associated static_call key has been updated.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-11-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

dfc4e6ca

KVM: x86: Rename kvm_x86_ops pointers to align w/ preferred vendor names · e27bc044

由 Sean Christopherson 提交于 1月 28, 2022

Rename a variety of kvm_x86_op function pointers so that preferred name
for vendor implementations follows the pattern <vendor>_<function>, e.g.
rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run().  This will
allow vendor implementations to be wired up via the KVM_X86_OP macro.

In many cases, VMX and SVM "disagree" on the preferred name, though in
reality it's VMX and x86 that disagree as SVM blindly prepended _svm to
the kvm_x86_ops name.  Justification for using the VMX nomenclature:

  - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
    event that has already been "set" in e.g. the vIRR.  SVM's relevant
    VMCB field is even named event_inj, and KVM's stat is irq_injections.

  - prepare_guest_switch => prepare_switch_to_guest because the former is
    ambiguous, e.g. it could mean switching between multiple guests,
    switching from the guest to host, etc...

  - update_pi_irte => pi_update_irte to allow for matching match the rest
    of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>().

  - start_assignment => pi_start_assignment to again follow VMX's posted
    interrupt naming scheme, and to provide context for what bit of code
    might care about an otherwise undescribed "assignment".

The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
wrong.  x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
appropriate name for the Hyper-V hooks would be flush_remote_tlbs.  Leave
that change for another time as the Hyper-V hooks always start as NULL,
i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
names requires an astounding amount of churn.

VMX and SVM function names are intentionally left as is to minimize the
diff.  Both VMX and SVM will need to rename even more functions in order
to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
inevitable.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e27bc044

KVM: x86: Drop export for .tlb_flush_current() static_call key · feee3d9d

由 Sean Christopherson 提交于 1月 28, 2022

Remove the export of kvm_x86_tlb_flush_current() as there are no longer
any users outside of common x86 code.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

feee3d9d

KVM: x86: Remove unused "flags" of kvm_pv_kick_cpu_op() · 9d68c6f6

由 Jinrong Liang 提交于 1月 25, 2022

The "unsigned long flags" parameter of  kvm_pv_kick_cpu_op() is not used,
so remove it. No functional change intended.
Signed-off-by: NJinrong Liang <cloudliang@tencent.com>
Message-Id: <20220125095909.38122-20-cloudliang@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9d68c6f6

KVM: x86: Remove unused "vcpu" of kvm_scale_tsc() · 62711e5a

由 Jinrong Liang 提交于 1月 25, 2022

The "struct kvm_vcpu *vcpu" parameter of kvm_scale_tsc() is not used,
so remove it. No functional change intended.
Signed-off-by: NJinrong Liang <cloudliang@tencent.com>
Message-Id: <20220125095909.38122-18-cloudliang@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

62711e5a

KVM: x86: Skip APICv update if APICv is disable at the module level · f1575642

由 Sean Christopherson 提交于 12月 08, 2021

Bail from the APICv update paths _before_ taking apicv_update_lock if
APICv is disabled at the module level.  kvm_request_apicv_update() in
particular is invoked from multiple paths that can be reached without
APICv being enabled, e.g. svm_enable_irq_window(), and taking the
rw_sem for write when APICv is disabled may introduce unnecessary
contention and stalls.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-25-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f1575642

KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons · 7446cfeb

由 Sean Christopherson 提交于 12月 08, 2021

Drop the useless NULL check on kvm_x86_ops.check_apicv_inhibit_reasons
when handling an APICv update, both VMX and SVM unconditionally implement
the helper and leave it non-NULL even if APICv is disabled at the module
level.  The latter is a moot point now that __kvm_request_apicv_update()
is called if and only if enable_apicv is true.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-26-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7446cfeb

KVM: x86: Unexport __kvm_request_apicv_update() · cf9e2555

由 Sean Christopherson 提交于 12月 08, 2021

Unexport __kvm_request_apicv_update(), it's not used by vendor code and
should never be used by vendor code.  The only reason it's exposed at all
is because Hyper-V's SynIC needs to track how many auto-EOIs are in use,
and it's convenient to use apicv_update_lock to guard that tracking.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-27-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cf9e2555

04 2月, 2022 1 次提交

KVM: x86: Use ERR_PTR_USR() to return -EFAULT as a __user pointer · 6e37ec88

由 Sean Christopherson 提交于 2月 02, 2022

Use ERR_PTR_USR() when returning -EFAULT from kvm_get_attr_addr(), sparse
complains about implicitly casting the kernel pointer from ERR_PTR() into
a __user pointer.

>> arch/x86/kvm/x86.c:4342:31: sparse: sparse: incorrect type in return expression
   (different address spaces) @@     expected void [noderef] __user * @@     got void * @@
   arch/x86/kvm/x86.c:4342:31: sparse:     expected void [noderef] __user *
   arch/x86/kvm/x86.c:4342:31: sparse:     got void *
>> arch/x86/kvm/x86.c:4342:31: sparse: sparse: incorrect type in return expression
   (different address spaces) @@     expected void [noderef] __user * @@     got void * @@
   arch/x86/kvm/x86.c:4342:31: sparse:     expected void [noderef] __user *
   arch/x86/kvm/x86.c:4342:31: sparse:     got void *

No functional change intended.

Fixes: 56f289a8 ("KVM: x86: Add a helper to retrieve userspace address from kvm_device_attr")
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220202005157.2545816-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6e37ec88

01 2月, 2022 1 次提交

kvm/x86: rework guest entry logic · b2d2af7e

由 Mark Rutland 提交于 2月 01, 2022

For consistency and clarity, migrate x86 over to the generic helpers for
guest timing and lockdep/RCU/tracing management, and remove the
x86-specific helpers.

Prior to this patch, the guest timing was entered in
kvm_guest_enter_irqoff() (called by svm_vcpu_enter_exit() and
svm_vcpu_enter_exit()), and was exited by the call to
vtime_account_guest_exit() within vcpu_enter_guest().

To minimize duplication and to more clearly balance entry and exit, both
entry and exit of guest timing are placed in vcpu_enter_guest(), using
the new guest_timing_{enter,exit}_irqoff() helpers. When context
tracking is used a small amount of additional time will be accounted
towards guests; tick-based accounting is unnaffected as IRQs are
disabled at this point and not enabled until after the return from the
guest.

This also corrects (benign) mis-balanced context tracking accounting
introduced in commits:

  ae95f566 ("KVM: X86: TSCDEADLINE MSR emulation fastpath")
  26efe2fd ("KVM: VMX: Handle preemption timer fastpath")

Where KVM can enter a guest multiple times, calling vtime_guest_enter()
without a corresponding call to vtime_account_guest_exit(), and with
vtime_account_system() called when vtime_account_guest() should be used.
As account_system_time() checks PF_VCPU and calls account_guest_time(),
this doesn't result in any functional problem, but is unnecessarily
confusing.
Signed-off-by: NMark Rutland <mark.rutland@arm.com>
Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
Reviewed-by: NNicolas Saenz Julienne <nsaenzju@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <20220201132926.3301912-4-mark.rutland@arm.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b2d2af7e

28 1月, 2022 2 次提交

KVM: x86: add system attribute to retrieve full set of supported xsave states · dd6e6312

由 Paolo Bonzini 提交于 1月 26, 2022

Because KVM_GET_SUPPORTED_CPUID is meant to be passed (by simple-minded
VMMs) to KVM_SET_CPUID2, it cannot include any dynamic xsave states that
have not been enabled. Probing those, for example so that they can be
passed to ARCH_REQ_XCOMP_GUEST_PERM, requires a new ioctl or arch_prctl.
The latter is in fact worse, even though that is what the rest of the
API uses, because it would require supported_xcr0 to be moved from the
KVM module to the kernel just for this use. In addition, the value
would be nonsensical (or an error would have to be returned) until
the KVM module is loaded in.

Therefore, to limit the growth of system ioctls, add a /dev/kvm
variant of KVM_{GET,HAS}_DEVICE_ATTR, and implement it in x86
with just one group (0) and attribute (KVM_X86_XCOMP_GUEST_SUPP).
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

dd6e6312

KVM: x86: Add a helper to retrieve userspace address from kvm_device_attr · 56f289a8

由 Sean Christopherson 提交于 1月 27, 2022

Add a helper to handle converting the u64 userspace address embedded in
struct kvm_device_attr into a userspace pointer, it's all too easy to
forget the intermediate "unsigned long" cast as well as the truncation
check.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

56f289a8

27 1月, 2022 5 次提交

KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time · 05a9e065

由 Like Xu 提交于 1月 26, 2022

XCR0 is reset to 1 by RESET but not INIT and IA32_XSS is zeroed by
both RESET and INIT. The kvm_set_msr_common()'s handling of MSR_IA32_XSS
also needs to update kvm_update_cpuid_runtime(). In the above cases, the
size in bytes of the XSAVE area containing all states enabled by XCR0 or
(XCRO | IA32_XSS) needs to be updated.

For simplicity and consistency, existing helpers are used to write values
and call kvm_update_cpuid_runtime(), and it's not exactly a fast path.

Fixes: a554d207 ("KVM: X86: Processor States following Reset or INIT")
Cc: stable@vger.kernel.org
Signed-off-by: NLike Xu <likexu@tencent.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220126172226.2298529-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

05a9e065

KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS · 4c282e51

由 Like Xu 提交于 1月 26, 2022

Do a runtime CPUID update for a vCPU if MSR_IA32_XSS is written, as the
size in bytes of the XSAVE area is affected by the states enabled in XSS.

Fixes: 20300099 ("kvm: vmx: add MSR logic for XSAVES")
Cc: stable@vger.kernel.org
Signed-off-by: NLike Xu <likexu@tencent.com>
[sean: split out as a separate patch, adjust Fixes tag]
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220126172226.2298529-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4c282e51

KVM: x86: Keep MSR_IA32_XSS unchanged for INIT · be4f3b3f

由 Xiaoyao Li 提交于 1月 26, 2022

It has been corrected from SDM version 075 that MSR_IA32_XSS is reset to
zero on Power up and Reset but keeps unchanged on INIT.

Fixes: a554d207 ("KVM: X86: Processor States following Reset or INIT")
Cc: stable@vger.kernel.org
Signed-off-by: NXiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220126172226.2298529-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

be4f3b3f

KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078

由 Sean Christopherson 提交于 1月 25, 2022

Forcibly leave nested virtualization operation if userspace toggles SMM
state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
vmxon=false and smm.vmxon=false, but all other nVMX state allocated.

Don't attempt to gracefully handle the transition as (a) most transitions
are nonsencial, e.g. forcing SMM while L2 is running, (b) there isn't
sufficient information to handle all transitions, e.g. SVM wants access
to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
KVM_SET_NESTED_STATE during state restore as the latter disallows putting
the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
being post-VMXON in SMM if SMM is not active.

Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
in an architecturally impossible state.

  WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
  WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
  Modules linked in:
  CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
  RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
  Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
  Call Trace:
   <TASK>
   kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
   kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
   kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
   kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
   kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
   kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
   kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
   kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
   __fput+0x286/0x9f0 fs/file_table.c:311
   task_work_run+0xdd/0x1a0 kernel/task_work.c:164
   exit_task_work include/linux/task_work.h:32 [inline]
   do_exit+0xb29/0x2a30 kernel/exit.c:806
   do_group_exit+0xd2/0x2f0 kernel/exit.c:935
   get_signal+0x4b0/0x28c0 kernel/signal.c:2862
   arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
   handle_signal_work kernel/entry/common.c:148 [inline]
   exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
   exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
   __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
   syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
   do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>

Cc: stable@vger.kernel.org
Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220125220358.2091737-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f7e57078

KVM: x86: Pass emulation type to can_emulate_instruction() · 4d31d9ef

由 Sean Christopherson 提交于 1月 20, 2022

Pass the emulation type to kvm_x86_ops.can_emulate_insutrction() so that
a future commit can harden KVM's SEV support to WARN on emulation
scenarios that should never happen.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NLiam Merwick <liam.merwick@oracle.com>
Message-Id: <20220120010719.711476-6-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4d31d9ef

25 1月, 2022 1 次提交

KVM/X86: Make kvm_vcpu_reload_apic_access_page() static · d081a343

由 Quanfa Fu 提交于 12月 19, 2021

Make kvm_vcpu_reload_apic_access_page() static
as it is no longer invoked directly by vmx
and it is also no longer exported.

No functional change intended.
Signed-off-by: NQuanfa Fu <quanfafu@gmail.com>
Message-Id: <20211219091446.174584-1-quanfafu@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d081a343

20 1月, 2022 1 次提交

KVM: VMX: Don't do full kick when triggering posted interrupt "fails" · 0f65a9d3

由 Sean Christopherson 提交于 12月 08, 2021

Replace the full "kick" with just the "wake" in the fallback path when
triggering a virtual interrupt via a posted interrupt fails because the
guest is not IN_GUEST_MODE.  If the guest transitions into guest mode
between the check and the kick, then it's guaranteed to see the pending
interrupt as KVM syncs the PIR to IRR (and onto GUEST_RVI) after setting
IN_GUEST_MODE.  Kicking the guest in this case is nothing more than an
unnecessary VM-Exit (and host IRQ).

Opportunistically update comments to explain the various ordering rules
and barriers at play.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20211208015236.1616697-17-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0f65a9d3

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功