提交 · 04c975121cae872acb553e1ad0a132f9b4a2640b · openeuler / Kernel

14 4月, 2022 1 次提交

KVM: x86/xen: Remove the redundantly included header file lapic.h · 04c97512

由 Like Xu 提交于 4月 06, 2022

The header lapic.h is included more than once, remove one of them.
Signed-off-by: NLike Xu <likexu@tencent.com>
Message-Id: <20220406063715.55625-2-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

04c97512

12 4月, 2022 2 次提交

KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU · 42dcbe7d

由 Vitaly Kuznetsov 提交于 4月 07, 2022

The following WARN is triggered from kvm_vm_ioctl_set_clock():
 WARNING: CPU: 10 PID: 579353 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 mark_page_dirty_in_slot+0x6c/0x80 [kvm]
 ...
 CPU: 10 PID: 579353 Comm: qemu-system-x86 Tainted: G        W  O      5.16.0.stable #20
 Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
 RIP: 0010:mark_page_dirty_in_slot+0x6c/0x80 [kvm]
 ...
 Call Trace:
  <TASK>
  ? kvm_write_guest+0x114/0x120 [kvm]
  kvm_hv_invalidate_tsc_page+0x9e/0xf0 [kvm]
  kvm_arch_vm_ioctl+0xa26/0xc50 [kvm]
  ? schedule+0x4e/0xc0
  ? __cond_resched+0x1a/0x50
  ? futex_wait+0x166/0x250
  ? __send_signal+0x1f1/0x3d0
  kvm_vm_ioctl+0x747/0xda0 [kvm]
  ...

The WARN was introduced by commit 03c0304a86bc ("KVM: Warn if
mark_page_dirty() is called without an active vCPU") but the change seems
to be correct (unlike Hyper-V TSC page update mechanism). In fact, there's
no real need to actually write to guest memory to invalidate TSC page, this
can be done by the first vCPU which goes through kvm_guest_time_update().
Reported-by: NMaxim Levitsky <mlevitsk@redhat.com>
Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
Suggested-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220407201013.963226-1-vkuznets@redhat.com>

42dcbe7d

KVM: SVM: Do not activate AVIC for SEV-enabled guest · c538dc79

由 Suravee Suthikulpanit 提交于 4月 08, 2022

Since current AVIC implementation cannot support encrypted memory,
inhibit AVIC for SEV-enabled guest.
Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <20220408133710.54275-1-suravee.suthikulpanit@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c538dc79

09 4月, 2022 2 次提交

RISC-V: KVM: include missing hwcap.h into vcpu_fp · 4054eee9

由 Heiko Stuebner 提交于 4月 09, 2022

vcpu_fp uses the riscv_isa_extension mechanism which gets
defined in hwcap.h but doesn't include that head file.

While it seems to work in most cases, in certain conditions
this can lead to build failures like

../arch/riscv/kvm/vcpu_fp.c: In function ‘kvm_riscv_vcpu_fp_reset’:
../arch/riscv/kvm/vcpu_fp.c:22:13: error: implicit declaration of function ‘riscv_isa_extension_available’ [-Werror=implicit-function-declaration]
   22 |         if (riscv_isa_extension_available(&isa, f) ||
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../arch/riscv/kvm/vcpu_fp.c:22:49: error: ‘f’ undeclared (first use in this function)
   22 |         if (riscv_isa_extension_available(&isa, f) ||

Fix this by simply including the necessary header.

Fixes: 0a86512d ("RISC-V: KVM: Factor-out FP virtualization into separate
sources")
Signed-off-by: NHeiko Stuebner <heiko@sntech.de>
Signed-off-by: NAnup Patel <anup@brainfault.org>

4054eee9

RISC-V: KVM: Don't clear hgatp CSR in kvm_arch_vcpu_put() · 8c3ce496

由 Anup Patel 提交于 4月 09, 2022

We might have RISC-V systems (such as QEMU) where VMID is not part
of the TLB entry tag so these systems will have to flush all TLB
entries upon any change in hgatp.VMID.

Currently, we zero-out hgatp CSR in kvm_arch_vcpu_put() and we
re-program hgatp CSR in kvm_arch_vcpu_load(). For above described
systems, this will flush all TLB entries whenever VCPU exits to
user-space hence reducing performance.

This patch fixes above described performance issue by not clearing
hgatp CSR in kvm_arch_vcpu_put().

Fixes: 34bde9d8 ("RISC-V: KVM: Implement VCPU world-switch")
Cc: stable@vger.kernel.org
Signed-off-by: NAnup Patel <apatel@ventanamicro.com>
Signed-off-by: NAnup Patel <anup@brainfault.org>

8c3ce496

06 4月, 2022 6 次提交

KVM: arm64: mixed-width check should be skipped for uninitialized vCPUs · 26bf74bd

由 Reiji Watanabe 提交于 3月 28, 2022

KVM allows userspace to configure either all EL1 32bit or 64bit vCPUs
for a guest. At vCPU reset, vcpu_allowed_register_width() checks
if the vcpu's register width is consistent with all other vCPUs'.
Since the checking is done even against vCPUs that are not initialized
(KVM_ARM_VCPU_INIT has not been done) yet, the uninitialized vCPUs
are erroneously treated as 64bit vCPU, which causes the function to
incorrectly detect a mixed-width VM.

Introduce KVM_ARCH_FLAG_EL1_32BIT and KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED
bits for kvm->arch.flags. A value of the EL1_32BIT bit indicates that
the guest needs to be configured with all 32bit or 64bit vCPUs, and
a value of the REG_WIDTH_CONFIGURED bit indicates if a value of the
EL1_32BIT bit is valid (already set up). Values in those bits are set at
the first KVM_ARM_VCPU_INIT for the guest based on KVM_ARM_VCPU_EL1_32BIT
configuration for the vCPU.

Check vcpu's register width against those new bits at the vcpu's
KVM_ARM_VCPU_INIT (instead of against other vCPUs' register width).

Fixes: 66e94d5c ("KVM: arm64: Prevent mixed-width VM creation")
Signed-off-by: NReiji Watanabe <reijiw@google.com>
Reviewed-by: NOliver Upton <oupton@google.com>
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220329031924.619453-2-reijiw@google.com

26bf74bd

KVM: arm64: vgic: Remove unnecessary type castings · c707663e

由 Yu Zhe 提交于 3月 29, 2022

Remove unnecessary casts.
Signed-off-by: NYu Zhe <yuzhe@nfschina.com>
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220329102059.268983-1-yuzhe@nfschina.com

c707663e

KVM: arm64: Don't split hugepages outside of MMU write lock · f587661f

由 Oliver Upton 提交于 4月 01, 2022

It is possible to take a stage-2 permission fault on a page larger than
PAGE_SIZE. For example, when running a guest backed by 2M HugeTLB, KVM
eagerly maps at the largest possible block size. When dirty logging is
enabled on a memslot, KVM does *not* eagerly split these 2M stage-2
mappings and instead clears the write bit on the pte.

Since dirty logging is always performed at PAGE_SIZE granularity, KVM
lazily splits these 2M block mappings down to PAGE_SIZE in the stage-2
fault handler. This operation must be done under the write lock. Since
commit f783ef1c ("KVM: arm64: Add fast path to handle permission
relaxation during dirty logging"), the stage-2 fault handler
conditionally takes the read lock on permission faults with dirty
logging enabled. To that end, it is possible to split a 2M block mapping
while only holding the read lock.

The problem is demonstrated by running kvm_page_table_test with 2M
anonymous HugeTLB, which splats like so:

  WARNING: CPU: 5 PID: 15276 at arch/arm64/kvm/hyp/pgtable.c:153 stage2_map_walk_leaf+0x124/0x158

  [...]

  Call trace:
  stage2_map_walk_leaf+0x124/0x158
  stage2_map_walker+0x5c/0xf0
  __kvm_pgtable_walk+0x100/0x1d4
  __kvm_pgtable_walk+0x140/0x1d4
  __kvm_pgtable_walk+0x140/0x1d4
  kvm_pgtable_walk+0xa0/0xf8
  kvm_pgtable_stage2_map+0x15c/0x198
  user_mem_abort+0x56c/0x838
  kvm_handle_guest_abort+0x1fc/0x2a4
  handle_exit+0xa4/0x120
  kvm_arch_vcpu_ioctl_run+0x200/0x448
  kvm_vcpu_ioctl+0x588/0x664
  __arm64_sys_ioctl+0x9c/0xd4
  invoke_syscall+0x4c/0x144
  el0_svc_common+0xc4/0x190
  do_el0_svc+0x30/0x8c
  el0_svc+0x28/0xcc
  el0t_64_sync_handler+0x84/0xe4
  el0t_64_sync+0x1a4/0x1a8

Fix the issue by only acquiring the read lock if the guest faulted on a
PAGE_SIZE granule w/ dirty logging enabled. Add a WARN to catch locking
bugs in future changes.

Fixes: f783ef1c ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging")
Cc: Jing Zhang <jingzhangos@google.com>
Signed-off-by: NOliver Upton <oupton@google.com>
Reviewed-by: NReiji Watanabe <reijiw@google.com>
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220401194652.950240-1-oupton@google.com

f587661f

KVM: arm64: Drop unneeded minor version check from PSCI v1.x handler · 73b725c7

由 Oliver Upton 提交于 3月 22, 2022

We already sanitize the guest's PSCI version when it is being written by
userspace, rejecting unsupported version numbers. Additionally, the
'minor' parameter to kvm_psci_1_x_call() is a constant known at compile
time for all callsites.

Though it is benign, the additional check against the
PSCI kvm_psci_1_x_call() is unnecessary and likely to be missed the next
time KVM raises its maximum PSCI version. Drop the check altogether and
rely on sanitization when the PSCI version is set by userspace.

No functional change intended.
Signed-off-by: NOliver Upton <oupton@google.com>
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220322183538.2757758-4-oupton@google.com

73b725c7

KVM: arm64: Actually prevent SMC64 SYSTEM_RESET2 from AArch32 · 827c2ab3

由 Oliver Upton 提交于 3月 22, 2022

The SMCCC does not allow the SMC64 calling convention to be used from
AArch32. While KVM checks to see if the calling convention is allowed in
PSCI_1_0_FN_PSCI_FEATURES, it does not actually prevent calls to
unadvertised PSCI v1.0+ functions.

Hoist the check to see if the requested function is allowed into
kvm_psci_call(), thereby preventing SMC64 calls from AArch32 for all
PSCI versions.

Fixes: d43583b8 ("KVM: arm64: Expose PSCI SYSTEM_RESET2 call to the guest")
Acked-by: NWill Deacon <will@kernel.org>
Reviewed-by: NReiji Watanabe <reijiw@google.com>
Signed-off-by: NOliver Upton <oupton@google.com>
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220322183538.2757758-3-oupton@google.com

827c2ab3

KVM: arm64: Generally disallow SMC64 for AArch32 guests · 2da0aebc

由 Oliver Upton 提交于 3月 22, 2022

The only valid calling SMC calling convention from an AArch32 state is
SMC32. Disallow any PSCI function that sets the SMC64 function ID bit
when called from AArch32 rather than comparing against known SMC64 PSCI
functions.

Note that without this change KVM advertises the SMC64 flavor of
SYSTEM_RESET2 to AArch32 guests.

Fixes: d43583b8 ("KVM: arm64: Expose PSCI SYSTEM_RESET2 call to the guest")
Acked-by: NWill Deacon <will@kernel.org>
Reviewed-by: NReiji Watanabe <reijiw@google.com>
Reviewed-by: NAndrew Jones <drjones@redhat.com>
Signed-off-by: NOliver Upton <oupton@google.com>
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220322183538.2757758-2-oupton@google.com

2da0aebc

05 4月, 2022 3 次提交

KVM: x86/mmu: remove unnecessary flush_workqueue() · 3203a56a

由 Lv Ruyi 提交于 4月 01, 2022

All work currently pending will be done first by calling destroy_workqueue,
so there is unnecessary to flush it explicitly.
Reported-by: NZeal Robot <zealci@zte.com.cn>
Signed-off-by: NLv Ruyi <lv.ruyi@zte.com.cn>
Reviewed-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220401083530.2407703-1-lv.ruyi@zte.com.cn>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3203a56a

KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 1d0e8480

由 Sean Christopherson 提交于 3月 31, 2022

Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as
-1 is technically undefined behavior when its value is read out by
param_get_bool(), as boolean values are supposed to be '0' or '1'.

Alternatively, KVM could define a custom getter for the param, but the
auto value doesn't depend on the vendor module in any way, and printing
"auto" would be unnecessarily unfriendly to the user.

In addition to fixing the undefined behavior, resolving the auto value
also fixes the scenario where the auto value resolves to N and no vendor
module is loaded.  Previously, -1 would result in Y being printed even
though KVM would ultimately disable the mitigation.

Rename the existing MMU module init/exit helpers to clarify that they're
invoked with respect to the vendor module, and add comments to document
why KVM has two separate "module init" flows.

  =========================================================================
  UBSAN: invalid-load in kernel/params.c:320:33
  load of value 255 is not a valid value for type '_Bool'
  CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   ubsan_epilogue+0x5/0x40
   __ubsan_handle_load_invalid_value.cold+0x43/0x48
   param_get_bool.cold+0xf/0x14
   param_attr_show+0x55/0x80
   module_attr_show+0x1c/0x30
   sysfs_kf_seq_show+0x93/0xc0
   seq_read_iter+0x11c/0x450
   new_sync_read+0x11b/0x1a0
   vfs_read+0xf0/0x190
   ksys_read+0x5f/0xe0
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  =========================================================================

Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Cc: stable@vger.kernel.org
Reported-by: NBruno Goncalves <bgoncalv@redhat.com>
Reported-by: NJan Stancek <jstancek@redhat.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220331221359.3912754-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1d0e8480

KVM: SEV: Add cond_resched() to loop in sev_clflush_pages() · 00c22013

由 Peter Gonda 提交于 3月 30, 2022

Add resched to avoid warning from sev_clflush_pages() with large number
of pages.
Signed-off-by: NPeter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Message-Id: <20220330164306.2376085-1-pgonda@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

00c22013

02 4月, 2022 26 次提交

KVM: x86/mmu: Don't rebuild page when the page is synced and no tlb flushing is required · 8d5678a7

由 Hou Wenlong 提交于 3月 15, 2022

Before Commit c3e5e415 ("KVM: X86: Change kvm_sync_page()
to return true when remote flush is needed"), the return value
of kvm_sync_page() indicates whether the page is synced, and
kvm_mmu_get_page() would rebuild page when the sync fails.
But now, kvm_sync_page() returns false when the page is
synced and no tlb flushing is required, which leads to
rebuild page in kvm_mmu_get_page(). So return the return
value of mmu->sync_page() directly and check it in
kvm_mmu_get_page(). If the sync fails, the page will be
zapped and the invalid_list is not empty, so set flush as
true is accepted in mmu_sync_children().

Cc: stable@vger.kernel.org
Fixes: c3e5e415 ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed")
Signed-off-by: NHou Wenlong <houwenlong.hwl@antgroup.com>
Acked-by: NLai Jiangshan <jiangshanlai@gmail.com>
Message-Id: <0dabeeb789f57b0d793f85d073893063e692032d.1647336064.git.houwenlong.hwl@antgroup.com>
[mmu_sync_children should not flush if the page is zapped. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8d5678a7

KVM: x86: optimize PKU branching in kvm_load_{guest|host}_xsave_state · 945024d7

由 Jon Kohler 提交于 3月 23, 2022

kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
part of which is managing memory protection key state. The latest
arch.pkru is updated with a rdpkru, and if that doesn't match the base
host_pkru (which about 70% of the time), we issue a __write_pkru.

To improve performance, implement the following optimizations:
 1. Reorder if conditions prior to wrpkru in both
    kvm_load_{guest|host}_xsave_state.

    Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
    checked first, which when instrumented in our environment appeared
    to be always true and less overall work than kvm_read_cr4_bits.

    For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
    one position. When instrumented, I saw this be true roughly ~70% of
    the time vs the other conditions which were almost always true.
    With this change, we will avoid 3rd condition check ~30% of the time.

 2. Wrap PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS,
    as if the user compiles out this feature, we should not have
    these branches at all.
Signed-off-by: NJon Kohler <jon@nutanix.com>
Message-Id: <20220324004439.6709-1-jon@nutanix.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

945024d7

KVM: x86: SVM: allow AVIC to co-exist with a nested guest running · f44509f8

由 Maxim Levitsky 提交于 3月 22, 2022

Inhibit the AVIC of the vCPU that is running nested for the duration of the
nested run, so that all interrupts arriving from both its vCPU siblings
and from KVM are delivered using normal IPIs and cause that vCPU to vmexit.

Note that unlike normal AVIC inhibition, there is no need to
update the AVIC mmio memslot, because the nested guest uses its
own set of paging tables.
That also means that AVIC doesn't need to be inhibited VM wide.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322174050.241850-7-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f44509f8

KVM: x86: allow per cpu apicv inhibit reasons · d5fa597e

由 Maxim Levitsky 提交于 3月 22, 2022

Add optional callback .vcpu_get_apicv_inhibit_reasons returning
extra inhibit reasons that prevent APICv from working on this vCPU.
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322174050.241850-6-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d5fa597e

KVM: x86: nSVM: implement nested vGIF · 0b349662

由 Maxim Levitsky 提交于 3月 22, 2022

In case L1 enables vGIF for L2, the L2 cannot affect L1's GIF, regardless
of STGI/CLGI intercepts, and since VM entry enables GIF, this means
that L1's GIF is always 1 while L2 is running.

Thus in this case leave L1's vGIF in vmcb01, while letting L2
control the vGIF thus implementing nested vGIF.

Also allow KVM to toggle L1's GIF during nested entry/exit
by always using vmcb01.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322174050.241850-5-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0b349662

KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE · 74fd41ed

由 Maxim Levitsky 提交于 3月 22, 2022

Expose the pause filtering and threshold in the guest CPUID
and support PAUSE filtering when possible:

- If the L0 doesn't intercept PAUSE (cpu_pm=on), then allow L1 to
  have full control over PAUSE filtering.

- if the L1 doesn't intercept PAUSE, use host values and update
  the adaptive count/threshold even when running nested.

- Otherwise always exit to L1; it is not really possible to merge
  the fields correctly.  It is expected that in this case, userspace
  will not enable this feature in the guest CPUID, to avoid having the
  guest update both fields pointlessly.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322174050.241850-4-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

74fd41ed

KVM: x86: nSVM: implement nested LBR virtualization · d20c796c

由 Maxim Levitsky 提交于 3月 22, 2022

This was tested with kvm-unit-test that was developed
for this purpose.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322174050.241850-3-mlevitsk@redhat.com>
[Copy all of DEBUGCTL except for reserved bits. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d20c796c

KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running · 1d5a1b58

由 Maxim Levitsky 提交于 3月 22, 2022

When L2 is running without LBR virtualization, we should ensure
that L1's LBR msrs continue to update as usual.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322174050.241850-2-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1d5a1b58

KVM: x86: SVM: remove vgif_enabled() · ea91559b

由 Maxim Levitsky 提交于 3月 22, 2022

KVM always uses vgif when allowed, thus there is
no need to query current vmcb for it
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-9-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ea91559b

kvm: x86: SVM: use vmcb* instead of svm->vmcb where it makes sense · db663af4

由 Maxim Levitsky 提交于 3月 22, 2022

This makes the code a bit shorter and cleaner.

No functional change intended.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-4-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

db663af4

KVM: x86: SVM: use vmcb01 in init_vmcb · 1ee73a33

由 Maxim Levitsky 提交于 3月 22, 2022

Clarify that this function is not used to initialize any part of
the vmcb02.  No functional change intended.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1ee73a33

KVM: x86: Support the vCPU preemption check with nopvspin and realtime hint · d063de55

由 Li RongQing 提交于 3月 09, 2022

If guest kernel is configured with nopvspin, or CONFIG_PARAVIRT_SPINLOCK
is disabled, or guest find its has dedicated pCPUs from realtime hint
feature, the pvspinlock will be disabled, and vCPU preemption check
is disabled too.

Hoever, KVM still can emulating HLT for vCPU for both cases.  Checking if a vCPU
is preempted or not can still boost performance in IPI-heavy scenarios such as
unixbench file copy and pipe-based context switching tests:  Here the vCPU is
running with a dedicated pCPU, so the guest kernel has nopvspin but is
emulating HLT for the vCPU:

Testcase                                  Base    with patch
System Benchmarks Index Values            INDEX     INDEX
Dhrystone 2 using register variables     3278.4    3277.7
Double-Precision Whetstone                822.8     825.8
Execl Throughput                         1296.5     941.1
File Copy 1024 bufsize 2000 maxblocks    2124.2    2142.7
File Copy 256 bufsize 500 maxblocks      1335.9    1353.6
File Copy 4096 bufsize 8000 maxblocks    4256.3    4760.3
Pipe Throughput                          1050.1    1054.0
Pipe-based Context Switching              243.3     352.0
Process Creation                          820.1     814.4
Shell Scripts (1 concurrent)             2169.0    2086.0
Shell Scripts (8 concurrent)             7710.3    7576.3
System Call Overhead                      672.4     673.9
                                      ========    =======
System Benchmarks Index Score             1467.2   1483.0

Move the setting of pv_ops.lock.vcpu_is_preempted to kvm_guest_init, so
that it does not depend on pvspinlock.
Signed-off-by: NLi RongQing <lirongqing@baidu.com>
Message-Id: <1646815610-43315-1-git-send-email-lirongqing@baidu.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d063de55

KVM: x86: Don't snapshot "max" TSC if host TSC is constant · 741e511b

由 Sean Christopherson 提交于 2月 25, 2022

Don't snapshot tsc_khz into max_tsc_khz during KVM initialization if the
host TSC is constant, in which case the actual TSC frequency will never
change and thus capturing the "max" TSC during initialization is
unnecessary, KVM can simply use tsc_khz during VM creation.

On CPUs with constant TSC, but not a hardware-specified TSC frequency,
snapshotting max_tsc_khz and using that to set a VM's default TSC
frequency can lead to KVM thinking it needs to manually scale the guest's
TSC if refining the TSC completes after KVM snapshots tsc_khz. The
actual frequency never changes, only the kernel's calculation of what
that frequency is changes. On systems without hardware TSC scaling, this
either puts KVM into "always catchup" mode (extremely inefficient), or
prevents creating VMs altogether.

Ideally, KVM would not be able to race with TSC refinement, or would have
a hook into tsc_refine_calibration_work() to get an alert when refinement
is complete. Avoiding the race altogether isn't practical as refinement
takes a relative eternity; it's deliberately put on a work queue outside
of the normal boot sequence to avoid unnecessarily delaying boot.

Adding a hook is doable, but somewhat gross due to KVM's ability to be
built as a module. And if the TSC is constant, which is likely the case
for every VMX/SVM-capable CPU produced in the last decade, the race can
be hit if and only if userspace is able to create a VM before TSC
refinement completes; refinement is slow, but not that slow.

For now, punt on a proper fix, as not taking a snapshot can help some
uses cases and not taking a snapshot is arguably correct irrespective of
the race with refinement.

[ dwmw2: Rebase on top of KVM-wide default_tsc_khz to ensure that all
vCPUs get the same frequency even if we hit the race. ]

Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Anton Romanov <romanton@google.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220225145304.36166-3-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

741e511b

KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl. · ffbb61d0

由 David Woodhouse 提交于 2月 25, 2022

This sets the default TSC frequency for subsequently created vCPUs.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220225145304.36166-2-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ffbb61d0

KVM: x86/i8259: Remove a dead store of irq in a conditional block · fe3787a0

由 Like Xu 提交于 3月 01, 2022

The [clang-analyzer-deadcode.DeadStores] helper reports
that the value stored to 'irq' is never read.
Signed-off-by: NLike Xu <likexu@tencent.com>
Message-Id: <20220301120217.38092-1-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fe3787a0

KVM: VMX: Prepare VMCS setting for posted interrupt enabling when APICv is available · 1421211a

由 Zeng Guang 提交于 3月 15, 2022

Currently KVM setup posted interrupt VMCS only depending on
per-vcpu APICv activation status at the vCPU creation time.
However, this status can be toggled dynamically under some
circumstance. So potentially, later posted interrupt enabling
may be problematic without VMCS readiness.

To fix this, always settle the VMCS setting for posted interrupt
as long as APICv is available and lapic locates in kernel.
Signed-off-by: NZeng Guang <guang.zeng@intel.com>
Message-Id: <20220315145836.9910-1-guang.zeng@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1421211a

KVM: x86/xen: handle PV spinlocks slowpath · 1a65105a

由 Boris Ostrovsky 提交于 3月 03, 2022

Add support for SCHEDOP_poll hypercall.

This implementation is optimized for polling for a single channel, which
is what Linux does. Polling for multiple channels is not especially
efficient (and has not been tested).

PV spinlocks slow path uses this hypercall, and explicitly crash if it's
not supported.

[ dwmw2: Rework to use kvm_vcpu_halt(), not supported for 32-bit guests ]
Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-17-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1a65105a

KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND · 661a20fa

由 David Woodhouse 提交于 3月 03, 2022

At the end of the patch series adding this batch of event channel
acceleration features, finally add the feature bit which advertises
them and document it all.

For SCHEDOP_poll we need to wake a polling vCPU when a given port
is triggered, even when it's masked — and we want to implement that
in the kernel, for efficiency. So we want the kernel to know that it
has sole ownership of event channel delivery. Thus, we allow
userspace to make the 'promise' by setting the corresponding feature
bit in its KVM_XEN_HVM_CONFIG call. As we implement SCHEDOP_poll
bypass later, we will do so only if that promise has been made by
userspace.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-16-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

661a20fa

KVM: x86/xen: Support per-vCPU event channel upcall via local APIC · fde0451b

由 David Woodhouse 提交于 3月 03, 2022

Windows uses a per-vCPU vector, and it's delivered via the local APIC
basically like an MSI (with associated EOI) unlike the traditional
guest-wide vector which is just magically asserted by Xen (and in the
KVM case by kvm_xen_has_interrupt() / kvm_cpu_get_extint()).

Now that the kernel is able to raise event channel events for itself,
being able to do so for Windows guests is also going to be useful.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-15-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fde0451b

KVM: x86/xen: Kernel acceleration for XENVER_version · 28d1629f

由 David Woodhouse 提交于 3月 03, 2022

Turns out this is a fast path for PV guests because they use it to
trigger the event channel upcall. So letting it bounce all the way up
to userspace is not great.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-14-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

28d1629f

KVM: x86/xen: handle PV timers oneshot mode · 53639526

由 Joao Martins 提交于 3月 03, 2022

If the guest has offloaded the timer virq, handle the following
hypercalls for programming the timer:

    VCPUOP_set_singleshot_timer
    VCPUOP_stop_singleshot_timer
    set_timer_op(timestamp_ns)

The event channel corresponding to the timer virq is then used to inject
events once timer deadlines are met. For now we back the PV timer with
hrtimer.

[ dwmw2: Add save/restore, 32-bit compat mode, immediate delivery,
         don't check timer in kvm_vcpu_has_event() ]
Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-13-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

53639526

KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID · 942c2490

由 David Woodhouse 提交于 3月 03, 2022

In order to intercept hypercalls such as VCPUOP_set_singleshot_timer, we
need to be aware of the Xen CPU numbering.

This looks a lot like the Hyper-V handling of vpidx, for obvious reasons.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-12-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

942c2490

KVM: x86/xen: handle PV IPI vcpu yield · 0ec6c5c5

由 Joao Martins 提交于 3月 03, 2022

Cooperative Linux guests after an IPI-many may yield vcpu if
any of the IPI'd vcpus were preempted (i.e. runstate is 'runnable'.)
Support SCHEDOP_yield for handling yield.
Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-11-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0ec6c5c5

KVM: x86/xen: intercept EVTCHNOP_send from guests · 2fd6df2f

由 Joao Martins 提交于 3月 03, 2022

Userspace registers a sending @port to either deliver to an @eventfd
or directly back to a local event channel port.

After binding events the guest or host may wish to bind those
events to a particular vcpu. This is usually done for unbound
and and interdomain events. Update requests are handled via the
KVM_XEN_EVTCHN_UPDATE flag.

Unregistered ports are handled by the emulator.
Co-developed-by: NAnkur Arora <ankur.a.arora@oracle.com>
Co-developed-By: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
Signed-off-by: NAnkur Arora <ankur.a.arora@oracle.com>
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-10-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2fd6df2f

KVM: x86/xen: Support direct injection of event channel events · 35025735

由 David Woodhouse 提交于 3月 03, 2022

This adds a KVM_XEN_HVM_EVTCHN_SEND ioctl which allows direct injection
of events given an explicit { vcpu, port, priority } in precisely the
same form that those fields are given in the IRQ routing table.

Userspace is currently able to inject 2-level events purely by setting
the bits in the shared_info and vcpu_info, but FIFO event channels are
harder to deal with; we will need the kernel to take sole ownership of
delivery when we support those.

A patch advertising this feature with a new bit in the KVM_CAP_XEN_HVM
ioctl will be added in a subsequent patch.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-9-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

35025735

KVM: x86/xen: Make kvm_xen_set_evtchn() reusable from other places · 8733068b

由 David Woodhouse 提交于 3月 03, 2022

Clean it up to return -errno on error consistently, while still being
compatible with the return conventions for kvm_arch_set_irq_inatomic()
and the kvm_set_irq() callback.

We use -ENOTCONN to indicate when the port is masked. No existing users
care, except that it's negative.

Also allow it to optimise the vCPU lookup. Unless we abuse the lapic
map, there is no quick lookup from APIC ID to a vCPU; the logic in
kvm_get_vcpu_by_id() will just iterate over all vCPUs till it finds
the one it wants. So do that just once and stash the result in the
struct kvm_xen_evtchn for next time.
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-8-dwmw2@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8733068b

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功