  1. 04 Feb 2021, 11 commits
    • KVM: x86/xen: register vcpu info · 73e69a86
      Authored by Joao Martins
      The vcpu_info structure supersedes the per-vCPU area of the shared_info
      page, and the guest vCPUs will use it instead.
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      73e69a86
    • KVM: x86/xen: register shared_info page · 13ffb97a
      Authored by Joao Martins
      Add KVM_XEN_ATTR_TYPE_SHARED_INFO to let the hypervisor know where the
      guest's shared_info page is.
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
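      For illustration, a minimal userspace sketch of registering the page;
      this assumes the mainline KVM_XEN_HVM_SET_ATTR ioctl and the mainline
      kvm_xen_hvm_attr layout, which may differ from this revision of the
      series:

        #include <stdint.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Tell KVM which guest frame number holds the Xen shared_info page. */
        static int set_shared_info(int vm_fd, uint64_t gfn)
        {
                struct kvm_xen_hvm_attr ha;

                memset(&ha, 0, sizeof(ha));
                ha.type = KVM_XEN_ATTR_TYPE_SHARED_INFO;
                ha.u.shared_info.gfn = gfn;     /* assumed field name */
                return ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &ha);
        }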
      13ffb97a
    • a3833b81
    • KVM: x86/xen: intercept xen hypercalls if enabled · 23200b7a
      Authored by Joao Martins
      Add a new exit reason for the emulator to handle Xen hypercalls.
      
      Since this means KVM owns the ABI, dispense with the facility for the
      VMM to provide its own copy of the hypercall pages; just fill them in
      directly using VMCALL/VMMCALL as we do for the Hyper-V hypercall page.
      
      This behaviour is enabled by a new INTERCEPT_HCALL flag in the
      KVM_XEN_HVM_CONFIG ioctl structure, and advertised by the same flag
      being returned from the KVM_CAP_XEN_HVM check.
      
      Rename xen_hvm_config() to kvm_xen_write_hypercall_page() and move it
      to the nascent xen.c while we're at it, and add a test case.
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
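      For illustration, a hedged userspace sketch of enabling the intercept;
      the flag is spelled KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL in the mainline
      uapi, which is assumed here:

        #include <stdint.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Opt in to KVM-owned hypercall pages and KVM_EXIT_XEN exits. */
        static int enable_xen_hcall_intercept(int kvm_fd, int vm_fd,
                                              uint32_t msr)
        {
                struct kvm_xen_hvm_config cfg;

                if (!(ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XEN_HVM) &
                      KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL))
                        return -1;              /* flag not advertised */

                memset(&cfg, 0, sizeof(cfg));
                cfg.flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL;
                cfg.msr = msr;                  /* hypercall-page wrmsr index */
                return ioctl(vm_fd, KVM_XEN_HVM_CONFIG, &cfg);
        }

      Guest hypercalls then surface to the VMM as KVM_EXIT_XEN exits.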
      23200b7a
    • KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map · 9a77daac
      Authored by Ben Gardon
      To prepare for handling page faults in parallel, change the TDP MMU
      page fault handler to use atomic operations to set SPTEs so that changes
      are not lost if multiple threads attempt to modify the same SPTE.
      Reviewed-by: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Ben Gardon <bgardon@google.com>

      Message-Id: <20210202185734.1680553-21-bgardon@google.com>
      [Document new locking rules. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
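      The core idea, as a hedged sketch (not the exact kernel code):

        /* Publish a new SPTE only if no other thread changed it since it
         * was read; on failure the fault handler re-reads and retries. */
        static bool try_set_spte_atomic(u64 *sptep, u64 old_spte, u64 new_spte)
        {
                /* cmpxchg64() returns the prior value; equality means we won. */
                return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
        }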
      9a77daac
    • KVM: x86/mmu: Use an rwlock for the x86 MMU · 531810ca
      Authored by Ben Gardon
      Add a read/write lock to be used in place of the MMU spinlock on x86.
      The rwlock will enable the TDP MMU to handle page faults and other
      operations in parallel in future commits.
      Reviewed-by: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Ben Gardon <bgardon@google.com>

      Message-Id: <20210202185734.1680553-19-bgardon@google.com>
      [Introduce virt/kvm/mmu_lock.h - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
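      A hedged sketch of the locking pattern this enables (function names
      are illustrative only):

        #include <linux/spinlock.h>

        static DEFINE_RWLOCK(mmu_lock);

        static void tdp_mmu_fault_path(void)
        {
                read_lock(&mmu_lock);   /* many faults can run in parallel */
                /* ... install SPTEs with atomic ops ... */
                read_unlock(&mmu_lock);
        }

        static void zap_path(void)
        {
                write_lock(&mmu_lock);  /* exclusive against all walkers */
                /* ... zap ranges and free page tables ... */
                write_unlock(&mmu_lock);
        }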
      531810ca
    • KVM: x86: use static calls to reduce kvm_x86_ops overhead · b3646477
      Authored by Jason Baron
      Convert kvm_x86_ops to use static calls. Note that all kvm_x86_ops are
      covered here except for 'pmu_ops' and 'nested ops'.
      
      Here are some numbers running cpuid in a loop of 1 million calls averaged
      over 5 runs, measured in the vm (lower is better).
      
      Intel Xeon 3000MHz:
      
                 |default    |mitigations=off
      -------------------------------------
      vanilla    |.671s      |.486s
      static call|.573s(-15%)|.458s(-6%)
      
      AMD EPYC 2500MHz:
      
                 |default    |mitigations=off
      -------------------------------------
      vanilla    |.710s      |.609s
      static call|.664s(-6%) |.609s(0%)
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Message-Id: <e057bf1b8a7ad15652df6eeba3f907ae758d3399.1610680941.git.jbaron@akamai.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
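      A hedged sketch of the conversion pattern, using the generic
      static_call API (the op name is illustrative):

        #include <linux/static_call.h>

        /* One static-call key per op, initially NULL: */
        DEFINE_STATIC_CALL_NULL(kvm_x86_tlb_flush_all,
                        *(((struct kvm_x86_ops *)0)->tlb_flush_all));

        /* Bound once when the vendor module (VMX/SVM) registers its ops:
         *      static_call_update(kvm_x86_tlb_flush_all,
         *                         kvm_x86_ops.tlb_flush_all);
         *
         * Call sites become patched direct calls instead of retpolined
         * indirect calls:
         *      static_call(kvm_x86_tlb_flush_all)(vcpu);
         */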
      b3646477
    • KVM: x86: introduce definitions to support static calls for kvm_x86_ops · 9af5471b
      Authored by Jason Baron
      Use static calls to improve kvm_x86_ops performance. Introduce the
      definitions that will be used by a subsequent patch to actualize the
      savings. Add a new kvm-x86-ops.h header that can be used for the
      definition of static calls. This header is also intended to be
      used to simplify the definition of svm_x86_ops and vmx_x86_ops.
      
      Note that all functions in kvm_x86_ops are covered here except for
      'pmu_ops' and 'nested ops'. I think they can be covered by static
      calls in a similar manner, but they were omitted from this series to
      reduce its scope and because I don't think they have as large a
      performance impact.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Message-Id: <e5cc82ead7ab37b2dceb0837a514f3f8bea4f8d1.1610680941.git.jbaron@akamai.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
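      A hedged sketch of the X-macro idea behind kvm-x86-ops.h (names
      simplified):

        /* kvm-x86-ops.h: list every op exactly once. */
        KVM_X86_OP(vcpu_create)
        KVM_X86_OP(run)
        KVM_X86_OP(handle_exit)
        /* ... */

        /* A consumer defines KVM_X86_OP before including the header, e.g.
         * to declare one static call per op: */
        #define KVM_X86_OP(func) \
                DECLARE_STATIC_CALL(kvm_x86_##func, \
                                    *(((struct kvm_x86_ops *)0)->func));
        #include <asm/kvm-x86-ops.h>
        #undef KVM_X86_OP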
      9af5471b
    • KVM: X86: Rename DR6_INIT to DR6_ACTIVE_LOW · 9a3ecd5e
      Authored by Chenyi Qiang
      DR6_INIT contains the 1-reserved bits as well as the bit that is cleared
      to 0 when the condition (e.g. RTM) happens. The value can be used to
      initialize dr6 and also be the XOR mask between the #DB exit
      qualification (or payload) and DR6.
      
      Since DR6_INIT is used as an initial value only once, rename it to
      DR6_ACTIVE_LOW and use it in the other places as well, which makes the
      incoming changes for the bus lock debug exception simpler.
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20210202090433.13441-2-chenyi.qiang@intel.com>
      [Define DR6_FIXED_1 from DR6_ACTIVE_LOW and DR6_VOLATILE. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
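      The resulting definitions, assuming the mainline values:

        #define DR6_ACTIVE_LOW  0xffff0ff0      /* init value and #DB XOR mask */
        #define DR6_VOLATILE    0x0001e00f      /* guest-changeable bits:
                                                   B0-B3, BD, BS, BT, RTM */
        #define DR6_FIXED_1     (DR6_ACTIVE_LOW & ~DR6_VOLATILE)

        /* e.g. turning a #DB exit qualification/payload into DR6:
         *      dr6 = exit_qualification ^ DR6_ACTIVE_LOW;          */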
      9a3ecd5e
    • KVM: VMX: Enable bus lock VM exit · fe6b6bc8
      Authored by Chenyi Qiang
      A virtual machine can exploit bus locks to degrade system performance.
      A bus lock can be caused by a split-locked access to writeback (WB)
      memory or by using locks on uncacheable (UC) memory. A bus lock is
      typically more than 1000 cycles slower than an atomic operation within
      a cache line, and it also disrupts performance on other cores (which
      must wait for the bus lock to be released before their memory
      operations can complete).
      
      To address the threat, bus lock VM exit is introduced to notify the VMM
      when a bus lock was acquired, allowing it to enforce throttling or other
      policy based mitigations.
      
      A VMM can enable VM exits due to bus locks by setting a new "Bus Lock
      Detection" VM-execution control (bit 30 of the Secondary Processor-based
      VM-execution controls). If delivery of this VM exit was preempted by a
      higher priority VM exit (e.g. EPT misconfiguration, EPT violation, APIC
      access VM exit, APIC write VM exit, exception bitmap exiting), bit 26 of
      the exit reason field in the VMCS is set to 1.
      
      In the current implementation, KVM exposes this capability through
      KVM_CAP_X86_BUS_LOCK_EXIT. User space can query the supported mode
      bitmap (i.e. off and exit) and must enable the feature explicitly (it
      is disabled by default). If KVM detects a bus lock in the guest, it
      exits to user space even when the current exit reason would otherwise
      be handled by KVM internally, and sets a new field, KVM_RUN_BUS_LOCK,
      in vcpu->run->flags to inform user space that a bus lock was detected.
      
      Documentation for the bus lock VM exit is now available in the latest
      "Intel Architecture Instruction Set Extensions Programming Reference".
      
      Document Link:
      https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
      Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20201106090315.18606-4-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
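      A hedged userspace sketch of enabling the capability; the names below
      follow the mainline uapi, where the run->flags bit is spelled
      KVM_RUN_X86_BUS_LOCK:

        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        static int enable_bus_lock_exit(int vm_fd)
        {
                struct kvm_enable_cap cap;

                memset(&cap, 0, sizeof(cap));
                cap.cap = KVM_CAP_X86_BUS_LOCK_EXIT;
                cap.args[0] = KVM_BUS_LOCK_DETECTION_EXIT;
                return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
        }

        /* After KVM_RUN, run->exit_reason == KVM_EXIT_X86_BUS_LOCK, or
         * run->flags & KVM_RUN_X86_BUS_LOCK on another exit reason,
         * indicates that the guest executed a bus lock. */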
      fe6b6bc8
    • KVM: x86/mmu: Remove the defunct update_pte() paging hook · c5e2184d
      Authored by Sean Christopherson
      Remove the update_pte() shadow paging logic, which was obsoleted by
      commit 4731d4c7 ("KVM: MMU: out of sync shadow core"), but never
      removed.  As pointed out by Yu, KVM never write protects leaf page
      tables for the purposes of shadow paging, and instead marks their
      associated shadow page as unsync so that the guest can write PTEs at
      will.
      
      The update_pte() path, which predates the unsync logic, optimizes COW
      scenarios by refreshing leaf SPTEs when they are written, as opposed to
      zapping the SPTE, restarting the guest, and installing the new SPTE on
      the subsequent fault.  Since KVM no longer write-protects leaf page
      tables, update_pte() is unreachable and can be dropped.
      Reported-by: Yu Zhang <yu.c.zhang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210115004051.4099250-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c5e2184d
  2. 08 Jan 2021, 2 commits
    • KVM: SVM: Add support for booting APs in an SEV-ES guest · 647daca2
      Authored by Tom Lendacky
      Typically under KVM, an AP is booted using the INIT-SIPI-SIPI sequence,
      where the guest vCPU register state is updated and the vCPU is then run
      (VMRUN) to begin execution of the AP. For an SEV-ES guest, this won't
      work because the guest register state is encrypted.
      
      Following the GHCB specification, the hypervisor must not alter the guest
      register state, so KVM must track an AP/vCPU boot. Should the guest want
      to park the AP, it must use the AP Reset Hold exit event in place of, for
      example, a HLT loop.
      
      First AP boot (first INIT-SIPI-SIPI sequence):
        Execute the AP (vCPU) as it was initialized and measured by the SEV-ES
        support. It is up to the guest to transfer control of the AP to the
        proper location.
      
      Subsequent AP boot:
        KVM will expect to receive an AP Reset Hold exit event indicating that
        the vCPU is being parked and will require an INIT-SIPI-SIPI sequence to
        awaken it. When the AP Reset Hold exit event is received, KVM will place
        the vCPU into a simulated HLT mode. Upon receiving the INIT-SIPI-SIPI
        sequence, KVM will make the vCPU runnable. It is again up to the guest
        to then transfer control of the AP to the proper location.
      
        To differentiate between an actual HLT and an AP Reset Hold, a new MP
        state is introduced, KVM_MP_STATE_AP_RESET_HOLD, which the vCPU is
        placed in upon receiving the AP Reset Hold exit event. Additionally, to
        communicate the AP Reset Hold exit event up to userspace (if needed), a
        new exit reason is introduced, KVM_EXIT_AP_RESET_HOLD.
      
      A new x86 ops function is introduced, vcpu_deliver_sipi_vector, in order
      to accomplish AP booting. For VMX, vcpu_deliver_sipi_vector is set to the
      original SIPI delivery function, kvm_vcpu_deliver_sipi_vector(). SVM adds
      a new function that, for non SEV-ES guests, invokes the original SIPI
      delivery function, kvm_vcpu_deliver_sipi_vector(), but for SEV-ES guests,
      implements the logic above.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <e8fbebe8eb161ceaabdad7c01a5859a78b424d5e.1609791600.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
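      A hedged userspace sketch of the visible pieces (mainline names
      assumed):

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Is this vCPU parked by an AP Reset Hold (i.e. waiting for an
         * INIT-SIPI-SIPI sequence rather than in a plain HLT)? */
        static int ap_in_reset_hold(int vcpu_fd)
        {
                struct kvm_mp_state mp;

                if (ioctl(vcpu_fd, KVM_GET_MP_STATE, &mp) < 0)
                        return -1;
                return mp.mp_state == KVM_MP_STATE_AP_RESET_HOLD;
        }

        /* When forwarded to userspace, the event appears as
         * run->exit_reason == KVM_EXIT_AP_RESET_HOLD. */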
      647daca2
    • KVM: x86/mmu: Clarify TDP MMU page list invariants · c0dba6e4
      Authored by Ben Gardon
      The tdp_mmu_roots and tdp_mmu_pages in struct kvm_arch should only contain
      pages with tdp_mmu_page set to true. tdp_mmu_pages should not contain any
      pages with a non-zero root_count and tdp_mmu_roots should only contain
      pages with a positive root_count, unless a thread holds the MMU lock and
      is in the process of modifying the list. Various functions expect these
      invariants to be maintained, but they are not explicitly documented. Add
      to the comments on both fields to document the above invariants.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20210107001935.3732070-2-bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c0dba6e4
  3. 15 Dec 2020, 7 commits
    • KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest · ed02b213
      Authored by Tom Lendacky
      The guest FPU state is automatically restored on VMRUN and saved on VMEXIT
      by the hardware, so there is no reason to do this in KVM. Eliminate the
      allocation of the guest_fpu save area and key off that to skip operations
      related to the guest FPU state.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <173e429b4d0d962c6a443c4553ffdaf31b7665a4.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ed02b213
    • KVM: SVM: Do not report support for SMM for an SEV-ES guest · 5719455f
      Authored by Tom Lendacky
      SEV-ES guests do not currently support SMM. Update the has_emulated_msr()
      kvm_x86_ops function to take a struct kvm parameter so that the capability
      can be reported at a VM level.
      
      Since this op is also called during KVM initialization and before a struct
      kvm instance is available, comments will be added to each implementation
      of has_emulated_msr() to indicate the kvm parameter can be null.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <75de5138e33b945d2fb17f81ae507bda381808e3.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
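      A hedged sketch of the changed hook shape; the SVM implementation is
      assumed to look roughly like this:

        /* 'kvm' may be NULL when called during KVM module initialization. */
        static bool svm_has_emulated_msr(struct kvm *kvm, u32 index)
        {
                switch (index) {
                case MSR_IA32_SMBASE:
                        /* SEV-ES guests do not support SMM. */
                        if (kvm && sev_es_guest(kvm))
                                return false;
                        break;
                default:
                        break;
                }
                return true;
        }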
      5719455f
    • KVM: SVM: Add support for CR4 write traps for an SEV-ES guest · 5b51cb13
      Authored by Tom Lendacky
      For SEV-ES guests, the interception of control register write access
      is not recommended. Control register interception occurs prior to the
      control register being modified and the hypervisor is unable to modify
      the control register itself because the register is located in the
      encrypted register state.
      
      SEV-ES support introduces new control register write traps. These traps
      provide intercept support of a control register write after the control
      register has been modified. The new control register value is provided in
      the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
      of the guest control registers.
      
      Add support to track the value of the guest CR4 register using the control
      register write trap so that the hypervisor understands the guest operating
      mode.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <c3880bf2db8693aa26f648528fbc6e967ab46e25.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5b51cb13
    • KVM: SVM: Add support for CR0 write traps for an SEV-ES guest · f27ad38a
      Authored by Tom Lendacky
      For SEV-ES guests, the interception of control register write access
      is not recommended. Control register interception occurs prior to the
      control register being modified and the hypervisor is unable to modify
      the control register itself because the register is located in the
      encrypted register state.
      
      SEV-ES support introduces new control register write traps. These traps
      provide intercept support of a control register write after the control
      register has been modified. The new control register value is provided in
      the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
      of the guest control registers.
      
      Add support to track the value of the guest CR0 register using the control
      register write trap so that the hypervisor understands the guest operating
      mode.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <182c9baf99df7e40ad9617ff90b84542705ef0d7.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f27ad38a
    • KVM: SVM: Support string IO operations for an SEV-ES guest · 7ed9abfe
      Authored by Tom Lendacky
      For an SEV-ES guest, string-based port IO is performed to a shared
      (un-encrypted) page so that both the hypervisor and guest can read or
      write to it and each see the contents.
      
      For string-based port IO operations, invoke SEV-ES specific routines that
      can complete the operation using common KVM port IO support.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <9d61daf0ffda496703717218f415cdc8fd487100.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7ed9abfe
    • KVM: x86: introduce complete_emulated_msr callback · f9a4d621
      Authored by Paolo Bonzini
      This will be used by SEV-ES to inject MSR failure via the GHCB.
      Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f9a4d621
    • KVM: SVM: Add support for the SEV-ES VMSA · add5e2f0
      Authored by Tom Lendacky
      Allocate a page during vCPU creation to be used as the encrypted VM save
      area (VMSA) for the SEV-ES guest. Provide a flag in the kvm_vcpu_arch
      structure that indicates whether the guest state is protected.
      
      When freeing a VMSA page that has been encrypted, the cache contents must
      be flushed using the MSR_AMD64_VM_PAGE_FLUSH MSR before the page is freed.
      
      [ i386 build warnings ]
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <fde272b17eec804f3b9db18c131262fe074015c5.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      add5e2f0
  4. 27 Nov 2020, 1 commit
    • KVM: x86: Fix split-irqchip vs interrupt injection window request · 71cc849b
      Authored by Paolo Bonzini
      kvm_cpu_accept_dm_intr and kvm_vcpu_ready_for_interrupt_injection are
      a hodge-podge of conditions, hacked together to get something that
      more or less works.  But what is actually needed is much simpler;
      in both cases the fundamental question is, do we have a place to stash
      an interrupt if userspace does KVM_INTERRUPT?
      
      In userspace irqchip mode, that is !vcpu->arch.interrupt.injected.
      Currently kvm_event_needs_reinjection(vcpu) covers it, but it is
      unnecessarily restrictive.
      
      In split irqchip mode it's a bit more complicated, we need to check
      kvm_apic_accept_pic_intr(vcpu) (the IRQ window exit is basically an INTACK
      cycle and thus requires ExtINTs not to be masked) as well as
      !pending_userspace_extint(vcpu).  However, there is no need to
      check kvm_event_needs_reinjection(vcpu), since split irqchip keeps
      pending ExtINT state separate from event injection state, and checking
      kvm_cpu_has_interrupt(vcpu) is wrong too since ExtINT has higher
      priority than APIC interrupts.  In fact the latter fixes a bug:
      when userspace requests an IRQ window vmexit, an interrupt in the
      local APIC can cause kvm_cpu_has_interrupt() to be true and thus
      kvm_vcpu_ready_for_interrupt_injection() to return false.  When this
      happens, vcpu_run does not exit to userspace but the interrupt window
      vmexits keep occurring.  The VM loops without any hope of making progress.
      
      Once we try to fix these with something like
      
           return kvm_arch_interrupt_allowed(vcpu) &&
      -        !kvm_cpu_has_interrupt(vcpu) &&
      -        !kvm_event_needs_reinjection(vcpu) &&
      -        kvm_cpu_accept_dm_intr(vcpu);
      +        (!lapic_in_kernel(vcpu)
      +         ? !vcpu->arch.interrupt.injected
      +         : (kvm_apic_accept_pic_intr(vcpu)
      +            && !pending_userspace_extint(v)));
      
      we realize two things.  First, thanks to the previous patch the complex
      conditional can reuse !kvm_cpu_has_extint(vcpu).  Second, the interrupt
      window request in vcpu_enter_guest()
      
              bool req_int_win =
                      dm_request_for_irq_injection(vcpu) &&
                      kvm_cpu_accept_dm_intr(vcpu);
      
      should be kept in sync with kvm_vcpu_ready_for_interrupt_injection():
      it is unnecessary to ask the processor for an interrupt window
      if we would not be able to return to userspace.  Therefore,
      kvm_cpu_accept_dm_intr(vcpu) is basically !kvm_cpu_has_extint(vcpu)
      ANDed with the existing check for masked ExtINT.  It all makes sense:
      
      - we can accept an interrupt from userspace if there is a place
        to stash it (and, for irqchip split, ExtINTs are not masked).
        Interrupts from userspace _can_ be accepted even if right now
        EFLAGS.IF=0.
      
      - in order to tell userspace we will inject its interrupt ("IRQ
        window open" i.e. kvm_vcpu_ready_for_interrupt_injection), both
        KVM and the vCPU need to be ready to accept the interrupt.
      
      ... and this is what the patch implements.
      Reported-by: David Woodhouse <dwmw@amazon.co.uk>
      Analyzed-by: David Woodhouse <dwmw@amazon.co.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
      Tested-by: David Woodhouse <dwmw@amazon.co.uk>
      71cc849b
  5. 15 Nov 2020, 3 commits
    • KVM: X86: Implement ring-based dirty memory tracking · fb04a1ed
      Authored by Peter Xu
      This patch is heavily based on previous work from Lei Cao
      <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
      
      KVM currently uses large bitmaps to track dirty memory.  These bitmaps
      are copied to userspace when userspace queries KVM for its dirty page
      information.  The use of bitmaps is mostly sufficient for live
      migration, as large parts of memory are dirtied from one log-dirty
      pass to another.  However, in a checkpointing system, the number of
      dirty pages is small and in fact it is often bounded---the VM is
      paused when it has dirtied a pre-defined number of pages. Traversing a
      large, sparsely populated bitmap to find set bits is time-consuming,
      as is copying the bitmap to user-space.
      
      A similar issue exists for live migration when the guest memory is huge
      while only a few pages are dirtied.  In that case, for each dirty sync
      we need to pull the whole dirty bitmap to userspace and analyse every
      bit even though it is mostly zeros.
      
      The preferred data structure for above scenarios is a dense list of
      guest frame numbers (GFN).  This patch series stores the dirty list in
      kernel memory that can be memory mapped into userspace to allow speedy
      harvesting.
      
      This patch enables dirty ring for X86 only.  However it should be
      easily extended to other archs as well.
      
      [1] https://patchwork.kernel.org/patch/10471409/
      Signed-off-by: Lei Cao <lei.cao@stratus.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012222.5767-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
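      A hedged userspace sketch of harvesting the ring (mainline uapi names;
      vm_fd, vcpu_fd, 'entries' and 'page_size' are assumed to be set up by
      the caller, the ring would normally be mapped once rather than per
      call, and acquire/release ordering is elided for brevity):

        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <linux/kvm.h>

        static void harvest_dirty_ring(int vm_fd, int vcpu_fd,
                                       uint32_t entries, long page_size)
        {
                static uint32_t next;   /* ring read index, kept by userspace */
                struct kvm_dirty_gfn *ring, *e;

                ring = mmap(NULL, entries * sizeof(*ring),
                            PROT_READ | PROT_WRITE, MAP_SHARED, vcpu_fd,
                            KVM_DIRTY_LOG_PAGE_OFFSET * page_size);

                for (e = &ring[next % entries]; e->flags & KVM_DIRTY_GFN_F_DIRTY;
                     e = &ring[++next % entries])
                        /* slot e->slot, frame e->offset is dirty */
                        e->flags |= KVM_DIRTY_GFN_F_RESET;  /* mark harvested */

                ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);     /* let KVM recycle */
        }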
      fb04a1ed
    • KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] · ff5a983c
      Authored by Peter Xu
      Originally, we have three code paths that can dirty a page without
      vcpu context for X86:
      
        - init_rmode_identity_map
        - init_rmode_tss
        - kvmgt_rw_gpa
      
      init_rmode_identity_map and init_rmode_tss will be set up on the
      destination VM no matter what (and the guest cannot even see them), so
      it does not make sense to track them at all.
      
      To do this, allow __x86_set_memory_region() to return the userspace
      address that was just allocated to the caller.  Then in both of the
      functions we directly write to the userspace address instead of
      calling kvm_write_*() APIs.
      
      Another trivial change is that we don't need to explicitly clear the
      identity page table root in init_rmode_identity_map() because no
      matter what we'll write to the whole page with 4M huge page entries.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012044.5151-4-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ff5a983c
    • KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook · c2fe3cd4
      Authored by Sean Christopherson
      Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
      and invoke the new hook from kvm_valid_cr4().  This fixes an issue where
      KVM_SET_SREGS would return success while failing to actually set CR4.
      
      Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
      in __set_sregs() is not a viable option as KVM has already stuffed a
      variety of vCPU state.
      
      Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
      inverted semantics.  This will be remedied in a future patch.
      
      Fixes: 5e1746d6 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c2fe3cd4
  6. 13 Nov 2020, 1 commit
    • KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch · 0107973a
      Authored by Babu Moger
      SEV guests fail to boot on a system that supports the PCID feature.
      
      While emulating the RSM instruction, KVM reads the guest CR3
      and calls kvm_set_cr3(). If the vCPU is in long mode,
      kvm_set_cr3() does a sanity check on the CR3 value. In this case,
      it validates whether the value has any reserved bits set. The
      reserved bit range is 63:cpuid_maxphyaddr(). When AMD memory
      encryption is enabled, the memory encryption bit is set in the CR3
      value. The memory encryption bit may fall within the KVM reserved
      bit range, causing a KVM emulation failure.
      
      Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will
      cache the reserved bits in the CR3 value. This will be initialized
      to rsvd_bits(cpuid_maxphyaddr(vcpu), 63).
      
      If the architecture has any special bits (like the AMD SEV encryption
      bit) that need to be masked from the reserved bits, they should be
      cleared in the vendor-specific kvm_x86_ops.vcpu_after_set_cpuid handler.
      
      Fixes: a780a3ea ("KVM: X86: Fix reserved bits check for MOV to CR3")
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
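      A hedged sketch of the flow, simplified from the patch:

        struct kvm_cpuid_entry2 *best;

        /* kvm_vcpu_after_set_cpuid(): cache the generic reserved-bit mask. */
        vcpu->arch.cr3_lm_rsvd_bits = rsvd_bits(cpuid_maxphyaddr(vcpu), 63);

        /* svm_vcpu_after_set_cpuid(): unreserve the SEV C-bit, whose
         * position is reported in CPUID 0x8000001F EBX[5:0]. */
        best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);
        if (sev_guest(vcpu->kvm) && best)
                vcpu->arch.cr3_lm_rsvd_bits &= ~(1UL << (best->ebx & 0x3f));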
      0107973a
  7. 23 Oct 2020, 1 commit
  8. 22 Oct 2020, 6 commits
  9. 28 Sep 2020, 7 commits
    • KVM: x86: rename KVM_REQ_GET_VMCS12_PAGES · 729c15c2
      Authored by Paolo Bonzini
      We are going to use it for SVM too, so use a more generic name.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      729c15c2
    • KVM: x86: Introduce MSR filtering · 1a155254
      Authored by Alexander Graf
      It's not desirable to have all MSRs always handled by KVM kernel space.
      Some MSRs would be useful to handle in user space, either to emulate
      behavior (like uCode updates) or to decide whether they are valid based
      on the CPU model.
      
      To allow user space to specify which MSRs it wants to see handled by KVM,
      this patch introduces a new ioctl to push filter rules with bitmaps into
      KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
      With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
      denied MSR events to user space to operate on.
      
      If no filter is populated, MSR handling stays identical to before.
      Signed-off-by: Alexander Graf <graf@amazon.com>

      Message-Id: <20200925143422.21718-8-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
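      A hedged userspace sketch (mainline uapi names; this assumes the
      semantics where a zero bit in a covered range denies the access):

        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Deny guest writes to a 0x2000-MSR range starting at 0xc0010000. */
        static int deny_msr_writes(int vm_fd)
        {
                static __u8 bitmap[0x2000 / 8];         /* all zero = deny */
                struct kvm_msr_filter filter;

                memset(&filter, 0, sizeof(filter));
                filter.flags = KVM_MSR_FILTER_DEFAULT_ALLOW;
                filter.ranges[0].flags = KVM_MSR_FILTER_WRITE;
                filter.ranges[0].base = 0xc0010000;
                filter.ranges[0].nmsrs = 0x2000;        /* bits in the bitmap */
                filter.ranges[0].bitmap = bitmap;
                return ioctl(vm_fd, KVM_X86_SET_MSR_FILTER, &filter);
        }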
      1a155254
    • KVM: x86: Add infrastructure for MSR filtering · 51de8151
      Authored by Alexander Graf
      In the following commits we will add pieces of MSR filtering.
      To ensure that code compiles even with the feature half-merged, let's add
      a few stubs and struct definitions before the real patches start.
      Signed-off-by: Alexander Graf <graf@amazon.com>

      Message-Id: <20200925143422.21718-4-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      51de8151
    • KVM: x86: Allow deflecting unknown MSR accesses to user space · 1ae09954
      Authored by Alexander Graf
      MSRs are weird. Some of them are normal control registers, such as EFER.
      Some however are registers that really are model specific, not very
      interesting to virtualization workloads, and not performance critical.
      Others again are really just windows into package configuration.
      
      Out of these MSRs, only the first category is necessary to implement in
      kernel space. Rarely accessed MSRs, MSRs that should be fine-tuned
      against certain CPU models, and MSRs that contain information on the
      package level are much better suited for user space to process. However,
      over time we have accumulated a lot of MSRs that are not in the first
      category, but are still handled by in-kernel KVM code.
      
      This patch adds a generic interface to handle WRMSR and RDMSR from user
      space. With this, any future MSR that is part of the latter categories can
      be handled in user space.
      
      Furthermore, it allows us to replace the existing "ignore_msrs" logic with
      something that applies per-VM rather than on the full system. That way you
      can run productive VMs in parallel to experimental ones where you don't care
      about proper MSR handling.
      Signed-off-by: Alexander Graf <graf@amazon.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>

      Message-Id: <20200925143422.21718-3-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
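      A hedged userspace sketch of the interface (mainline names;
      emulate_rdmsr() is a hypothetical VMM helper, and 'run' is the vCPU's
      mmap'ed struct kvm_run):

        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_X86_USER_SPACE_MSR,
                .args = { KVM_MSR_EXIT_REASON_UNKNOWN },
        };
        ioctl(vm_fd, KVM_ENABLE_CAP, &cap);     /* deflect unknown MSRs */

        /* Then, after KVM_RUN: */
        if (run->exit_reason == KVM_EXIT_X86_RDMSR) {
                run->msr.data = emulate_rdmsr(run->msr.index);
                run->msr.error = 0;     /* set to 1 to make KVM inject #GP */
        }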
      1ae09954
    • KVM: x86: Rename "shared_msrs" to "user_return_msrs" · 7e34fbd0
      Authored by Sean Christopherson
      Rename the "shared_msrs" mechanism, which is used to defer restoring
      MSRs that are only consumed when running in userspace, to a more banal
      but less likely to be confusing "user_return_msrs".
      
      The "shared" nomenclature is confusing as it's not obvious who is
      sharing what, e.g. reasonable interpretations are that the guest value
      is shared by vCPUs in a VM, or that the MSR value is shared/common to
      guest and host, both of which are wrong.
      
      "shared" is also misleading as the MSR value (in hardware) is not
      guaranteed to be shared/reused between VMs (if that's indeed the correct
      interpretation of the name), as the ability to share values between VMs
      is simply a side effect (albeit a very nice side effect) of deferring
      restoration of the host value until returning from userspace.
      
      "user_return" avoids the above confusion by describing the mechanism
      itself instead of its effects.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923180409.32255-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7e34fbd0
    • KVM: x86: Add intr/vectoring info and error code to kvm_exit tracepoint · 235ba74f
      Authored by Sean Christopherson
      Extend the kvm_exit tracepoint to align it with kvm_nested_vmexit in
      terms of what information is captured.  On SVM, add interrupt info and
      error code, while on VMX add IDT vectoring and error code.  This
      sets the stage for macrofying the kvm_exit tracepoint definition so that
      it can be reused for kvm_nested_vmexit without loss of information.
      
      Opportunistically stuff a zero for VM_EXIT_INTR_INFO if the VM-Enter
      failed, as the field is guaranteed to be invalid.  Note, it'd be
      possible to further filter the interrupt/exception fields based on the
      VM-Exit reason, but the helper is intended only for tracepoints, i.e.
      an extra VMREAD or two is a non-issue, the failed VM-Enter case is just
      low hanging fruit.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923201349.16097-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      235ba74f
    • KVM: x86: Add kvm_x86_ops hook to short circuit emulation · 09e3e2a1
      Authored by Sean Christopherson
      Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a
      more generic is_emulatable(), and unconditionally call the new function
      in x86_emulate_instruction().
      
      KVM will use the generic hook to support multiple security related
      technologies that prevent emulation in one way or another.  Similar to
      the existing AMD #NPF case where emulation of the current instruction is
      not possible due to lack of information, AMD's SEV-ES and Intel's SGX
      and TDX will introduce scenarios where emulation is impossible due to
      the guest's register state being inaccessible.  And again similar to the
      existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(),
      i.e. outside of the control of vendor-specific code.
      
      While the cause and architecturally visible behavior of the various
      cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean
      resume or complete shutdown, and SEV-ES and TDX "return" an error, the
      impact on the common emulation code is identical: KVM must stop
      emulation immediately and resume the guest.
      
      Query is_emulatable() in handle_ud() as well so that the
      force_emulation_prefix code doesn't incorrectly modify RIP before
      calling emulate_instruction() in the absurdly unlikely scenario that
      KVM encounters forced emulation in conjunction with "do not emulate".
      
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      09e3e2a1
  10. 22 Aug 2020, 1 commit
    • KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() · fdfe7cbd
      Authored by Will Deacon
      The 'flags' field of 'struct mmu_notifier_range' is used to indicate
      whether invalidate_range_{start,end}() are permitted to block. In the
      case of kvm_mmu_notifier_invalidate_range_start(), this field is not
      forwarded on to the architecture-specific implementation of
      kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
      whether or not to block.
      
      Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
      architectures are aware as to whether or not they are permitted to block.
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-2-will@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
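      A hedged sketch of the resulting shape (the exact per-arch spelling
      may differ):

        int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start,
                                unsigned long end, unsigned flags);

        /* The notifier forwards range->flags, so a backend can now do: */
        if (!(flags & MMU_NOTIFIER_RANGE_BLOCKABLE)) {
                /* must not sleep: take a non-blocking path or defer */
        }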
      fdfe7cbd