提交 63fc9c23 编写于 作者: L Linus Torvalds

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "A collection of x86 and ARM bugfixes, and some improvements to
  documentation.

  On top of this, a cleanup of kvm_para.h headers, which were exported
  by some architectures even though they not support KVM at all. This is
  responsible for all the Kbuild changes in the diffstat"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
  Documentation: kvm: clarify KVM_SET_USER_MEMORY_REGION
  KVM: doc: Document the life cycle of a VM and its resources
  KVM: selftests: complete IO before migrating guest state
  KVM: selftests: disable stack protector for all KVM tests
  KVM: selftests: explicitly disable PIE for tests
  KVM: selftests: assert on exit reason in CR4/cpuid sync test
  KVM: x86: update %rip after emulating IO
  x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init
  kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs
  KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts
  kvm: don't redefine flags as something else
  kvm: mmu: Used range based flushing in slot_handle_level_range
  KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iif KVM is supported
  KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region()
  kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields
  KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)
  KVM: Reject device ioctls from processes other than the VM's creator
  KVM: doc: Fix incorrect word ordering regarding supported use of APIs
  KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'
  KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT
  ...
......@@ -5,25 +5,32 @@ The Definitive KVM (Kernel-based Virtual Machine) API Documentation
----------------------
The kvm API is a set of ioctls that are issued to control various aspects
of a virtual machine. The ioctls belong to three classes
of a virtual machine. The ioctls belong to three classes:
- System ioctls: These query and set global attributes which affect the
whole kvm subsystem. In addition a system ioctl is used to create
virtual machines
virtual machines.
- VM ioctls: These query and set attributes that affect an entire virtual
machine, for example memory layout. In addition a VM ioctl is used to
create virtual cpus (vcpus).
create virtual cpus (vcpus) and devices.
Only run VM ioctls from the same process (address space) that was used
to create the VM.
VM ioctls must be issued from the same process (address space) that was
used to create the VM.
- vcpu ioctls: These query and set attributes that control the operation
of a single virtual cpu.
Only run vcpu ioctls from the same thread that was used to create the
vcpu.
vcpu ioctls should be issued from the same thread that was used to create
the vcpu, except for asynchronous vcpu ioctl that are marked as such in
the documentation. Otherwise, the first ioctl after switching threads
could see a performance impact.
- device ioctls: These query and set attributes that control the operation
of a single device.
device ioctls must be issued from the same process (address space) that
was used to create the VM.
2. File descriptors
-------------------
......@@ -32,17 +39,34 @@ The kvm API is centered around file descriptors. An initial
open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
handle will create a VM file descriptor which can be used to issue VM
ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu
and return a file descriptor pointing to it. Finally, ioctls on a vcpu
fd can be used to control the vcpu, including the important task of
actually running guest code.
ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will
create a virtual cpu or device and return a file descriptor pointing to
the new resource. Finally, ioctls on a vcpu or device fd can be used
to control the vcpu or device. For vcpus, this includes the important
task of actually running guest code.
In general file descriptors can be migrated among processes by means
of fork() and the SCM_RIGHTS facility of unix domain socket. These
kinds of tricks are explicitly not supported by kvm. While they will
not cause harm to the host, their actual behavior is not guaranteed by
the API. The only supported use is one virtual machine per process,
and one vcpu per thread.
the API. See "General description" for details on the ioctl usage
model that is supported by KVM.
It is important to note that althought VM ioctls may only be issued from
the process that created the VM, a VM's lifecycle is associated with its
file descriptor, not its creator (process). In other words, the VM and
its resources, *including the associated address space*, are not freed
until the last reference to the VM's file descriptor has been released.
For example, if fork() is issued after ioctl(KVM_CREATE_VM), the VM will
not be freed until both the parent (original) process and its child have
put their references to the VM's file descriptor.
Because a VM's resources are not freed until the last reference to its
file descriptor is released, creating additional references to a VM via
via fork(), dup(), etc... without careful consideration is strongly
discouraged and may have unwanted side effects, e.g. memory allocated
by and on behalf of the VM's process may not be freed/unaccounted when
the VM is shut down.
It is important to note that althought VM ioctls may only be issued from
......@@ -515,11 +539,15 @@ c) KVM_INTERRUPT_SET_LEVEL
Note that any value for 'irq' other than the ones stated above is invalid
and incurs unexpected behavior.
This is an asynchronous vcpu ioctl and can be invoked from any thread.
MIPS:
Queues an external interrupt to be injected into the virtual CPU. A negative
interrupt number dequeues the interrupt.
This is an asynchronous vcpu ioctl and can be invoked from any thread.
4.17 KVM_DEBUG_GUEST
......@@ -1086,14 +1114,12 @@ struct kvm_userspace_memory_region {
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
This ioctl allows the user to create or modify a guest physical memory
slot. When changing an existing slot, it may be moved in the guest
physical memory space, or its flags may be modified. It may not be
resized. Slots may not overlap in guest physical address space.
Bits 0-15 of "slot" specifies the slot id and this value should be
less than the maximum number of user memory slots supported per VM.
The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS,
if this capability is supported by the architecture.
This ioctl allows the user to create, modify or delete a guest physical
memory slot. Bits 0-15 of "slot" specify the slot id and this value
should be less than the maximum number of user memory slots supported per
VM. The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS,
if this capability is supported by the architecture. Slots may not
overlap in guest physical address space.
If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of "slot"
specifies the address space which is being modified. They must be
......@@ -1102,6 +1128,10 @@ KVM_CAP_MULTI_ADDRESS_SPACE capability. Slots in separate address spaces
are unrelated; the restriction on overlapping slots only applies within
each address space.
Deleting a slot is done by passing zero for memory_size. When changing
an existing slot, it may be moved in the guest physical memory space,
or its flags may be modified, but it may not be resized.
Memory for the region is taken starting at the address denoted by the
field userspace_addr, which must point at user addressable memory for
the entire memory slot size. Any object may back this memory, including
......@@ -2493,7 +2523,7 @@ KVM_S390_MCHK (vm, vcpu) - machine check interrupt; cr 14 bits in parm,
machine checks needing further payload are not
supported by this ioctl)
Note that the vcpu ioctl is asynchronous to vcpu execution.
This is an asynchronous vcpu ioctl and can be invoked from any thread.
4.78 KVM_PPC_GET_HTAB_FD
......@@ -3042,8 +3072,7 @@ KVM_S390_INT_EMERGENCY - sigp emergency; parameters in .emerg
KVM_S390_INT_EXTERNAL_CALL - sigp external call; parameters in .extcall
KVM_S390_MCHK - machine check interrupt; parameters in .mchk
Note that the vcpu ioctl is asynchronous to vcpu execution.
This is an asynchronous vcpu ioctl and can be invoked from any thread.
4.94 KVM_S390_GET_IRQ_STATE
......
......@@ -142,7 +142,7 @@ Shadow pages contain the following information:
If clear, this page corresponds to a guest page table denoted by the gfn
field.
role.quadrant:
When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit
When role.gpte_is_8_bytes=0, the guest uses 32-bit gptes while the host uses 64-bit
sptes. That means a guest page table contains more ptes than the host,
so multiple shadow pages are needed to shadow one guest page.
For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
......@@ -158,9 +158,9 @@ Shadow pages contain the following information:
The page is invalid and should not be used. It is a root page that is
currently pinned (by a cpu hardware register pointing to it); once it is
unpinned it will be destroyed.
role.cr4_pae:
Contains the value of cr4.pae for which the page is valid (e.g. whether
32-bit or 64-bit gptes are in use).
role.gpte_is_8_bytes:
Reflects the size of the guest PTE for which the page is valid, i.e. '1'
if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
role.nxe:
Contains the value of efer.nxe for which the page is valid.
role.cr0_wp:
......@@ -173,6 +173,9 @@ Shadow pages contain the following information:
Contains the value of cr4.smap && !cr0.wp for which the page is valid
(pages for which this is true are different from other pages; see the
treatment of cr0.wp=0 below).
role.ept_sp:
This is a virtual flag to denote a shadowed nested EPT page. ept_sp
is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
role.smm:
Is 1 if the page is valid in system management mode. This field
determines which of the kvm_memslots array was used to build this
......
......@@ -6,6 +6,7 @@ generic-y += exec.h
generic-y += export.h
generic-y += fb.h
generic-y += irq_work.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
generic-y += preempt.h
......
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#include <asm-generic/kvm_para.h>
......@@ -11,6 +11,7 @@ generic-y += hardirq.h
generic-y += hw_irq.h
generic-y += irq_regs.h
generic-y += irq_work.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += local64.h
generic-y += mcs_spinlock.h
......
generic-y += kvm_para.h
generic-y += ucontext.h
......@@ -381,6 +381,17 @@ static inline int kvm_read_guest_lock(struct kvm *kvm,
return ret;
}
static inline int kvm_write_guest_lock(struct kvm *kvm, gpa_t gpa,
const void *data, unsigned long len)
{
int srcu_idx = srcu_read_lock(&kvm->srcu);
int ret = kvm_write_guest(kvm, gpa, data, len);
srcu_read_unlock(&kvm->srcu, srcu_idx);
return ret;
}
static inline void *kvm_get_hyp_vector(void)
{
switch(read_cpuid_part()) {
......
......@@ -75,6 +75,8 @@ static inline bool kvm_stage2_has_pud(struct kvm *kvm)
#define S2_PMD_MASK PMD_MASK
#define S2_PMD_SIZE PMD_SIZE
#define S2_PUD_MASK PUD_MASK
#define S2_PUD_SIZE PUD_SIZE
static inline bool kvm_stage2_has_pmd(struct kvm *kvm)
{
......
......@@ -3,3 +3,4 @@
generated-y += unistd-common.h
generated-y += unistd-oabi.h
generated-y += unistd-eabi.h
generic-y += kvm_para.h
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#include <asm-generic/kvm_para.h>
......@@ -445,6 +445,17 @@ static inline int kvm_read_guest_lock(struct kvm *kvm,
return ret;
}
static inline int kvm_write_guest_lock(struct kvm *kvm, gpa_t gpa,
const void *data, unsigned long len)
{
int srcu_idx = srcu_read_lock(&kvm->srcu);
int ret = kvm_write_guest(kvm, gpa, data, len);
srcu_read_unlock(&kvm->srcu, srcu_idx);
return ret;
}
#ifdef CONFIG_KVM_INDIRECT_VECTORS
/*
* EL2 vectors can be mapped and rerouted in a number of ways,
......
......@@ -123,6 +123,9 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
int ret = -EINVAL;
bool loaded;
/* Reset PMU outside of the non-preemptible section */
kvm_pmu_vcpu_reset(vcpu);
preempt_disable();
loaded = (vcpu->cpu != -1);
if (loaded)
......@@ -170,9 +173,6 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
vcpu->arch.reset_state.reset = false;
}
/* Reset PMU */
kvm_pmu_vcpu_reset(vcpu);
/* Default workaround setup is enabled (if supported) */
if (kvm_arm_have_ssbd() == KVM_SSBD_KERNEL)
vcpu->arch.workaround_flags |= VCPU_WORKAROUND_2_FLAG;
......
......@@ -19,6 +19,7 @@ generic-y += irq_work.h
generic-y += kdebug.h
generic-y += kmap_types.h
generic-y += kprobes.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
......
generic-y += kvm_para.h
generic-y += ucontext.h
......@@ -23,6 +23,7 @@ generic-y += irq_work.h
generic-y += kdebug.h
generic-y += kmap_types.h
generic-y += kprobes.h
generic-y += kvm_para.h
generic-y += linkage.h
generic-y += local.h
generic-y += local64.h
......
generic-y += kvm_para.h
generic-y += ucontext.h
......@@ -19,6 +19,7 @@ generic-y += irq_work.h
generic-y += kdebug.h
generic-y += kmap_types.h
generic-y += kprobes.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += local64.h
generic-y += mcs_spinlock.h
......
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#include <asm-generic/kvm_para.h>
......@@ -2,6 +2,7 @@ generated-y += syscall_table.h
generic-y += compat.h
generic-y += exec.h
generic-y += irq_work.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
generic-y += preempt.h
......
generated-y += unistd_64.h
generic-y += kvm_para.h
......@@ -13,6 +13,7 @@ generic-y += irq_work.h
generic-y += kdebug.h
generic-y += kmap_types.h
generic-y += kprobes.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += local64.h
generic-y += mcs_spinlock.h
......
generated-y += unistd_32.h
generic-y += kvm_para.h
......@@ -17,6 +17,7 @@ generic-y += irq_work.h
generic-y += kdebug.h
generic-y += kmap_types.h
generic-y += kprobes.h
generic-y += kvm_para.h
generic-y += linkage.h
generic-y += local.h
generic-y += local64.h
......
generated-y += unistd_32.h
generic-y += kvm_para.h
generic-y += ucontext.h
......@@ -23,6 +23,7 @@ generic-y += irq_work.h
generic-y += kdebug.h
generic-y += kmap_types.h
generic-y += kprobes.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
......
generic-y += kvm_para.h
generic-y += ucontext.h
......@@ -20,6 +20,7 @@ generic-y += irq_work.h
generic-y += kdebug.h
generic-y += kmap_types.h
generic-y += kprobes.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
......
generic-y += kvm_para.h
generic-y += ucontext.h
......@@ -11,6 +11,7 @@ generic-y += irq_regs.h
generic-y += irq_work.h
generic-y += kdebug.h
generic-y += kprobes.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += local64.h
generic-y += mcs_spinlock.h
......
generated-y += unistd_32.h
generated-y += unistd_64.h
generic-y += kvm_para.h
......@@ -9,6 +9,7 @@ generic-y += emergency-restart.h
generic-y += exec.h
generic-y += irq_regs.h
generic-y += irq_work.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += local64.h
generic-y += mcs_spinlock.h
......
# SPDX-License-Identifier: GPL-2.0
generated-y += unistd_32.h
generic-y += kvm_para.h
generic-y += ucontext.h
......@@ -9,6 +9,7 @@ generic-y += exec.h
generic-y += export.h
generic-y += irq_regs.h
generic-y += irq_work.h
generic-y += kvm_para.h
generic-y += linkage.h
generic-y += local.h
generic-y += local64.h
......
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#include <asm-generic/kvm_para.h>
......@@ -18,6 +18,7 @@ generic-y += irq_work.h
generic-y += kdebug.h
generic-y += kmap_types.h
generic-y += kprobes.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
......
generic-y += kvm_para.h
generic-y += ucontext.h
......@@ -253,14 +253,14 @@ struct kvm_mmu_memory_cache {
* kvm_memory_slot.arch.gfn_track which is 16 bits, so the role bits used
* by indirect shadow page can not be more than 15 bits.
*
* Currently, we used 14 bits that are @level, @cr4_pae, @quadrant, @access,
* Currently, we used 14 bits that are @level, @gpte_is_8_bytes, @quadrant, @access,
* @nxe, @cr0_wp, @smep_andnot_wp and @smap_andnot_wp.
*/
union kvm_mmu_page_role {
u32 word;
struct {
unsigned level:4;
unsigned cr4_pae:1;
unsigned gpte_is_8_bytes:1;
unsigned quadrant:2;
unsigned direct:1;
unsigned access:3;
......@@ -350,6 +350,7 @@ struct kvm_mmu_page {
};
struct kvm_pio_request {
unsigned long linear_rip;
unsigned long count;
int in;
int port;
......@@ -568,6 +569,7 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;
u64 ia32_xss;
u64 microcode_version;
u64 arch_capabilities;
/*
* Paging state of the vcpu
......@@ -1192,6 +1194,8 @@ struct kvm_x86_ops {
int (*nested_enable_evmcs)(struct kvm_vcpu *vcpu,
uint16_t *vmcs_version);
uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
};
struct kvm_arch_async_pf {
......@@ -1252,7 +1256,7 @@ void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
gfn_t gfn_offset, unsigned long mask);
void kvm_mmu_zap_all(struct kvm *kvm);
void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen);
unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
unsigned int kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm);
void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
......
......@@ -526,7 +526,9 @@ static int stimer_set_config(struct kvm_vcpu_hv_stimer *stimer, u64 config,
new_config.enable = 0;
stimer->config.as_uint64 = new_config.as_uint64;
stimer_mark_pending(stimer, false);
if (stimer->config.enable)
stimer_mark_pending(stimer, false);
return 0;
}
......@@ -542,7 +544,10 @@ static int stimer_set_count(struct kvm_vcpu_hv_stimer *stimer, u64 count,
stimer->config.enable = 0;
else if (stimer->config.auto_enable)
stimer->config.enable = 1;
stimer_mark_pending(stimer, false);
if (stimer->config.enable)
stimer_mark_pending(stimer, false);
return 0;
}
......
......@@ -182,7 +182,7 @@ struct kvm_shadow_walk_iterator {
static const union kvm_mmu_page_role mmu_base_role_mask = {
.cr0_wp = 1,
.cr4_pae = 1,
.gpte_is_8_bytes = 1,
.nxe = 1,
.smep_andnot_wp = 1,
.smap_andnot_wp = 1,
......@@ -2205,6 +2205,7 @@ static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list);
#define for_each_valid_sp(_kvm, _sp, _gfn) \
hlist_for_each_entry(_sp, \
&(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
......@@ -2215,12 +2216,17 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
for_each_valid_sp(_kvm, _sp, _gfn) \
if ((_sp)->gfn != (_gfn) || (_sp)->role.direct) {} else
static inline bool is_ept_sp(struct kvm_mmu_page *sp)
{
return sp->role.cr0_wp && sp->role.smap_andnot_wp;
}
/* @sp->gfn should be write-protected at the call site */
static bool __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
struct list_head *invalid_list)
{
if (sp->role.cr4_pae != !!is_pae(vcpu)
|| vcpu->arch.mmu->sync_page(vcpu, sp) == 0) {
if ((!is_ept_sp(sp) && sp->role.gpte_is_8_bytes != !!is_pae(vcpu)) ||
vcpu->arch.mmu->sync_page(vcpu, sp) == 0) {
kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
return false;
}
......@@ -2423,7 +2429,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
role.level = level;
role.direct = direct;
if (role.direct)
role.cr4_pae = 0;
role.gpte_is_8_bytes = true;
role.access = access;
if (!vcpu->arch.mmu->direct_map
&& vcpu->arch.mmu->root_level <= PT32_ROOT_LEVEL) {
......@@ -4794,7 +4800,6 @@ static union kvm_mmu_role kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu,
role.base.access = ACC_ALL;
role.base.nxe = !!is_nx(vcpu);
role.base.cr4_pae = !!is_pae(vcpu);
role.base.cr0_wp = is_write_protection(vcpu);
role.base.smm = is_smm(vcpu);
role.base.guest_mode = is_guest_mode(vcpu);
......@@ -4815,6 +4820,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, bool base_only)
role.base.ad_disabled = (shadow_accessed_mask == 0);
role.base.level = kvm_x86_ops->get_tdp_level(vcpu);
role.base.direct = true;
role.base.gpte_is_8_bytes = true;
return role;
}
......@@ -4879,6 +4885,7 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu, bool base_only)
role.base.smap_andnot_wp = role.ext.cr4_smap &&
!is_write_protection(vcpu);
role.base.direct = !is_paging(vcpu);
role.base.gpte_is_8_bytes = !!is_pae(vcpu);
if (!is_long_mode(vcpu))
role.base.level = PT32E_ROOT_LEVEL;
......@@ -4918,18 +4925,26 @@ static union kvm_mmu_role
kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty,
bool execonly)
{
union kvm_mmu_role role;
union kvm_mmu_role role = {0};
/* Base role is inherited from root_mmu */
role.base.word = vcpu->arch.root_mmu.mmu_role.base.word;
role.ext = kvm_calc_mmu_role_ext(vcpu);
/* SMM flag is inherited from root_mmu */
role.base.smm = vcpu->arch.root_mmu.mmu_role.base.smm;
role.base.level = PT64_ROOT_4LEVEL;
role.base.gpte_is_8_bytes = true;
role.base.direct = false;
role.base.ad_disabled = !accessed_dirty;
role.base.guest_mode = true;
role.base.access = ACC_ALL;
/*
* WP=1 and NOT_WP=1 is an impossible combination, use WP and the
* SMAP variation to denote shadow EPT entries.
*/
role.base.cr0_wp = true;
role.base.smap_andnot_wp = true;
role.ext = kvm_calc_mmu_role_ext(vcpu);
role.ext.execonly = execonly;
return role;
......@@ -5179,7 +5194,7 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
gpa, bytes, sp->role.word);
offset = offset_in_page(gpa);
pte_size = sp->role.cr4_pae ? 8 : 4;
pte_size = sp->role.gpte_is_8_bytes ? 8 : 4;
/*
* Sometimes, the OS only writes the last one bytes to update status
......@@ -5203,7 +5218,7 @@ static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
page_offset = offset_in_page(gpa);
level = sp->role.level;
*nspte = 1;
if (!sp->role.cr4_pae) {
if (!sp->role.gpte_is_8_bytes) {
page_offset <<= 1; /* 32->64 */
/*
* A 32-bit pde maps 4MB while the shadow pdes map
......@@ -5393,10 +5408,12 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
* This can happen if a guest gets a page-fault on data access but the HW
* table walker is not able to read the instruction page (e.g instruction
* page is not present in memory). In those cases we simply restart the
* guest.
* guest, with the exception of AMD Erratum 1096 which is unrecoverable.
*/
if (unlikely(insn && !insn_len))
return 1;
if (unlikely(insn && !insn_len)) {
if (!kvm_x86_ops->need_emulation_on_page_fault(vcpu))
return 1;
}
er = x86_emulate_instruction(vcpu, cr2, emulation_type, insn, insn_len);
......@@ -5509,7 +5526,9 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
if (flush && lock_flush_tlb) {
kvm_flush_remote_tlbs(kvm);
kvm_flush_remote_tlbs_with_address(kvm,
start_gfn,
iterator.gfn - start_gfn + 1);
flush = false;
}
cond_resched_lock(&kvm->mmu_lock);
......@@ -5517,7 +5536,8 @@ slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
}
if (flush && lock_flush_tlb) {
kvm_flush_remote_tlbs(kvm);
kvm_flush_remote_tlbs_with_address(kvm, start_gfn,
end_gfn - start_gfn + 1);
flush = false;
}
......@@ -6011,7 +6031,7 @@ int kvm_mmu_module_init(void)
/*
* Calculate mmu pages needed for kvm.
*/
unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm)
unsigned int kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm)
{
unsigned int nr_mmu_pages;
unsigned int nr_pages = 0;
......
......@@ -29,10 +29,10 @@
\
role.word = __entry->role; \
\
trace_seq_printf(p, "sp gfn %llx l%u%s q%u%s %s%s" \
trace_seq_printf(p, "sp gfn %llx l%u %u-byte q%u%s %s%s" \
" %snxe %sad root %u %s%c", \
__entry->gfn, role.level, \
role.cr4_pae ? " pae" : "", \
role.gpte_is_8_bytes ? 8 : 4, \
role.quadrant, \
role.direct ? " direct" : "", \
access_str[role.access], \
......
......@@ -7098,6 +7098,36 @@ static int nested_enable_evmcs(struct kvm_vcpu *vcpu,
return -ENODEV;
}
static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
{
bool is_user, smap;
is_user = svm_get_cpl(vcpu) == 3;
smap = !kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
/*
* Detect and workaround Errata 1096 Fam_17h_00_0Fh
*
* In non SEV guest, hypervisor will be able to read the guest
* memory to decode the instruction pointer when insn_len is zero
* so we return true to indicate that decoding is possible.
*
* But in the SEV guest, the guest memory is encrypted with the
* guest specific key and hypervisor will not be able to decode the
* instruction pointer so we will not able to workaround it. Lets
* print the error and request to kill the guest.
*/
if (is_user && smap) {
if (!sev_guest(vcpu->kvm))
return true;
pr_err_ratelimited("KVM: Guest triggered AMD Erratum 1096\n");
kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
}
return false;
}
static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
.cpu_has_kvm_support = has_svm,
.disabled_by_bios = is_disabled,
......@@ -7231,6 +7261,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
.nested_enable_evmcs = nested_enable_evmcs,
.nested_get_evmcs_version = nested_get_evmcs_version,
.need_emulation_on_page_fault = svm_need_emulation_on_page_fault,
};
static int __init svm_init(void)
......
......@@ -2585,6 +2585,11 @@ static int nested_check_host_control_regs(struct kvm_vcpu *vcpu,
!nested_host_cr4_valid(vcpu, vmcs12->host_cr4) ||
!nested_cr3_valid(vcpu, vmcs12->host_cr3))
return -EINVAL;
if (is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu) ||
is_noncanonical_address(vmcs12->host_ia32_sysenter_eip, vcpu))
return -EINVAL;
/*
* If the load IA32_EFER VM-exit control is 1, bits reserved in the
* IA32_EFER MSR must be 0 in the field for that register. In addition,
......
......@@ -1683,12 +1683,6 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = to_vmx(vcpu)->spec_ctrl;
break;
case MSR_IA32_ARCH_CAPABILITIES:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
return 1;
msr_info->data = to_vmx(vcpu)->arch_capabilities;
break;
case MSR_IA32_SYSENTER_CS:
msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
break;
......@@ -1895,11 +1889,6 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
MSR_TYPE_W);
break;
case MSR_IA32_ARCH_CAPABILITIES:
if (!msr_info->host_initiated)
return 1;
vmx->arch_capabilities = data;
break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
......@@ -4088,8 +4077,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
++vmx->nmsrs;
}
vmx->arch_capabilities = kvm_get_arch_capabilities();
vm_exit_controls_init(vmx, vmx_vmexit_ctrl());
/* 22.2.1, 20.8.1 */
......@@ -7409,6 +7396,11 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
return 0;
}
static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
{
return 0;
}
static __init int hardware_setup(void)
{
unsigned long host_bndcfgs;
......@@ -7711,6 +7703,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
.set_nested_state = NULL,
.get_vmcs12_pages = NULL,
.nested_enable_evmcs = NULL,
.need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
};
static void vmx_cleanup_l1d_flush(void)
......
......@@ -190,7 +190,6 @@ struct vcpu_vmx {
u64 msr_guest_kernel_gs_base;
#endif
u64 arch_capabilities;
u64 spec_ctrl;
u32 vm_entry_controls_shadow;
......
......@@ -1125,7 +1125,7 @@ static u32 msrs_to_save[] = {
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES,
MSR_IA32_SPEC_CTRL,
MSR_IA32_RTIT_CTL, MSR_IA32_RTIT_STATUS, MSR_IA32_RTIT_CR3_MATCH,
MSR_IA32_RTIT_OUTPUT_BASE, MSR_IA32_RTIT_OUTPUT_MASK,
MSR_IA32_RTIT_ADDR0_A, MSR_IA32_RTIT_ADDR0_B,
......@@ -1158,6 +1158,7 @@ static u32 emulated_msrs[] = {
MSR_IA32_TSC_ADJUST,
MSR_IA32_TSCDEADLINE,
MSR_IA32_ARCH_CAPABILITIES,
MSR_IA32_MISC_ENABLE,
MSR_IA32_MCG_STATUS,
MSR_IA32_MCG_CTL,
......@@ -2443,6 +2444,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (msr_info->host_initiated)
vcpu->arch.microcode_version = data;
break;
case MSR_IA32_ARCH_CAPABILITIES:
if (!msr_info->host_initiated)
return 1;
vcpu->arch.arch_capabilities = data;
break;
case MSR_EFER:
return set_efer(vcpu, data);
case MSR_K7_HWCR:
......@@ -2747,6 +2753,12 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_UCODE_REV:
msr_info->data = vcpu->arch.microcode_version;
break;
case MSR_IA32_ARCH_CAPABILITIES:
if (!msr_info->host_initiated &&
!guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
return 1;
msr_info->data = vcpu->arch.arch_capabilities;
break;
case MSR_IA32_TSC:
msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + vcpu->arch.tsc_offset;
break;
......@@ -6523,14 +6535,27 @@ int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
}
EXPORT_SYMBOL_GPL(kvm_emulate_instruction_from_buffer);
static int complete_fast_pio_out(struct kvm_vcpu *vcpu)
{
vcpu->arch.pio.count = 0;
if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip)))
return 1;
return kvm_skip_emulated_instruction(vcpu);
}
static int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size,
unsigned short port)
{
unsigned long val = kvm_register_read(vcpu, VCPU_REGS_RAX);
int ret = emulator_pio_out_emulated(&vcpu->arch.emulate_ctxt,
size, port, &val, 1);
/* do not return to emulator after return from userspace */
vcpu->arch.pio.count = 0;
if (!ret) {
vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);
vcpu->arch.complete_userspace_io = complete_fast_pio_out;
}
return ret;
}
......@@ -6541,6 +6566,11 @@ static int complete_fast_pio_in(struct kvm_vcpu *vcpu)
/* We should only ever be called with arch.pio.count equal to 1 */
BUG_ON(vcpu->arch.pio.count != 1);
if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip))) {
vcpu->arch.pio.count = 0;
return 1;
}
/* For size less than 4 we merge, else we zero extend */
val = (vcpu->arch.pio.size < 4) ? kvm_register_read(vcpu, VCPU_REGS_RAX)
: 0;
......@@ -6553,7 +6583,7 @@ static int complete_fast_pio_in(struct kvm_vcpu *vcpu)
vcpu->arch.pio.port, &val, 1);
kvm_register_write(vcpu, VCPU_REGS_RAX, val);
return 1;
return kvm_skip_emulated_instruction(vcpu);
}
static int kvm_fast_pio_in(struct kvm_vcpu *vcpu, int size,
......@@ -6572,6 +6602,7 @@ static int kvm_fast_pio_in(struct kvm_vcpu *vcpu, int size,
return ret;
}
vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);
vcpu->arch.complete_userspace_io = complete_fast_pio_in;
return 0;
......@@ -6579,16 +6610,13 @@ static int kvm_fast_pio_in(struct kvm_vcpu *vcpu, int size,
int kvm_fast_pio(struct kvm_vcpu *vcpu, int size, unsigned short port, int in)
{
int ret = kvm_skip_emulated_instruction(vcpu);
int ret;
/*
* TODO: we might be squashing a KVM_GUESTDBG_SINGLESTEP-triggered
* KVM_EXIT_DEBUG here.
*/
if (in)
return kvm_fast_pio_in(vcpu, size, port) && ret;
ret = kvm_fast_pio_in(vcpu, size, port);
else
return kvm_fast_pio_out(vcpu, size, port) && ret;
ret = kvm_fast_pio_out(vcpu, size, port);
return ret && kvm_skip_emulated_instruction(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_fast_pio);
......@@ -8733,6 +8761,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
{
vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
kvm_vcpu_mtrr_init(vcpu);
vcpu_load(vcpu);
......@@ -9429,13 +9458,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
int nr_mmu_pages = 0;
if (!kvm->arch.n_requested_mmu_pages)
nr_mmu_pages = kvm_mmu_calculate_mmu_pages(kvm);
if (nr_mmu_pages)
kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
kvm_mmu_change_mmu_pages(kvm,
kvm_mmu_calculate_default_mmu_pages(kvm));
/*
* Dirty logging tracks sptes in 4k granularity, meaning that large
......
......@@ -15,6 +15,7 @@ generic-y += irq_work.h
generic-y += kdebug.h
generic-y += kmap_types.h
generic-y += kprobes.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += local64.h
generic-y += mcs_spinlock.h
......
generated-y += unistd_32.h
generic-y += kvm_para.h
......@@ -7,5 +7,7 @@ no-export-headers += kvm.h
endif
ifeq ($(wildcard $(srctree)/arch/$(SRCARCH)/include/uapi/asm/kvm_para.h),)
ifeq ($(wildcard $(objtree)/arch/$(SRCARCH)/include/generated/uapi/asm/kvm_para.h),)
no-export-headers += kvm_para.h
endif
endif
......@@ -29,8 +29,8 @@ LIBKVM += $(LIBKVM_$(UNAME_M))
INSTALL_HDR_PATH = $(top_srcdir)/usr
LINUX_HDR_PATH = $(INSTALL_HDR_PATH)/include/
LINUX_TOOL_INCLUDE = $(top_srcdir)/tools/include
CFLAGS += -O2 -g -std=gnu99 -I$(LINUX_TOOL_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude -I$(<D) -Iinclude/$(UNAME_M) -I..
LDFLAGS += -pthread
CFLAGS += -O2 -g -std=gnu99 -fno-stack-protector -fno-PIE -I$(LINUX_TOOL_INCLUDE) -I$(LINUX_HDR_PATH) -Iinclude -I$(<D) -Iinclude/$(UNAME_M) -I..
LDFLAGS += -pthread -no-pie
# After inclusion, $(OUTPUT) is defined and
# $(TEST_GEN_PROGS) starts with $(OUTPUT)/
......
......@@ -102,6 +102,7 @@ vm_paddr_t addr_gva2gpa(struct kvm_vm *vm, vm_vaddr_t gva);
struct kvm_run *vcpu_state(struct kvm_vm *vm, uint32_t vcpuid);
void vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid);
void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid);
void vcpu_set_mp_state(struct kvm_vm *vm, uint32_t vcpuid,
struct kvm_mp_state *mp_state);
void vcpu_regs_get(struct kvm_vm *vm, uint32_t vcpuid, struct kvm_regs *regs);
......
......@@ -1121,6 +1121,22 @@ int _vcpu_run(struct kvm_vm *vm, uint32_t vcpuid)
return rc;
}
void vcpu_run_complete_io(struct kvm_vm *vm, uint32_t vcpuid)
{
struct vcpu *vcpu = vcpu_find(vm, vcpuid);
int ret;
TEST_ASSERT(vcpu != NULL, "vcpu not found, vcpuid: %u", vcpuid);
vcpu->state->immediate_exit = 1;
ret = ioctl(vcpu->fd, KVM_RUN, NULL);
vcpu->state->immediate_exit = 0;
TEST_ASSERT(ret == -1 && errno == EINTR,
"KVM_RUN IOCTL didn't exit immediately, rc: %i, errno: %i",
ret, errno);
}
/*
* VM VCPU Set MP State
*
......
......@@ -87,22 +87,25 @@ int main(int argc, char *argv[])
while (1) {
rc = _vcpu_run(vm, VCPU_ID);
if (run->exit_reason == KVM_EXIT_IO) {
switch (get_ucall(vm, VCPU_ID, &uc)) {
case UCALL_SYNC:
/* emulate hypervisor clearing CR4.OSXSAVE */
vcpu_sregs_get(vm, VCPU_ID, &sregs);
sregs.cr4 &= ~X86_CR4_OSXSAVE;
vcpu_sregs_set(vm, VCPU_ID, &sregs);
break;
case UCALL_ABORT:
TEST_ASSERT(false, "Guest CR4 bit (OSXSAVE) unsynchronized with CPUID bit.");
break;
case UCALL_DONE:
goto done;
default:
TEST_ASSERT(false, "Unknown ucall 0x%x.", uc.cmd);
}
TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
"Unexpected exit reason: %u (%s),\n",
run->exit_reason,
exit_reason_str(run->exit_reason));
switch (get_ucall(vm, VCPU_ID, &uc)) {
case UCALL_SYNC:
/* emulate hypervisor clearing CR4.OSXSAVE */
vcpu_sregs_get(vm, VCPU_ID, &sregs);
sregs.cr4 &= ~X86_CR4_OSXSAVE;
vcpu_sregs_set(vm, VCPU_ID, &sregs);
break;
case UCALL_ABORT:
TEST_ASSERT(false, "Guest CR4 bit (OSXSAVE) unsynchronized with CPUID bit.");
break;
case UCALL_DONE:
goto done;
default:
TEST_ASSERT(false, "Unknown ucall 0x%x.", uc.cmd);
}
}
......
......@@ -134,6 +134,11 @@ int main(int argc, char *argv[])
struct kvm_cpuid_entry2 *entry = kvm_get_supported_cpuid_entry(1);
if (!kvm_check_cap(KVM_CAP_IMMEDIATE_EXIT)) {
fprintf(stderr, "immediate_exit not available, skipping test\n");
exit(KSFT_SKIP);
}
/* Create VM */
vm = vm_create_default(VCPU_ID, 0, guest_code);
vcpu_set_cpuid(vm, VCPU_ID, kvm_get_supported_cpuid());
......@@ -156,8 +161,6 @@ int main(int argc, char *argv[])
stage, run->exit_reason,
exit_reason_str(run->exit_reason));
memset(&regs1, 0, sizeof(regs1));
vcpu_regs_get(vm, VCPU_ID, &regs1);
switch (get_ucall(vm, VCPU_ID, &uc)) {
case UCALL_ABORT:
TEST_ASSERT(false, "%s at %s:%d", (const char *)uc.args[0],
......@@ -176,6 +179,17 @@ int main(int argc, char *argv[])
uc.args[1] == stage, "Unexpected register values vmexit #%lx, got %lx",
stage, (ulong)uc.args[1]);
/*
* When KVM exits to userspace with KVM_EXIT_IO, KVM guarantees
* guest state is consistent only after userspace re-enters the
* kernel with KVM_RUN. Complete IO prior to migrating state
* to a new VM.
*/
vcpu_run_complete_io(vm, VCPU_ID);
memset(&regs1, 0, sizeof(regs1));
vcpu_regs_get(vm, VCPU_ID, &regs1);
state = vcpu_save_state(vm, VCPU_ID);
kvm_vm_release(vm);
......
......@@ -222,7 +222,7 @@ void __hyp_text __vgic_v3_save_state(struct kvm_vcpu *vcpu)
}
}
if (used_lrs) {
if (used_lrs || cpu_if->its_vpe.its_vm) {
int i;
u32 elrsr;
......@@ -247,7 +247,7 @@ void __hyp_text __vgic_v3_restore_state(struct kvm_vcpu *vcpu)
u64 used_lrs = vcpu->arch.vgic_cpu.used_lrs;
int i;
if (used_lrs) {
if (used_lrs || cpu_if->its_vpe.its_vm) {
write_gicreg(cpu_if->vgic_hcr, ICH_HCR_EL2);
for (i = 0; i < used_lrs; i++)
......
......@@ -102,8 +102,7 @@ static bool kvm_is_device_pfn(unsigned long pfn)
* @addr: IPA
* @pmd: pmd pointer for IPA
*
* Function clears a PMD entry, flushes addr 1st and 2nd stage TLBs. Marks all
* pages in the range dirty.
* Function clears a PMD entry, flushes addr 1st and 2nd stage TLBs.
*/
static void stage2_dissolve_pmd(struct kvm *kvm, phys_addr_t addr, pmd_t *pmd)
{
......@@ -121,8 +120,7 @@ static void stage2_dissolve_pmd(struct kvm *kvm, phys_addr_t addr, pmd_t *pmd)
* @addr: IPA
* @pud: pud pointer for IPA
*
* Function clears a PUD entry, flushes addr 1st and 2nd stage TLBs. Marks all
* pages in the range dirty.
* Function clears a PUD entry, flushes addr 1st and 2nd stage TLBs.
*/
static void stage2_dissolve_pud(struct kvm *kvm, phys_addr_t addr, pud_t *pudp)
{
......@@ -899,9 +897,8 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
* kvm_alloc_stage2_pgd - allocate level-1 table for stage-2 translation.
* @kvm: The KVM struct pointer for the VM.
*
* Allocates only the stage-2 HW PGD level table(s) (can support either full
* 40-bit input addresses or limited to 32-bit input addresses). Clears the
* allocated pages.
* Allocates only the stage-2 HW PGD level table(s) of size defined by
* stage2_pgd_size(kvm).
*
* Note we don't need locking here as this is only called when the VM is
* created, which can only be done once.
......@@ -1067,25 +1064,43 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
{
pmd_t *pmd, old_pmd;
retry:
pmd = stage2_get_pmd(kvm, cache, addr);
VM_BUG_ON(!pmd);
old_pmd = *pmd;
/*
* Multiple vcpus faulting on the same PMD entry, can
* lead to them sequentially updating the PMD with the
* same value. Following the break-before-make
* (pmd_clear() followed by tlb_flush()) process can
* hinder forward progress due to refaults generated
* on missing translations.
*
* Skip updating the page table if the entry is
* unchanged.
*/
if (pmd_val(old_pmd) == pmd_val(*new_pmd))
return 0;
if (pmd_present(old_pmd)) {
/*
* Multiple vcpus faulting on the same PMD entry, can
* lead to them sequentially updating the PMD with the
* same value. Following the break-before-make
* (pmd_clear() followed by tlb_flush()) process can
* hinder forward progress due to refaults generated
* on missing translations.
* If we already have PTE level mapping for this block,
* we must unmap it to avoid inconsistent TLB state and
* leaking the table page. We could end up in this situation
* if the memory slot was marked for dirty logging and was
* reverted, leaving PTE level mappings for the pages accessed
* during the period. So, unmap the PTE level mapping for this
* block and retry, as we could have released the upper level
* table in the process.
*
* Skip updating the page table if the entry is
* unchanged.
* Normal THP split/merge follows mmu_notifier callbacks and do
* get handled accordingly.
*/
if (pmd_val(old_pmd) == pmd_val(*new_pmd))
return 0;
if (!pmd_thp_or_huge(old_pmd)) {
unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
goto retry;
}
/*
* Mapping in huge pages should only happen through a
* fault. If a page is merged into a transparent huge
......@@ -1097,8 +1112,7 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
* should become splitting first, unmapped, merged,
* and mapped back in on-demand.
*/
VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
pmd_clear(pmd);
kvm_tlb_flush_vmid_ipa(kvm, addr);
} else {
......@@ -1114,6 +1128,7 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
{
pud_t *pudp, old_pud;
retry:
pudp = stage2_get_pud(kvm, cache, addr);
VM_BUG_ON(!pudp);
......@@ -1121,14 +1136,23 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
/*
* A large number of vcpus faulting on the same stage 2 entry,
* can lead to a refault due to the
* stage2_pud_clear()/tlb_flush(). Skip updating the page
* tables if there is no change.
* can lead to a refault due to the stage2_pud_clear()/tlb_flush().
* Skip updating the page tables if there is no change.
*/
if (pud_val(old_pud) == pud_val(*new_pudp))
return 0;
if (stage2_pud_present(kvm, old_pud)) {
/*
* If we already have table level mapping for this block, unmap
* the range for this block and retry.
*/
if (!stage2_pud_huge(kvm, old_pud)) {
unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);
goto retry;
}
WARN_ON_ONCE(kvm_pud_pfn(old_pud) != kvm_pud_pfn(*new_pudp));
stage2_pud_clear(kvm, pudp);
kvm_tlb_flush_vmid_ipa(kvm, addr);
} else {
......@@ -1451,13 +1475,11 @@ static void stage2_wp_pmds(struct kvm *kvm, pud_t *pud,
}
/**
* stage2_wp_puds - write protect PGD range
* @pgd: pointer to pgd entry
* @addr: range start address
* @end: range end address
*
* Process PUD entries, for a huge PUD we cause a panic.
*/
* stage2_wp_puds - write protect PGD range
* @pgd: pointer to pgd entry
* @addr: range start address
* @end: range end address
*/
static void stage2_wp_puds(struct kvm *kvm, pgd_t *pgd,
phys_addr_t addr, phys_addr_t end)
{
......@@ -1594,8 +1616,9 @@ static void kvm_send_hwpoison_signal(unsigned long address,
send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
}
static bool fault_supports_stage2_pmd_mappings(struct kvm_memory_slot *memslot,
unsigned long hva)
static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
unsigned long hva,
unsigned long map_size)
{
gpa_t gpa_start;
hva_t uaddr_start, uaddr_end;
......@@ -1610,34 +1633,34 @@ static bool fault_supports_stage2_pmd_mappings(struct kvm_memory_slot *memslot,
/*
* Pages belonging to memslots that don't have the same alignment
* within a PMD for userspace and IPA cannot be mapped with stage-2
* PMD entries, because we'll end up mapping the wrong pages.
* within a PMD/PUD for userspace and IPA cannot be mapped with stage-2
* PMD/PUD entries, because we'll end up mapping the wrong pages.
*
* Consider a layout like the following:
*
* memslot->userspace_addr:
* +-----+--------------------+--------------------+---+
* |abcde|fgh Stage-1 PMD | Stage-1 PMD tv|xyz|
* |abcde|fgh Stage-1 block | Stage-1 block tv|xyz|
* +-----+--------------------+--------------------+---+
*
* memslot->base_gfn << PAGE_SIZE:
* +---+--------------------+--------------------+-----+
* |abc|def Stage-2 PMD | Stage-2 PMD |tvxyz|
* |abc|def Stage-2 block | Stage-2 block |tvxyz|
* +---+--------------------+--------------------+-----+
*
* If we create those stage-2 PMDs, we'll end up with this incorrect
* If we create those stage-2 blocks, we'll end up with this incorrect
* mapping:
* d -> f
* e -> g
* f -> h
*/
if ((gpa_start & ~S2_PMD_MASK) != (uaddr_start & ~S2_PMD_MASK))
if ((gpa_start & (map_size - 1)) != (uaddr_start & (map_size - 1)))
return false;
/*
* Next, let's make sure we're not trying to map anything not covered
* by the memslot. This means we have to prohibit PMD size mappings
* for the beginning and end of a non-PMD aligned and non-PMD sized
* by the memslot. This means we have to prohibit block size mappings
* for the beginning and end of a non-block aligned and non-block sized
* memory slot (illustrated by the head and tail parts of the
* userspace view above containing pages 'abcde' and 'xyz',
* respectively).
......@@ -1646,8 +1669,8 @@ static bool fault_supports_stage2_pmd_mappings(struct kvm_memory_slot *memslot,
* userspace_addr or the base_gfn, as both are equally aligned (per
* the check above) and equally sized.
*/
return (hva & S2_PMD_MASK) >= uaddr_start &&
(hva & S2_PMD_MASK) + S2_PMD_SIZE <= uaddr_end;
return (hva & ~(map_size - 1)) >= uaddr_start &&
(hva & ~(map_size - 1)) + map_size <= uaddr_end;
}
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
......@@ -1676,12 +1699,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return -EFAULT;
}
if (!fault_supports_stage2_pmd_mappings(memslot, hva))
force_pte = true;
if (logging_active)
force_pte = true;
/* Let's check if we will get back a huge page backed by hugetlbfs */
down_read(&current->mm->mmap_sem);
vma = find_vma_intersection(current->mm, hva, hva + 1);
......@@ -1692,6 +1709,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
}
vma_pagesize = vma_kernel_pagesize(vma);
if (logging_active ||
!fault_supports_stage2_huge_mapping(memslot, hva, vma_pagesize)) {
force_pte = true;
vma_pagesize = PAGE_SIZE;
}
/*
* The stage2 has a minimum of 2 level table (For arm64 see
* kvm_arm_setup_stage2()). Hence, we are guaranteed that we can
......@@ -1699,11 +1722,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* As for PUD huge maps, we must make sure that we have at least
* 3 levels, i.e, PMD is not folded.
*/
if ((vma_pagesize == PMD_SIZE ||
(vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm))) &&
!force_pte) {
if (vma_pagesize == PMD_SIZE ||
(vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm)))
gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
}
up_read(&current->mm->mmap_sem);
/* We need minimum second+third level pages */
......
......@@ -754,8 +754,9 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
u64 indirect_ptr, type = GITS_BASER_TYPE(baser);
phys_addr_t base = GITS_BASER_ADDR_48_to_52(baser);
int esz = GITS_BASER_ENTRY_SIZE(baser);
int index;
int index, idx;
gfn_t gfn;
bool ret;
switch (type) {
case GITS_BASER_TYPE_DEVICE:
......@@ -782,7 +783,8 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
if (eaddr)
*eaddr = addr;
return kvm_is_visible_gfn(its->dev->kvm, gfn);
goto out;
}
/* calculate and check the index into the 1st level */
......@@ -812,7 +814,12 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
if (eaddr)
*eaddr = indirect_ptr;
return kvm_is_visible_gfn(its->dev->kvm, gfn);
out:
idx = srcu_read_lock(&its->dev->kvm->srcu);
ret = kvm_is_visible_gfn(its->dev->kvm, gfn);
srcu_read_unlock(&its->dev->kvm->srcu, idx);
return ret;
}
static int vgic_its_alloc_collection(struct vgic_its *its,
......@@ -1729,8 +1736,8 @@ static void vgic_its_destroy(struct kvm_device *kvm_dev)
kfree(its);
}
int vgic_its_has_attr_regs(struct kvm_device *dev,
struct kvm_device_attr *attr)
static int vgic_its_has_attr_regs(struct kvm_device *dev,
struct kvm_device_attr *attr)
{
const struct vgic_register_region *region;
gpa_t offset = attr->attr;
......@@ -1750,9 +1757,9 @@ int vgic_its_has_attr_regs(struct kvm_device *dev,
return 0;
}
int vgic_its_attr_regs_access(struct kvm_device *dev,
struct kvm_device_attr *attr,
u64 *reg, bool is_write)
static int vgic_its_attr_regs_access(struct kvm_device *dev,
struct kvm_device_attr *attr,
u64 *reg, bool is_write)
{
const struct vgic_register_region *region;
struct vgic_its *its;
......@@ -1919,7 +1926,7 @@ static int vgic_its_save_ite(struct vgic_its *its, struct its_device *dev,
((u64)ite->irq->intid << KVM_ITS_ITE_PINTID_SHIFT) |
ite->collection->collection_id;
val = cpu_to_le64(val);
return kvm_write_guest(kvm, gpa, &val, ite_esz);
return kvm_write_guest_lock(kvm, gpa, &val, ite_esz);
}
/**
......@@ -2066,7 +2073,7 @@ static int vgic_its_save_dte(struct vgic_its *its, struct its_device *dev,
(itt_addr_field << KVM_ITS_DTE_ITTADDR_SHIFT) |
(dev->num_eventid_bits - 1));
val = cpu_to_le64(val);
return kvm_write_guest(kvm, ptr, &val, dte_esz);
return kvm_write_guest_lock(kvm, ptr, &val, dte_esz);
}
/**
......@@ -2246,7 +2253,7 @@ static int vgic_its_save_cte(struct vgic_its *its,
((u64)collection->target_addr << KVM_ITS_CTE_RDBASE_SHIFT) |
collection->collection_id);
val = cpu_to_le64(val);
return kvm_write_guest(its->dev->kvm, gpa, &val, esz);
return kvm_write_guest_lock(its->dev->kvm, gpa, &val, esz);
}
static int vgic_its_restore_cte(struct vgic_its *its, gpa_t gpa, int esz)
......@@ -2317,7 +2324,7 @@ static int vgic_its_save_collection_table(struct vgic_its *its)
*/
val = 0;
BUG_ON(cte_esz > sizeof(val));
ret = kvm_write_guest(its->dev->kvm, gpa, &val, cte_esz);
ret = kvm_write_guest_lock(its->dev->kvm, gpa, &val, cte_esz);
return ret;
}
......
......@@ -358,7 +358,7 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
if (status) {
/* clear consumed data */
val &= ~(1 << bit_nr);
ret = kvm_write_guest(kvm, ptr, &val, 1);
ret = kvm_write_guest_lock(kvm, ptr, &val, 1);
if (ret)
return ret;
}
......@@ -409,7 +409,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
else
val &= ~(1 << bit_nr);
ret = kvm_write_guest(kvm, ptr, &val, 1);
ret = kvm_write_guest_lock(kvm, ptr, &val, 1);
if (ret)
return ret;
}
......
......@@ -867,15 +867,21 @@ void kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu)
* either observe the new interrupt before or after doing this check,
* and introducing additional synchronization mechanism doesn't change
* this.
*
* Note that we still need to go through the whole thing if anything
* can be directly injected (GICv4).
*/
if (list_empty(&vcpu->arch.vgic_cpu.ap_list_head))
if (list_empty(&vcpu->arch.vgic_cpu.ap_list_head) &&
!vgic_supports_direct_msis(vcpu->kvm))
return;
DEBUG_SPINLOCK_BUG_ON(!irqs_disabled());
raw_spin_lock(&vcpu->arch.vgic_cpu.ap_list_lock);
vgic_flush_lr_state(vcpu);
raw_spin_unlock(&vcpu->arch.vgic_cpu.ap_list_lock);
if (!list_empty(&vcpu->arch.vgic_cpu.ap_list_head)) {
raw_spin_lock(&vcpu->arch.vgic_cpu.ap_list_lock);
vgic_flush_lr_state(vcpu);
raw_spin_unlock(&vcpu->arch.vgic_cpu.ap_list_lock);
}
if (can_access_vgic_from_kernel())
vgic_restore_state(vcpu);
......
......@@ -214,9 +214,9 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
if (flags & EPOLLHUP) {
/* The eventfd is closing, detach from KVM */
unsigned long flags;
unsigned long iflags;
spin_lock_irqsave(&kvm->irqfds.lock, flags);
spin_lock_irqsave(&kvm->irqfds.lock, iflags);
/*
* We must check if someone deactivated the irqfd before
......@@ -230,7 +230,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
if (irqfd_is_active(irqfd))
irqfd_deactivate(irqfd);
spin_unlock_irqrestore(&kvm->irqfds.lock, flags);
spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
}
return 0;
......
......@@ -2905,6 +2905,9 @@ static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
{
struct kvm_device *dev = filp->private_data;
if (dev->kvm->mm != current->mm)
return -EIO;
switch (ioctl) {
case KVM_SET_DEVICE_ATTR:
return kvm_device_ioctl_attr(dev, dev->ops->set_attr, arg);
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册