Commits · 4566654bb9be9e8864df417bb72ceee5136b6a6a · openeuler / Kernel

24 Sep, 2014 10 commits

KVM: vmx: Inject #GP on invalid PAT CR · 4566654b

Authored by Nadav Amit on Sep 18, 2014

A guest that sets the PAT CR to an invalid value should get a #GP.  Currently, if
vmx supports loading the PAT CR during entry, the value is not checked.  This
patch makes the required check in that case.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

4566654b
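
A minimal sketch of the kind of check this message describes, assuming the architectural PAT layout of eight 8-bit fields in which only memory types 0, 1, 4, 5, 6 and 7 are defined. The helper name `pat_value_is_valid` is hypothetical and is not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if every one of the eight 8-bit
 * PAT fields holds a defined memory type (0, 1, 4, 5, 6 or 7).
 */
static bool pat_value_is_valid(uint64_t pat)
{
	for (int i = 0; i < 8; i++) {
		uint8_t type = (pat >> (i * 8)) & 0xff;

		/* 2, 3 and anything above 7 are reserved encodings. */
		if (type > 7 || type == 2 || type == 3)
			return false;
	}
	return true;
}
```

A caller would inject #GP into the guest whenever such a check fails, instead of loading the value on entry.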

KVM: x86: emulating descriptor load misses long-mode case · 040c8dc8

Authored by Nadav Amit on Sep 18, 2014

In 64-bit mode a #GP should be delivered to the guest "if the code segment
descriptor pointed to by the selector in the 64-bit gate doesn't have the L-bit
set and the D-bit clear." - Intel SDM "Interrupt 13 - General Protection
Exception (#GP)".

This patch fixes the behavior of CS loading emulation code. Although the
comment says that segment loading is not supported in long mode, this function
is executed in long mode, so the fix is necessary.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

040c8dc8
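
The rule quoted above can be sketched as a small predicate. This is an illustration only: the structure `cs_desc` and the function `cs_load_should_gp` are hypothetical, and the exact condition checked by the patch is assumed here to be "L and D both set while long mode is active".

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a code-segment descriptor. */
struct cs_desc {
	bool l; /* L-bit: 64-bit code segment */
	bool d; /* D-bit: default operand size */
};

/*
 * Returns true if loading this descriptor into CS should raise #GP.
 * Assumption: with long mode active, L=1 together with D=1 is rejected.
 */
static bool cs_load_should_gp(const struct cs_desc *desc, bool long_mode_active)
{
	return long_mode_active && desc->l && desc->d;
}
```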

KVM: x86: directly use kvm_make_request again · 77c3913b

Authored by Liang Chen on Sep 18, 2014

A one-line wrapper around kvm_make_request is not particularly
useful. Replace kvm_mmu_flush_tlb() with kvm_make_request().

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

77c3913b

KVM: x86: count actual tlb flushes · a70656b6

Authored by Radim Krčmář on Sep 18, 2014

- we count KVM_REQ_TLB_FLUSH requests, not actual flushes
  (KVM can have multiple requests for one flush)
- flushes from kvm_flush_remote_tlbs aren't counted
- it's easy to make a direct request by mistake

Solve these by postponing the counting to kvm_check_request().

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a70656b6
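
To illustrate why counting at check time differs from counting at request time, here is a toy model. All names (`toy_vcpu`, `toy_make_request`, `toy_check_request`, `toy_vcpu_enter`, `REQ_TLB_FLUSH`) are hypothetical stand-ins, not the kernel's definitions; the point is only that several pending requests collapse into one serviced flush.

```c
#include <stdbool.h>

#define REQ_TLB_FLUSH 0u /* hypothetical request number */

struct toy_vcpu {
	unsigned long requests;  /* bitmap of pending requests */
	unsigned long tlb_flush; /* statistic: flushes actually performed */
};

/* Setting an already-set bit is a no-op, so requests coalesce. */
static void toy_make_request(unsigned int req, struct toy_vcpu *vcpu)
{
	vcpu->requests |= 1ul << req;
}

/* Test-and-clear of a pending request. */
static bool toy_check_request(unsigned int req, struct toy_vcpu *vcpu)
{
	if (!(vcpu->requests & (1ul << req)))
		return false;
	vcpu->requests &= ~(1ul << req);
	return true;
}

/* The statistic is bumped where the request is serviced. */
static void toy_vcpu_enter(struct toy_vcpu *vcpu)
{
	if (toy_check_request(REQ_TLB_FLUSH, vcpu))
		vcpu->tlb_flush++; /* one count per actual flush */
}
```

With this arrangement, a direct request made without going through a wrapper is still counted, because the count no longer depends on who raised the request.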

KVM: nested VMX: disable perf cpuid reporting · bc613494

Authored by Marcelo Tosatti on Sep 18, 2014

Initialization of L2 guest with -cpu host, on L1 guest with -cpu host
triggers:

(qemu) KVM: entry failed, hardware error 0x7
...
nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported

Nested VMX MSR load/store support is not sufficient to
allow perf for L2 guest.

Until properly fixed, trap CPUID and disable function 0xA.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

bc613494

KVM: x86: Don't report guest userspace emulation error to userspace · a2b9e6c1

Authored by Nadav Amit on Sep 17, 2014

Commit fc3a9157 ("KVM: X86: Don't report L2 emulation failures to
user-space") disabled the reporting of L2 (nested guest) emulation failures to
userspace due to a race condition between a vmexit and the instruction emulator.
The same rationale also applies to userspace applications that are permitted by
the guest OS to access an MMIO area or perform PIO.

This patch extends the current behavior (injecting a #UD instead of
reporting the failure to userspace) to guest userspace code as well.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a2b9e6c1

kvm: Make init_rmode_tss() return 0 on success. · 1f755a82

Authored by Paolo Bonzini on Sep 16, 2014

In init_rmode_tss(), there are two variables indicating the return
value, r and ret, and it returns 0 on error, 1 on success. The function
is only called by vmx_set_tss_addr(), and ret is redundant.

This patch removes the redundant variable, by making init_rmode_tss()
return 0 on success, -errno on failure.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

1f755a82
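
The "0 on success, -errno on failure" convention means a single variable can carry the result straight through to the caller. A minimal sketch, with hypothetical functions `toy_write_page` and `toy_init` standing in for the real steps:

```c
#include <errno.h>

/* Hypothetical stand-in for a step that may fail with a negative errno. */
static int toy_write_page(int should_fail)
{
	return should_fail ? -EFAULT : 0;
}

/* Returns 0 on success, negative errno on failure; one variable suffices. */
static int toy_init(int should_fail)
{
	int r;

	r = toy_write_page(should_fail);
	if (r < 0)
		return r; /* propagate the error code unchanged */

	/* further initialization steps would follow here */
	return 0;
}
```

The caller can then return the value as is, without translating between a boolean-style result and an error code.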

KVM: x86: Warn if guest virtual address space is not 48-bits · dd598091

Authored by Nadav Amit on Sep 16, 2014

The KVM emulator code assumes that the guest virtual address space (in 64-bit)
is 48 bits wide. Fail the KVM_SET_CPUID and KVM_SET_CPUID2 ioctl if
userspace tries to create a guest that does not obey this restriction.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

dd598091
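
A sketch of such a restriction, assuming the virtual address width is read from bits 15:8 of EAX in CPUID leaf 0x80000008. The function name `toy_check_vaddr_bits` is hypothetical, and treating a zero field as acceptable is an assumption made for this illustration.

```c
#include <errno.h>
#include <stdint.h>

/*
 * eax is the EAX value the guest would see for CPUID leaf 0x80000008.
 * Bits 15:8 hold the linear (virtual) address width.
 */
static int toy_check_vaddr_bits(uint32_t eax)
{
	unsigned int vaddr_bits = (eax >> 8) & 0xff;

	/* Assumption: 0 means the field is unpopulated and is tolerated. */
	if (vaddr_bits != 48 && vaddr_bits != 0)
		return -EINVAL;
	return 0;
}
```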

kvm-vfio: do not use module_init · 3c3c29fd

Authored by Paolo Bonzini on Sep 24, 2014

/me got confused between the kernel and QEMU.  In the kernel, you can
only have one module_init function, and it will prevent unloading the
module unless you also have the corresponding module_exit function.

So, commit 80ce1639 (KVM: VFIO: register kvm_device_ops dynamically,
2014-09-02) broke unloading of the kvm module, by adding a module_init
function and no module_exit.

Repair it by making kvm_vfio_ops_init weak, and checking it in
kvm_init.

Cc: Will Deacon <will.deacon@arm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Alex Williamson <Alex.Williamson@redhat.com>
Fixes: 80ce1639
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

3c3c29fd

KVM: EVENTFD: Remove inclusion of irq.h · 29f1b65b

Authored by Christoffer Dall on Sep 22, 2014

Commit c77dcacb (KVM: Move more code under CONFIG_HAVE_KVM_IRQFD) added
functionality that depends on definitions in ioapic.h when
__KVM_HAVE_IOAPIC is defined.

At the same time, kvm-arm commit 0ba09511 (KVM: EVENTFD: remove inclusion
of irq.h) removed the inclusion of irq.h, an architecture-specific header
that is not present on ARM but which happened to include ioapic.h on x86.

Include ioapic.h directly in eventfd.c if __KVM_HAVE_IOAPIC is defined.
This fixes x86 and lets ARM use eventfd.c.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

29f1b65b

17 Sep, 2014 6 commits

kvm: Make init_rmode_identity_map() return 0 on success. · f51770ed

Authored by Tang Chen on Sep 16, 2014

In init_rmode_identity_map(), there are two variables indicating the return
value, r and ret, and it returns 0 on error, 1 on success. The function
is only called by vmx_create_vcpu(), and ret is redundant.

This patch removes the redundant variable, and makes init_rmode_identity_map()
return 0 on success, -errno on failure.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

f51770ed

kvm: Remove ept_identity_pagetable from struct kvm_arch. · a255d479

Authored by Tang Chen on Sep 16, 2014

kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
it is never used to refer to the page at all.

In vcpu initialization, it indicates two things:
1. indicates if ept page is allocated
2. indicates if a memory slot for identity page is initialized

Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
identity pagetable is initialized. So we can remove ept_identity_pagetable.

NOTE: In the original code, the ept identity pagetable page is pinned in memory.
As a result, it cannot be migrated/hot-removed. After this patch, since
kvm_arch->ept_identity_pagetable is removed, the ept identity pagetable page
is no longer pinned in memory and can be migrated/hot-removed.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a255d479

KVM: VFIO: register kvm_device_ops dynamically · 80ce1639

Authored by Will Deacon on Sep 02, 2014

Now that we have a dynamic means to register kvm_device_ops, use that
for the VFIO kvm device, instead of relying on the static table.

This is achieved by a module_init call to register the ops with KVM.

Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Alex Williamson <Alex.Williamson@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

80ce1639

KVM: s390: register flic ops dynamically · 84877d93

Authored by Cornelia Huck on Sep 02, 2014

Using the new kvm_register_device_ops() interface makes us get rid of
an #ifdef in common code.

Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

84877d93

KVM: ARM: vgic: register kvm_device_ops dynamically · c06a841b

Authored by Will Deacon on Sep 02, 2014

Now that we have a dynamic means to register kvm_device_ops, use that
for the ARM VGIC, instead of relying on the static table.

Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

c06a841b

KVM: device: add simple registration mechanism for kvm_device_ops · d60eacb0

Authored by Will Deacon on Sep 02, 2014

kvm_ioctl_create_device currently has knowledge of all the device types
and their associated ops. This is fairly inflexible when adding support
for new in-kernel device emulations, so move what we currently have out
into a table, which can support dynamic registration of ops by new
drivers for virtual hardware.

Cc: Alex Williamson <Alex.Williamson@redhat.com>
Cc: Alex Graf <agraf@suse.de>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

d60eacb0
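
A table-based registration scheme of the kind described in this message might look roughly like the following. This is a self-contained toy, not the kernel code: `toy_device_ops`, `toy_register_device_ops`, `toy_create_device`, the table size and the chosen error codes are all assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_DEV_TYPE_MAX 8 /* hypothetical table size */

struct toy_device_ops {
	const char *name;
	int (*create)(void);
};

/* One slot per device type; NULL means "nothing registered". */
static const struct toy_device_ops *toy_ops_table[TOY_DEV_TYPE_MAX];

/* Called by a driver to make its device type available. */
static int toy_register_device_ops(const struct toy_device_ops *ops,
				   unsigned int type)
{
	if (type >= TOY_DEV_TYPE_MAX)
		return -ENOSPC;
	if (toy_ops_table[type] != NULL)
		return -EEXIST;
	toy_ops_table[type] = ops;
	return 0;
}

/* The create path looks the type up and needs no per-device knowledge. */
static int toy_create_device(unsigned int type)
{
	if (type >= TOY_DEV_TYPE_MAX || toy_ops_table[type] == NULL)
		return -ENODEV;
	return toy_ops_table[type]->create();
}

/* Example driver-side ops (hypothetical). */
static int toy_vfio_create(void)
{
	return 0;
}

static const struct toy_device_ops toy_vfio_ops = {
	.name = "toy-vfio",
	.create = toy_vfio_create,
};
```

A driver would call `toy_register_device_ops(&toy_vfio_ops, type)` once during its own initialization, after which `toy_create_device(type)` succeeds for that type.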

16 Sep, 2014 2 commits

kvm: ioapic: conditionally delay irq delivery during eoi broadcast · 184564ef

Authored by Zhang Haoyu on Sep 11, 2014

Currently, we call ioapic_service() immediately when we find the irq is still
active during eoi broadcast. But for real hardware, there's some delay between
the EOI writing and irq delivery. If we do not emulate this behavior, and
re-inject the interrupt immediately after the guest sends an EOI and re-enables
interrupts, a guest might spend all its time in the ISR if it has a broken
handler for a level-triggered interrupt.

Such livelock actually happens with Windows guests when resuming from
hibernation.

As there's no way to distinguish a broken handler from newly raised interrupts,
this patch delays an interrupt if 10,000 consecutive EOIs found that the
interrupt was still high. The guest can then make a little forward progress,
until a proper IRQ handler is set or until some detection routine in the guest
(such as Linux's note_interrupt()) recognizes the situation.

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Zhang Haoyu <zhanghy@sangfor.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

184564ef
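
The counting policy described above can be sketched as follows. The names `toy_pin`, `toy_eoi_should_delay` and `TOY_SUCCESSIVE_IRQ_MAX` are hypothetical; only the threshold of 10,000 consecutive EOIs comes from the message, and how the delayed delivery is scheduled is left out.

```c
#include <stdbool.h>

#define TOY_SUCCESSIVE_IRQ_MAX 10000 /* threshold quoted in the message */

struct toy_pin {
	unsigned int irq_eoi; /* consecutive EOIs that found the line high */
};

/*
 * Called on EOI for a level-triggered pin. Returns true if re-injection
 * should be deferred, false if the interrupt can be delivered right away
 * (or if there is nothing to deliver).
 */
static bool toy_eoi_should_delay(struct toy_pin *pin, bool line_still_high)
{
	if (!line_still_high) {
		pin->irq_eoi = 0; /* the streak is broken */
		return false;
	}
	if (++pin->irq_eoi == TOY_SUCCESSIVE_IRQ_MAX) {
		pin->irq_eoi = 0; /* start counting afresh */
		return true;      /* give the guest time to make progress */
	}
	return false;
}
```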

KVM: x86: Use kvm_make_request when applicable · 105b21bb

Authored by Guo Hui Liu on Sep 12, 2014

This patch replaces the set_bit method with kvm_make_request
to make the code more readable and consistent.

Signed-off-by: Guo Hui Liu <liuguohui@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

105b21bb

11 Sep, 2014 3 commits

KVM: x86: make apic_accept_irq tracepoint more generic · a183b638

Authored by Paolo Bonzini on Sep 11, 2014

Initially the tracepoint was added only to the APIC_DM_FIXED case,
also because it reported coalesced interrupts that only made sense
for that case.  However, the coalesced argument is not used anymore
and tracing other delivery modes is useful, so hoist the call out
of the switch statement.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a183b638

kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address. · 73a6d941

Authored by Tang Chen on Sep 11, 2014

We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the address of
the apic access page. So use this macro.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

73a6d941

Merge tag 'kvm-s390-next-20140910' of... · 2c69c1a3

Authored by Paolo Bonzini on Sep 11, 2014

Merge tag 'kvm-s390-next-20140910' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-next

KVM: s390: Fixes and features for next (3.18)

1. Crypto/CPACF support: To enable the MSA4 instructions we have to
   provide a common control structure for each SIE control block
2. Two cleanups found by a static code checker: one redundant assignment
   and one useless if
3. Fix the page handling of the diag10 ballooning interface. If the
   guest freed the pages at absolute 0 some checks and frees were
   incorrect
4. Limit guests to 16TB
5. Add __must_check to interrupt injection code

2c69c1a3

10 Sep, 2014 9 commits

KVM: s390/interrupt: remove double assignment · bfac1f59

Authored by Christian Borntraeger on Sep 03, 2014

r is already initialized to 0.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>

bfac1f59

KVM: s390/cmm: Fix prefix handling for diag 10 balloon · f7a960af

Authored by Christian Borntraeger on Sep 03, 2014

The old handling of prefix pages was broken in the diag10 ballooner.
We now rely on gmap_discard to check for start > end and do a
slow path if the prefix swap pages are affected:
1. discard the pages from start to prefix
2. discard the absolute 0 pages
3. discard the pages after prefix swap to end

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>

f7a960af

KVM: s390: get rid of constant condition in ipte_unlock_simple · 6b331952

Authored by Christian Borntraeger on Sep 03, 2014

Due to the earlier check we know that ipte_lock_count must be 0.
No need to add a useless if. Let's make clear that we are going
to always wake up when we execute that code.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>

6b331952

KVM: s390: unintended fallthrough for external call · f346026e

Authored by Christian Borntraeger on Sep 03, 2014

We must not fall through if the conditions for external call are not met.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org

f346026e

KVM: s390: Limit guest size to 16TB · 0349985a

Authored by Christian Borntraeger on Aug 25, 2014

Currently we fill up a full 5 level page table to hold the guest
mapping. Since commit "support gmap page tables with less than 5
levels" we can do better.
Having more than 4 TB might be useful for some testing scenarios,
so let's just limit ourselves to 16TB guest size.
Having more than that is totally untested as I do not have enough
swap space/memory.

We continue to allow ucontrol the full size.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>

0349985a

KVM: s390: add __must_check to interrupt deliver functions · 614aeab4

Authored by Christian Borntraeger on Aug 25, 2014

We now propagate interrupt injection errors back to the ioctl. We
should mark functions that might fail with __must_check.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Jens Freimann <jfrei@linux.vnet.ibm.com>

614aeab4

KVM: CPACF: Enable MSA4 instructions for kvm guest · 5102ee87

Authored by Tony Krowiak on Jun 27, 2014

We have to provide a per guest crypto block for the CPUs to
enable MSA4 instructions. According to icainfo on z196 or
later this enables CCM-AES-128, CMAC-AES-128, CMAC-AES-192
and CMAC-AES-256.

Signed-off-by: Tony Krowiak <akrowiak@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
[split MSA4/protected key into two patches]

5102ee87

KVM: fix api documentation of KVM_GET_EMULATED_CPUID · 209cf19f

Authored by Alex Bennée on Sep 09, 2014

It looks like when this was initially merged it got accidentally included
in the following section. I've just moved it back in the correct section
and re-numbered it as other ioctls have been added since.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

209cf19f

KVM: document KVM_SET_GUEST_DEBUG api · 4bd9d344

Authored by Alex Bennée on Sep 09, 2014

In preparation for working on the ARM implementation I noticed the debug
interface was missing from the API document. I've pieced together the
expected behaviour from the code and commit messages and written it up as
best I can.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

4bd9d344

05 Sep, 2014 5 commits

KVM: remove redundant assignments in __kvm_set_memory_region · f2a25160

Authored by Christian Borntraeger on Sep 04, 2014

__kvm_set_memory_region sets r to EINVAL very early.
Doing it again is not necessary. The same is true later on, where
r is assigned -ENOMEM twice.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

f2a25160

KVM: remove redundant assignment of return value in kvm_dev_ioctl · a13f533b

Authored by Christian Borntraeger on Sep 04, 2014

The first statement of kvm_dev_ioctl is
        long r = -EINVAL;

No need to reassign the same value.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a13f533b

KVM: remove redundant check of in_spin_loop · 34656113

Authored by Christian Borntraeger on Sep 04, 2014

The expression `vcpu->spin_loop.in_spin_loop' is always true,
because it is evaluated only when the condition
`!vcpu->spin_loop.in_spin_loop' is false.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

34656113

KVM: x86: propagate exception from permission checks on the nested page fault · 54987b7a

Authored by Paolo Bonzini on Sep 02, 2014

Currently, if a permission error happens during the translation of
the final GPA to HPA, walk_addr_generic returns 0 but does not fill
in walker->fault.  To avoid this, add an x86_exception* argument
to the translate_gpa function, and let it fill in walker->fault.
The nested_page_fault field will be true, since the walk_mmu is the
nested_mmu and translate_gpa instead operates on the "outer" (NPT)
instance.

Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

54987b7a

KVM: x86: skip writeback on injection of nested exception · ef54bcfe

Authored by Paolo Bonzini on Sep 04, 2014

If a nested page fault happens during emulation, we will inject a vmexit,
not a page fault.  However, because writeback happens after the injection,
we will write ctxt->eip from L2 into the L1 EIP.  We do not write back
if an instruction caused an interception vmexit; do the same for page
faults.

Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

ef54bcfe

03 Sep, 2014 5 commits

KVM: nSVM: propagate the NPF EXITINFO to the guest · 5e352519

Authored by Paolo Bonzini on Sep 02, 2014

This is similar to what the EPT code does with the exit qualification.
This allows the guest to see a valid value for bits 33:32.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

5e352519

KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD · a0c0feb5

Authored by Paolo Bonzini on Sep 02, 2014

Bit 8 would be the "global" bit, which does not quite make sense for non-leaf
page table entries. Intel ignores it; AMD ignores it in PDEs, but reserves it
in PDPEs and PML4Es. The SVM test is relying on this behavior, so enforce it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a0c0feb5

KVM: mmio: cleanup kvm_set_mmio_spte_mask · d1431483

Authored by Tiejun Chen on Sep 01, 2014

Just reuse rsvd_bits() inside kvm_set_mmio_spte_mask()
for slightly better code.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

d1431483
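
For readers unfamiliar with the helper, a mask builder in the spirit of rsvd_bits() can be written as below. The body is a sketch assumed for illustration (a contiguous mask covering bits s through e inclusive), not a copy of the kernel source.

```c
#include <stdint.h>

/*
 * Returns a mask with bits s..e (inclusive) set.
 * Assumes 0 <= s <= e and e - s < 63.
 */
static inline uint64_t rsvd_bits(int s, int e)
{
	return ((1ULL << (e - s + 1)) - 1) << s;
}
```

For example, `rsvd_bits(4, 7)` yields `0xf0`, which is easier to read than an open-coded shift-and-subtract expression.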

kvm: x86: fix stale mmio cache bug · 56f17dd3

Authored by David Matlack on Aug 18, 2014

The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:

(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.

(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.

(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.

This patch fixes the issue by using the memslot generation number
to validate the mmio cache.

Cc: stable@vger.kernel.org
Signed-off-by: David Matlack <dmatlack@google.com>
[xiaoguangrong: adjust the code to make it simpler for stable-tree fix.]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Tested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

56f17dd3
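
The idea of validating a per-vcpu cache against the memslot generation can be sketched with a toy structure. The names `toy_mmio_cache`, `toy_cache_mmio` and `toy_match_mmio` are hypothetical; the sketch only shows that a cached gfn stops matching once the generation has moved on.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_mmio_cache {
	uint64_t gfn; /* cached guest frame number */
	uint64_t gen; /* memslot generation when the entry was cached */
	bool valid;
};

/* Record a gfn together with the generation that was current. */
static void toy_cache_mmio(struct toy_mmio_cache *c, uint64_t gfn,
			   uint64_t cur_gen)
{
	c->gfn = gfn;
	c->gen = cur_gen;
	c->valid = true;
}

/* A hit requires both the same gfn and an unchanged generation. */
static bool toy_match_mmio(const struct toy_mmio_cache *c, uint64_t gfn,
			   uint64_t cur_gen)
{
	return c->valid && c->gen == cur_gen && c->gfn == gfn;
}
```

In scenario (3) above, the generation bumped in step (2) would make the lookup miss, so the access would go through the slow path and find the new memory slot.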

kvm: fix potentially corrupt mmio cache · ee3d1570

Authored by David Matlack on Aug 18, 2014

vcpu exits and memslot mutations can run concurrently as long as the
vcpu does not acquire the slots mutex. Thus it is theoretically possible
for memslots to change underneath a vcpu that is handling an exit.

If we increment the memslot generation number again after
synchronize_srcu_expedited(), vcpus can safely cache memslot generation
without maintaining a single rcu_dereference through an entire vm exit.
And much of the x86/kvm code does not maintain a single rcu_dereference
of the current memslots during each exit.

We can prevent the following case:

   vcpu (CPU 0)                             | thread (CPU 1)
--------------------------------------------+--------------------------
1  vm exit                                  |
2  srcu_read_unlock(&kvm->srcu)             |
3  decide to cache something based on       |
     old memslots                           |
4                                           | change memslots
                                            | (increments generation)
5                                           | synchronize_srcu(&kvm->srcu);
6  retrieve generation # from new memslots  |
7  tag cache with new memslot generation    |
8  srcu_read_unlock(&kvm->srcu)             |
...                                         |
   <action based on cache occurs even       |
    though the caching decision was based   |
    on the old memslots>                    |
...                                         |
   <action *continues* to occur until next  |
    memslot generation change, which may    |
    be never>                               |
                                            |

By incrementing the generation after synchronizing with kvm->srcu readers,
we ensure that the generation retrieved in (6) will become invalid soon
after (8).

Keeping the existing increment is not strictly necessary, but we
do keep it and just move it for consistency from update_memslots to
install_new_memslots.  It invalidates old cached MMIOs immediately,
instead of having to wait for the end of synchronize_srcu_expedited,
which makes the code more clearly correct in case CPU 1 is preempted
right after synchronize_srcu() returns.

To avoid halving the generation space in SPTEs, always presume that the
low bit of the generation is zero when reconstructing a generation number
out of an SPTE.  This effectively disables MMIO caching in SPTEs during
the call to synchronize_srcu_expedited.  Using the low bit this way is
somewhat like a seqcount---where the protected thing is a cache, and
instead of retrying we can simply punt if we observe the low bit to be 1.

Cc: stable@vger.kernel.org
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

ee3d1570
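
The seqcount-like use of the low generation bit can be illustrated with a toy model. Everything here is hypothetical (`toy_memslots`, `toy_install_begin`, `toy_install_end`, `toy_spte_generation`, `toy_spte_cache_usable`); it assumes the generation is incremented once before and once after synchronizing with readers, so it is odd only while an update is in flight.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_memslots {
	uint64_t generation;
};

/* Writer side: the generation is odd for the duration of the update. */
static void toy_install_begin(struct toy_memslots *s)
{
	s->generation++;
}

/* ... synchronization with readers would happen between the two ... */

static void toy_install_end(struct toy_memslots *s)
{
	s->generation++;
}

/* Generation as reconstructed from an SPTE: low bit presumed zero. */
static uint64_t toy_spte_generation(uint64_t stored)
{
	return stored & ~1ULL;
}

/*
 * While the current generation is odd, no reconstructed (even) value can
 * match, so cached MMIO information is simply not used during an update.
 */
static bool toy_spte_cache_usable(const struct toy_memslots *s, uint64_t stored)
{
	return toy_spte_generation(stored) == s->generation;
}
```

Unlike a real seqcount reader, nothing retries here: a mismatch just means the cached entry is ignored and the slow path is taken.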

openeuler / Kernel synced successfully more than 1 year ago
