提交 · 382409b4c43e5b44ae4a869ff793d3cf01d12004 · openeuler / Kernel

25 5月, 2019 1 次提交

KVM: Fix spinlock taken warning during host resume · 2eb06c30

由 Wanpeng Li 提交于 5月 17, 2019

 WARNING: CPU: 0 PID: 13554 at kvm/arch/x86/kvm//../../../virt/kvm/kvm_main.c:4183 kvm_resume+0x3c/0x40 [kvm]
  CPU: 0 PID: 13554 Comm: step_after_susp Tainted: G           OE     5.1.0-rc4+ #1
  RIP: 0010:kvm_resume+0x3c/0x40 [kvm]
  Call Trace:
   syscore_resume+0x63/0x2d0
   suspend_devices_and_enter+0x9d1/0xa40
   pm_suspend+0x33a/0x3b0
   state_store+0x82/0xf0
   kobj_attr_store+0x12/0x20
   sysfs_kf_write+0x4b/0x60
   kernfs_fop_write+0x120/0x1a0
   __vfs_write+0x1b/0x40
   vfs_write+0xcd/0x1d0
   ksys_write+0x5f/0xe0
   __x64_sys_write+0x1a/0x20
   do_syscall_64+0x6f/0x6c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Commit ca84d1a2 (KVM: x86: Add clock sync request to hardware enable) mentioned
that "we always hold kvm_lock when hardware_enable is called.  The one place that
doesn't need to worry about it is resume, as resuming a frozen CPU, the spinlock
won't be taken." However, commit 6706dae9 (virt/kvm: Replace spin_is_locked() with
lockdep) introduces a bug, it asserts when the lock is not held which is contrary
to the original goal.

This patch fixes it by WARN_ON when the lock is held.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Fixes: 6706dae9 ("virt/kvm: Replace spin_is_locked() with lockdep")
[Wrap with #ifdef CONFIG_LOCKDEP - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2eb06c30

17 5月, 2019 1 次提交

kvm: fix compilation on aarch64 · c011d23b

由 Paolo Bonzini 提交于 5月 17, 2019

Commit e45adf66 ("KVM: Introduce a new guest mapping API", 2019-01-31)
introduced a build failure on aarch64 defconfig:

$ make -j$(nproc) ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- O=out defconfig \
                Image.gz
...
../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:
    In function '__kvm_map_gfn':
../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1763:9: error:
    implicit declaration of function 'memremap'; did you mean 'memset_p'?
../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1763:46: error:
    'MEMREMAP_WB' undeclared (first use in this function)
../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:
    In function 'kvm_vcpu_unmap':
../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1795:3: error:
    implicit declaration of function 'memunmap'; did you mean 'vm_munmap'?

because these functions are declared in <linux/io.h> rather than <asm/io.h>,
and the former was being pulled in already on x86 but not on aarch64.
Reported-by: NNathan Chancellor <natechancellor@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c011d23b

15 5月, 2019 1 次提交

mm/mmu_notifier: convert user range->blockable to helper function · dfcd6660

由 Jérôme Glisse 提交于 5月 13, 2019

Use the mmu_notifier_range_blockable() helper function instead of directly
dereferencing the range->blockable field.  This is done to make it easier
to change the mmu_notifier range field.

This patch is the outcome of the following coccinelle patch:

%<-------------------------------------------------------------------
@@
identifier I1, FN;
@@
FN(..., struct mmu_notifier_range *I1, ...) {
<...
-I1->blockable
+mmu_notifier_range_blockable(I1)
...>
}
------------------------------------------------------------------->%

spatch --in-place --sp-file blockable.spatch --dir .

Link: http://lkml.kernel.org/r/20190326164747.24405-3-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
Reviewed-by: NRalph Campbell <rcampbell@nvidia.com>
Reviewed-by: NIra Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dfcd6660

14 5月, 2019 1 次提交

KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device · 4894fbcc

由 Cédric Le Goater 提交于 5月 09, 2019

There is no need to test for the device pointer validity when releasing
a KVM device. The file descriptor should identify it safely.

Fixes: 2bde9b3e ("KVM: Introduce a 'release' method for KVM devices")
Signed-off-by: NCédric Le Goater <clg@kaod.org>
Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>

4894fbcc

08 5月, 2019 3 次提交

KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 · d7547c55

由 Peter Xu 提交于 5月 08, 2019

The previous KVM_CAP_MANUAL_DIRTY_LOG_PROTECT has some problem which
blocks the correct usage from userspace.  Obsolete the old one and
introduce a new capability bit for it.
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d7547c55

KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one · 53eac7a8

由 Peter Xu 提交于 5月 08, 2019

Just imaging the case where num_pages < BITS_PER_LONG, then the loop
will be skipped while it shouldn't.
Signed-off-by: NPeter Xu <peterx@redhat.com>
Fixes: 2a31b9dbSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>

53eac7a8

KVM: Fix the bitmap range to copy during clear dirty · 4ddc9204

由 Peter Xu 提交于 5月 08, 2019

kvm_dirty_bitmap_bytes() will return the size of the dirty bitmap of
the memslot rather than the size of bitmap passed over from the ioctl.
Here for KVM_CLEAR_DIRTY_LOG we should only copy exactly the size of
bitmap that covers kvm_clear_dirty_log.num_pages.
Signed-off-by: NPeter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 2a31b9dbSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4ddc9204

01 5月, 2019 4 次提交

KVM: Introduce a new guest mapping API · e45adf66

由 KarimAllah Ahmed 提交于 1月 31, 2019

In KVM, specially for nested guests, there is a dominant pattern of:

	=> map guest memory -> do_something -> unmap guest memory

In addition to all this unnecessarily noise in the code due to boiler plate
code, most of the time the mapping function does not properly handle memory
that is not backed by "struct page". This new guest mapping API encapsulate
most of this boiler plate code and also handles guest memory that is not
backed by "struct page".

The current implementation of this API is using memremap for memory that is
not backed by a "struct page" which would lead to a huge slow-down if it
was used for high-frequency mapping operations. The API does not have any
effect on current setups where guest memory is backed by a "struct page".
Further patches are going to also introduce a pfn-cache which would
significantly improve the performance of the memremap case.
Signed-off-by: NKarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e45adf66

kvm_main: fix some comments · b8b00220

由 Jiang Biao 提交于 4月 23, 2019

is_dirty has been renamed to flush, but the comment for it is
outdated. And the description about @flush parameter for
kvm_clear_dirty_log_protect() is missing, add it in this patch
as well.
Signed-off-by: NJiang Biao <benbjiang@tencent.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b8b00220

KVM: fix KVM_CLEAR_DIRTY_LOG for memory slots of unaligned size · 65c4189d

由 Paolo Bonzini 提交于 4月 17, 2019

If a memory slot's size is not a multiple of 64 pages (256K), then
the KVM_CLEAR_DIRTY_LOG API is unusable: clearing the final 64 pages
either requires the requested page range to go beyond memslot->npages,
or requires log->num_pages to be unaligned, and kvm_clear_dirty_log_protect
requires log->num_pages to be both in range and aligned.

To allow this case, allow log->num_pages not to be a multiple of 64 if
it ends exactly on the last page of the slot.
Reported-by: NPeter Xu <peterx@redhat.com>
Fixes: 98938aa8 ("KVM: validate userspace input in kvm_clear_dirty_log_protect()", 2019-01-02)
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

65c4189d

KVM: fix KVM_CLEAR_DIRTY_LOG for memory slots of unaligned size · 76d58e0f

由 Paolo Bonzini 提交于 4月 17, 2019

76d58e0f

30 4月, 2019 2 次提交

KVM: Introduce a 'release' method for KVM devices · 2bde9b3e

由 Cédric Le Goater 提交于 4月 18, 2019

When a P9 sPAPR VM boots, the CAS negotiation process determines which
interrupt mode to use (XICS legacy or XIVE native) and invokes a
machine reset to activate the chosen mode.

To be able to switch from one interrupt mode to another, we introduce
the capability to release a KVM device without destroying the VM. The
KVM device interface is extended with a new 'release' method which is
called when the file descriptor of the device is closed.

Once 'release' is called, the 'destroy' method will not be called
anymore as the device is removed from the device list of the VM.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NCédric Le Goater <clg@kaod.org>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>

2bde9b3e

KVM: Introduce a 'mmap' method for KVM devices · a1cd3f08

由 Cédric Le Goater 提交于 4月 18, 2019

Some KVM devices will want to handle special mappings related to the
underlying HW. For instance, the XIVE interrupt controller of the
POWER9 processor has MMIO pages for thread interrupt management and
for interrupt source control that need to be exposed to the guest when
the OS has the required support.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NCédric Le Goater <clg@kaod.org>
Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>

a1cd3f08

26 4月, 2019 1 次提交

KVM: polling: add architecture backend to disable polling · cdd6ad3a

由 Christian Borntraeger 提交于 3月 05, 2019

There are cases where halt polling is unwanted. For example when running
KVM on an over committed LPAR we rather want to give back the CPU to
neighbour LPARs instead of polling. Let us provide a callback that
allows architectures to disable polling.
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>

cdd6ad3a

16 4月, 2019 2 次提交

kvm: move KVM_CAP_NR_MEMSLOTS to common code · c110ae57

由 Paolo Bonzini 提交于 3月 28, 2019

All architectures except MIPS were defining it in the same way,
and memory slots are handled entirely by common code so there
is no point in keeping the definition per-architecture.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c110ae57

KVM: fix spectrev1 gadgets · 1d487e9b

由 Paolo Bonzini 提交于 4月 11, 2019

These were found with smatch, and then generalized when applicable.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1d487e9b

29 3月, 2019 1 次提交

KVM: Reject device ioctls from processes other than the VM's creator · ddba9180

由 Sean Christopherson 提交于 2月 15, 2019

KVM's API requires thats ioctls must be issued from the same process
that created the VM.  In other words, userspace can play games with a
VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
creator can do anything useful.  Explicitly reject device ioctls that
are issued by a process other than the VM's creator, and update KVM's
API documentation to extend its requirements to device ioctls.

Fixes: 852b6d57 ("kvm: add device control API")
Cc: <stable@vger.kernel.org>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ddba9180

01 3月, 2019 1 次提交

kvm: properly check debugfs dentry before using it · 8ed0579c

由 Greg Kroah-Hartman 提交于 2月 28, 2019

debugfs can now report an error code if something went wrong instead of
just NULL.  So if the return value is to be used as a "real" dentry, it
needs to be checked if it is an error before dereferencing it.

This is now happening because of ff9fb72b ("debugfs: return error
values, not NULL").  syzbot has found a way to trigger multiple debugfs
files attempting to be created, which fails, and then the error code
gets passed to dentry_path_raw() which obviously does not like it.
Reported-by: NEric Biggers <ebiggers@kernel.org>
Reported-and-tested-by: syzbot+7857962b4d45e602b8ad@syzkaller.appspotmail.com
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8ed0579c

23 2月, 2019 1 次提交

KVM: Minor cleanups for kvm_main.c · a2420107

由 Leo Yan 提交于 2月 22, 2019

This patch contains two minor cleanups: firstly it puts exported symbol
for kvm_io_bus_write() by following the function definition; secondly it
removes a redundant blank line.
Signed-off-by: NLeo Yan <leo.yan@linaro.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a2420107

21 2月, 2019 10 次提交

Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()" · a67794ca

由 Lan Tianyu 提交于 2月 02, 2019

The value of "dirty_bitmap[i]" is already check before setting its value
to mask. The following check of "mask" is redundant. The check of "mask" was
introduced by commit 58d2930f ("KVM: Eliminate extra function calls in
kvm_get_dirty_log_protect()"), revert it.
Signed-off-by: NLan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a67794ca

KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start · dee339b5

由 Nir Weiner 提交于 1月 27, 2019

grow_halt_poll_ns() have a strange behaviour in case
(vcpu->halt_poll_ns != 0) &&
(vcpu->halt_poll_ns < halt_poll_ns_grow_start).

In this case, vcpu->halt_poll_ns will be multiplied by grow factor
(halt_poll_ns_grow) which will require several grow iteration in order
to reach a value bigger than halt_poll_ns_grow_start.
This means that growing vcpu->halt_poll_ns from value of 0 is slower
than growing it from a positive value less than halt_poll_ns_grow_start.
Which is misleading and inaccurate.

Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
to halt_poll_ns_grow_start in any case that
(vcpu->halt_poll_ns < halt_poll_ns_grow_start).
Regardless if vcpu->halt_poll_ns is 0.

use READ_ONCE to get a consistent number for all cases.
Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: NLiran Alon <liran.alon@oracle.com>
Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

dee339b5

KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter · 49113d36

由 Nir Weiner 提交于 1月 27, 2019

The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
start value when raising up vcpu->halt_poll_ns.
It actually sets the first timeout to the first polling session.
This value has significant effect on how tolerant we are to outliers.
On the standard case, higher value is better - we will spend more time
in the polling busyloop, handle events/interrupts faster and result
in better performance.
But on outliers it puts us in a busy loop that does nothing.
Even if the shrink factor is zero, we will still waste time on the first
iteration.
The optimal value changes between different workloads. It depends on
outliers rate and polling sessions length.
As this value has significant effect on the dynamic halt-polling
algorithm, it should be configurable and exposed.
Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: NLiran Alon <liran.alon@oracle.com>
Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

49113d36

KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns · 7fa08e71

由 Nir Weiner 提交于 1月 27, 2019

grow_halt_poll_ns() have a strange behavior in case
(halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).

In this case, vcpu->halt_pol_ns will be set to zero.
That results in shrinking instead of growing.

Fix issue by changing grow_halt_poll_ns() to not modify
vcpu->halt_poll_ns in case halt_poll_ns_grow is zero
Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: NLiran Alon <liran.alon@oracle.com>
Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
Suggested-by: NLiran Alon <liran.alon@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7fa08e71

KVM: Move the memslot update in-progress flag to bit 63 · 164bf7e5

由 Sean Christopherson 提交于 2月 05, 2019

...now that KVM won't explode by moving it out of bit 0.  Using bit 63
eliminates the need to jump over bit 0, e.g. when calculating a new
memslots generation or when propagating the memslots generation to an
MMIO spte.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

164bf7e5

KVM: Remove the hack to trigger memslot generation wraparound · 0e32958e

由 Sean Christopherson 提交于 2月 05, 2019

x86 captures a subset of the memslot generation (19 bits) in its MMIO
sptes so that it can expedite emulated MMIO handling by checking only
the releveant spte, i.e. doesn't need to do a full page fault walk.

Because the MMIO sptes capture only 19 bits (due to limited space in
the sptes), there is a non-zero probability that the MMIO generation
could wrap, e.g. after 500k memslot updates.  Since normal usage is
extremely unlikely to result in 500k memslot updates, a hack was added
by commit 69c9ea93 ("KVM: MMU: init kvm generation close to mmio
wrap-around value") to offset the MMIO generation in order to trigger
a wraparound, e.g. after 150 memslot updates.

When separate memslot generation sequences were assigned to each
address space, commit 00f034a1 ("KVM: do not bias the generation
number in kvm_current_mmio_generation") moved the offset logic into the
initialization of the memslot generation itself so that the per-address
space bit(s) were not dropped/corrupted by the MMIO shenanigans.

Remove the offset hack for three reasons:

  - While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
    wrapping the generation doesn't actually test the interesting case
    of having stale MMIO sptes with the new generation number, e.g. old
    sptes with a generation number of 0.

  - Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
    performance rather important since the probability of invalidating
    MMIO sptes jumps from "effectively never" to "fairly likely".  This
    limits what can be done in future patches, e.g. to simplify the
    invalidation code, as doing so without proper caution could lead to
    a noticeable performance regression.

  - Forcing the memslots generation, which is a 64-bit number, to wrap
    prevents KVM from assuming the memslots generation will never wrap.
    This in turn prevents KVM from using an arbitrary bit for the
    "update in-progress" flag, e.g. using bit 63 would immediately
    collide with using a large value as the starting generation number.
    The "update in-progress" flag is effectively forced into bit 0 so
    that it's (subtly) taken into account when incrementing the
    generation.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0e32958e

KVM: Explicitly define the "memslot update in-progress" bit · 361209e0

由 Sean Christopherson 提交于 2月 05, 2019

KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing.  Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.

Prior to commit 4bd518f1 ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...

Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.

Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag.  This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of a awkward extension to the generation
number.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

361209e0

KVM: Call kvm_arch_memslots_updated() before updating memslots · 15248258

由 Sean Christopherson 提交于 2月 05, 2019

kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound. x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.

Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots. Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.

Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

15248258

kvm: Add memcg accounting to KVM allocations · b12ce36a

由 Ben Gardon 提交于 2月 11, 2019

There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.

Tested:
	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
	introduced no new failures.
	Ran a kernel memory accounting test which creates a VM to touch
	memory and then checks that the kernel memory allocated for the
	process is within certain bounds.
	With this patch we account for much more of the vmalloc and slab memory
	allocated for the VM.

There remain a few allocations which should be charged to the VM's
cgroup but are not. In they include:
        vcpu->run
        kvm->coalesced_mmio_ring
There allocations are unaccounted in this patch because they are mapped
to userspace, and accounting them to a cgroup causes problems. This
should be addressed in a future patch.
Signed-off-by: NBen Gardon <bgardon@google.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b12ce36a

kvm: Use struct_size() in kmalloc() · 90952cd3

由 Gustavo A. R. Silva 提交于 1月 30, 2019

One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.
Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

90952cd3

08 2月, 2019 1 次提交

kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974) · cfa39381

由 Jann Horn 提交于 1月 26, 2019

kvm_ioctl_create_device() does the following:

1. creates a device that holds a reference to the VM object (with a borrowed
   reference, the VM's refcount has not been bumped yet)
2. initializes the device
3. transfers the reference to the device to the caller's file descriptor table
4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
   reference

The ownership transfer in step 3 must not happen before the reference to the VM
becomes a proper, non-borrowed reference, which only happens in step 4.
After step 3, an attacker can close the file descriptor and drop the borrowed
reference, which can cause the refcount of the kvm object to drop to zero.

This means that we need to grab a reference for the device before
anon_inode_getfd(), otherwise the VM can disappear from under us.

Fixes: 852b6d57 ("kvm: add device control API")
Cc: stable@kernel.org
Signed-off-by: NJann Horn <jannh@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cfa39381

26 1月, 2019 1 次提交

virt/kvm: Replace spin_is_locked() with lockdep · 6706dae9

由 Paul E. McKenney 提交于 1月 08, 2019

lockdep_assert_held() is better suited to checking locking requirements,
since it only checks if the current thread holds the lock regardless of
whether someone else does. This is also a step towards possibly removing
spin_is_locked().
Signed-off-by: NPaul E. McKenney <paulmck@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: <kvm@vger.kernel.org>
Acked-by: NPaolo Bonzini <pbonzini@redhat.com>

6706dae9

12 1月, 2019 1 次提交

KVM: validate userspace input in kvm_clear_dirty_log_protect() · 98938aa8

由 Tomas Bortoli 提交于 1月 02, 2019

The function at issue does not fully validate the content of the
structure pointed by the log parameter, though its content has just been
copied from userspace and lacks validation. Fix that.

Moreover, change the type of n to unsigned long as that is the type
returned by kvm_dirty_bitmap_bytes().
Signed-off-by: NTomas Bortoli <tomasbortoli@gmail.com>
Reported-by: syzbot+028366e52c9ace67deb3@syzkaller.appspotmail.com
[Squashed the fix from Paolo. - Radim.]
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

98938aa8

04 1月, 2019 1 次提交

Remove 'type' argument from access_ok() function · 96d4f267

由 Linus Torvalds 提交于 1月 03, 2019

Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access.  But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model.  And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

 - csky still had the old "verify_area()" name as an alias.

 - the iter_iov code had magical hardcoded knowledge of the actual
   values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
   really used it)

 - microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something.  Any missed conversion should be trivially fixable, though.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

96d4f267

29 12月, 2018 1 次提交

mm/mmu_notifier: use structure for invalidate_range_start/end callback · 5d6527a7

由 Jérôme Glisse 提交于 12月 28, 2018

Patch series "mmu notifier contextual informations", v2.

This patchset adds contextual information, why an invalidation is
happening, to mmu notifier callback.  This is necessary for user of mmu
notifier that wish to maintains their own data structure without having to
add new fields to struct vm_area_struct (vma).

For instance device can have they own page table that mirror the process
address space.  When a vma is unmap (munmap() syscall) the device driver
can free the device page table for the range.

Today we do not have any information on why a mmu notifier call back is
happening and thus device driver have to assume that it is always an
munmap().  This is inefficient at it means that it needs to re-allocate
device page table on next page fault and rebuild the whole device driver
data structure for the range.

Other use case beside munmap() also exist, for instance it is pointless
for device driver to invalidate the device page table when the
invalidation is for the soft dirtyness tracking.  Or device driver can
optimize away mprotect() that change the page table permission access for
the range.

This patchset enables all this optimizations for device drivers.  I do not
include any of those in this series but another patchset I am posting will
leverage this.

The patchset is pretty simple from a code point of view.  The first two
patches consolidate all mmu notifier arguments into a struct so that it is
easier to add/change arguments.  The last patch adds the contextual
information (munmap, protection, soft dirty, clear, ...).

This patch (of 3):

To avoid having to change many callback definition everytime we want to
add a parameter use a structure to group all parameters for the
mmu_notifier invalidate_range_start/end callback.  No functional changes
with this patch.

[akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
Acked-by: NJan Kara <jack@suse.cz>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>	[infiniband]
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5d6527a7

21 12月, 2018 3 次提交

KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to kvm_mmu_notifier_change_pte() · 0cf853c5

由 Lan Tianyu 提交于 12月 06, 2018

This patch is to move tlb flush in kvm_set_pte_rmapp() to
kvm_mmu_notifier_change_pte() in order to avoid redundant tlb flush.
Signed-off-by: NLan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0cf853c5

kvm: Change offset in kvm_write_guest_offset_cached to unsigned · 7a86dab8

由 Jim Mattson 提交于 12月 14, 2018

Since the offset is added directly to the hva from the
gfn_to_hva_cache, a negative offset could result in an out of bounds
write. The existing BUG_ON only checks for addresses beyond the end of
the gfn_to_hva_cache, not for addresses before the start of the
gfn_to_hva_cache.

Note that all current call sites have non-negative offsets.

Fixes: 4ec6e863 ("kvm: Introduce kvm_write_guest_offset_cached()")
Reported-by: NCfir Cohen <cfir@google.com>
Signed-off-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NCfir Cohen <cfir@google.com>
Reviewed-by: NPeter Shier <pshier@google.com>
Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

7a86dab8

kvm: Disallow wraparound in kvm_gfn_to_hva_cache_init · f1b9dd5e

由 Jim Mattson 提交于 12月 17, 2018

Previously, in the case where (gpa + len) wrapped around, the entire
region was not validated, as the comment claimed. It doesn't actually
seem that wraparound should be allowed here at all.

Furthermore, since some callers don't check the return code from this
function, it seems prudent to clear ghc->memslot in the event of an
error.

Fixes: 8f964525 ("KVM: Allow cross page reads and writes from cached translations.")
Reported-by: NCfir Cohen <cfir@google.com>
Signed-off-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NCfir Cohen <cfir@google.com>
Reviewed-by: NMarc Orr <marcorr@google.com>
Cc: Andrew Honig <ahonig@google.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

f1b9dd5e

14 12月, 2018 3 次提交

kvm: introduce manual dirty log reprotect · 2a31b9db

由 Paolo Bonzini 提交于 10月 23, 2018

There are two problems with KVM_GET_DIRTY_LOG.  First, and less important,
it can take kvm->mmu_lock for an extended period of time.  Second, its user
can actually see many false positives in some cases.  The latter is due
to a benign race like this:

  1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
     them.
  2. The guest modifies the pages, causing them to be marked ditry.
  3. Userspace actually copies the pages.
  4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
     they were not written to since (3).

This is especially a problem for large guests, where the time between
(1) and (3) can be substantial.  This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns.  Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page.  The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2a31b9db

kvm: rename last argument to kvm_get_dirty_log_protect · 8fe65a82

由 Paolo Bonzini 提交于 10月 23, 2018

When manual dirty log reprotect will be enabled, kvm_get_dirty_log_protect's
pointer argument will always be false on exit, because no TLB flush is needed
until the manual re-protection operation. Rename it from "is_dirty" to "flush",
which more accurately tells the caller what they have to do with it.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8fe65a82

kvm: make KVM_CAP_ENABLE_CAP_VM architecture agnostic · e5d83c74

由 Paolo Bonzini 提交于 2月 16, 2017

The first such capability to be handled in virt/kvm/ will be manual
dirty page reprotection.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e5d83c74

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功