- 17 December 2022 (11 commits)
-
-
Committed by Jay Zhou

mainline inclusion
from mainline-v5.10
commit 3c9bd400
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=3c9bd4006bfc2dccda1823db61b3f470ef91cfaa

--------------------------------

It could take kvm->mmu_lock for an extended period of time when enabling dirty log for the first time. The main cost is to clear all the D-bits of last level SPTEs. This situation can benefit from manual dirty log protect as well, which can reduce the mmu_lock time taken. The sequence is like this:

1. Initialize all the bits of the dirty bitmap to 1 when enabling dirty log for the first time
2. Only write protect the huge pages
3. KVM_GET_DIRTY_LOG returns the dirty bitmap info
4. KVM_CLEAR_DIRTY_LOG will clear the D-bit for each of the leaf level SPTEs gradually in small chunks

Under the Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz environment, I did some tests with a 128G windows VM and counted the time taken by memory_global_dirty_log_start; here are the numbers:

    VM Size    Before    After optimization
    128G       460ms     10ms

Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Remove test code because the test code is too different.]
Signed-off-by: zhengchuan <zhengchuan@huawei.com>
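A minimal userspace sketch of how a VMM could opt in to this behavior; vm_fd creation and error handling around it are assumed, and the capability/flag names are the ones this series defines in the uapi headers:

    /*
     * Hedged sketch, not part of the patch: enable manual dirty log
     * protect together with the "initially all set" optimization on an
     * already-created VM fd.
     */
    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    static int enable_dirty_log_initially_set(int vm_fd)
    {
            struct kvm_enable_cap cap = {
                    .cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
                    .args = { KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
                              KVM_DIRTY_LOG_INITIALLY_SET },
            };

            /*
             * Afterwards the first KVM_GET_DIRTY_LOG reports every page as
             * dirty, and KVM_CLEAR_DIRTY_LOG clears D-bits chunk by chunk
             * (steps 1-4 above).
             */
            return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
    }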
-
Committed by Peter Xu

mainline inclusion
from mainline-v5.10
commit d7547c55
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=d7547c55cbe7471255ca51f14bcd4699f5eaabe5

--------------------------------

The previous KVM_CAP_MANUAL_DIRTY_LOG_PROTECT has some problems which block the correct usage from userspace. Obsolete the old one and introduce a new capability bit for it.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Remove test code because the test code is too different.]
Signed-off-by: zhengchuan <zhengchuan@huawei.com>
-
Committed by Peter Xu

mainline inclusion
from mainline-v5.10
commit 53eac7a8
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=53eac7a8f8cf3d7dc5ecac1946f31442f5eee5f3

--------------------------------

Just imagine the case where num_pages < BITS_PER_LONG: the loop would be skipped when it shouldn't be.

Signed-off-by: Peter Xu <peterx@redhat.com>
Fixes: 2a31b9db
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Peter Xu

mainline inclusion
from mainline-v5.10
commit 4ddc9204
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=4ddc9204572c33f2eb91fbdb1d99d8078388b67d

--------------------------------

kvm_dirty_bitmap_bytes() will return the size of the dirty bitmap of the memslot rather than the size of the bitmap passed over from the ioctl. Here, for KVM_CLEAR_DIRTY_LOG, we should only copy exactly the size of bitmap that covers kvm_clear_dirty_log.num_pages.

Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 2a31b9db
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
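Taken together, the two rounding fixes above converge on the arithmetic below. This is a standalone illustration, not kernel code, under the assumption that BITS_PER_LONG is 64:

    #include <stdio.h>

    #define BITS_PER_LONG 64
    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
    #define ALIGN(x, a)        (DIV_ROUND_UP(x, a) * (a))

    int main(void)
    {
            unsigned long num_pages = 32;   /* smaller than BITS_PER_LONG */

            /* Buggy word count: 32 / 64 == 0, so the clear loop ran zero times. */
            printf("buggy loop words: %lu\n", num_pages / BITS_PER_LONG);

            /* Fixed word count, and the byte count to copy from userspace. */
            printf("fixed loop words: %lu\n",
                   DIV_ROUND_UP(num_pages, BITS_PER_LONG));
            printf("bytes to copy:    %lu\n",
                   ALIGN(num_pages, BITS_PER_LONG) / 8);
            return 0;
    }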
-
Committed by Jiang Biao

mainline inclusion
from mainline-v5.10
commit b8b00220
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=b8b002209c061273fd1ef7bb3c3c32301623a282

--------------------------------

is_dirty has been renamed to flush, but the comment for it is outdated. And the description of the @flush parameter for kvm_clear_dirty_log_protect() is missing; add it in this patch as well.

Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Paolo Bonzini

mainline inclusion
from mainline-v5.1
commit 76d58e0f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=76d58e0f07ec203bbdfcaabd9a9fc10a5a3ed5ea

--------------------------------

If a memory slot's size is not a multiple of 64 pages (256K), then the KVM_CLEAR_DIRTY_LOG API is unusable: clearing the final 64 pages either requires the requested page range to go beyond memslot->npages, or requires log->num_pages to be unaligned, and kvm_clear_dirty_log_protect requires log->num_pages to be both in range and aligned.

To allow this case, allow log->num_pages not to be a multiple of 64 if it ends exactly on the last page of the slot.

Reported-by: Peter Xu <peterx@redhat.com>
Fixes: 98938aa8 ("KVM: validate userspace input in kvm_clear_dirty_log_protect()", 2019-01-02)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Remove test code because the test code is too different.]
Signed-off-by: zhengchuan <zhengchuan@huawei.com>
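A hedged sketch of the relaxed validation, simplified from kvm_clear_dirty_log_protect(); the exact layout differs between kernel versions:

    /* first_page must always be 64-page aligned. */
    if (log->first_page & 63)
            return -EINVAL;

    /*
     * An unaligned num_pages is accepted only when the requested range
     * ends exactly on the last page of the slot.
     */
    if (log->first_page > memslot->npages ||
        log->num_pages > memslot->npages - log->first_page ||
        (log->num_pages < memslot->npages - log->first_page &&
         (log->num_pages & 63)))
            return -EINVAL;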
-
Committed by Lan Tianyu

mainline inclusion
from mainline-v5.1
commit a67794ca
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=a67794cafbc4594debf53dbe4e2a7708426f492e

--------------------------------

The value of "dirty_bitmap[i]" is already checked before setting its value to mask. The following check of "mask" is redundant. The check of "mask" was introduced by commit 58d2930f ("KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"); revert it.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Tomas Bortoli

mainline inclusion
from mainline-v5.0
commit 98938aa8
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=98938aa8edd66dc95024d7c936a4bc315f6615ff

--------------------------------

The function at issue does not fully validate the content of the structure pointed to by the log parameter, even though its content has just been copied from userspace and lacks validation. Fix that.

Moreover, change the type of n to unsigned long, as that is the type returned by kvm_dirty_bitmap_bytes().

Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Reported-by: syzbot+028366e52c9ace67deb3@syzkaller.appspotmail.com
[Squashed the fix from Paolo. - Radim.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
Committed by Paolo Bonzini

mainline inclusion
from mainline-v5.0
commit 2a31b9db
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2a31b9db153530df4aa02dac8c32837bf5f47019

--------------------------------

There are two problems with KVM_GET_DIRTY_LOG. First, and less important, it can take kvm->mmu_lock for an extended period of time. Second, its user can actually see many false positives in some cases. The latter is due to a benign race like this:

1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects them.
2. The guest modifies the pages, causing them to be marked dirty.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though they were not written to since (3).

This is especially a problem for large guests, where the time between (1) and (3) can be substantial. This patch introduces a new capability which, when enabled, makes KVM_GET_DIRTY_LOG not write-protect the pages it returns. Instead, userspace has to explicitly clear the dirty log bits just before using the content of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a 64-page granularity rather than requiring a full memslot to be synced; this way, the mmu_lock is taken for small amounts of time, and only a small amount of time will pass between write protection of pages and the sending of their content.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Remove test code because the test code is too different.]
Signed-off-by: zhengchuan <zhengchuan@huawei.com>
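A hedged sketch of the userspace pattern this enables; slot, chunk, bitmap and vm_fd are placeholders, and the surrounding copy-out loop is assumed:

    /*
     * Fetch the dirty bitmap without write-protecting anything, then
     * re-protect just the 64-page chunk we are about to copy out.
     */
    struct kvm_dirty_log get = {
            .slot = slot,
            .dirty_bitmap = bitmap,
    };
    ioctl(vm_fd, KVM_GET_DIRTY_LOG, &get);

    /* ... copy out the guest pages covered by one 64-page chunk ... */

    struct kvm_clear_dirty_log clear = {
            .slot = slot,
            .first_page = chunk * 64,   /* must be 64-page aligned */
            .num_pages = 64,
            .dirty_bitmap = bitmap,     /* bits to clear, relative to first_page */
    };
    ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);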
-
Committed by Paolo Bonzini

mainline inclusion
from mainline-v5.0
commit 8fe65a82
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=8fe65a8299f9e1f40cb95308ab7b3c4ad80bf801

--------------------------------

Once manual dirty log reprotect is enabled, kvm_get_dirty_log_protect's pointer argument will always be false on exit, because no TLB flush is needed until the manual re-protection operation. Rename it from "is_dirty" to "flush", which more accurately tells the caller what they have to do with it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Paolo Bonzini

mainline inclusion
from mainline-v5.0
commit e5d83c74
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I66COX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=e5d83c74a5800c2a1fa3ba982c1c4b2b39ae6db2

--------------------------------

The first such capability to be handled in virt/kvm/ will be manual dirty page reprotection.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 30 January 2022 (3 commits)
-
-
Committed by Sean Christopherson

mainline inclusion
from mainline-v5.1-rc1
commit 164bf7e5
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4MKP4
CVE: NA

--------------------------------

...now that KVM won't explode by moving it out of bit 0. Using bit 63 eliminates the need to jump over bit 0, e.g. when calculating a new memslots generation or when propagating the memslots generation to an MMIO spte.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> #openEuler_contributor
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Sean Christopherson

mainline inclusion
from mainline-v5.1-rc1
commit 0e32958e
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4MKP4
CVE: NA

--------------------------------

x86 captures a subset of the memslot generation (19 bits) in its MMIO sptes so that it can expedite emulated MMIO handling by checking only the relevant spte, i.e. it doesn't need to do a full page fault walk.

Because the MMIO sptes capture only 19 bits (due to limited space in the sptes), there is a non-zero probability that the MMIO generation could wrap, e.g. after 500k memslot updates. Since normal usage is extremely unlikely to result in 500k memslot updates, a hack was added by commit 69c9ea93 ("KVM: MMU: init kvm generation close to mmio wrap-around value") to offset the MMIO generation in order to trigger a wraparound, e.g. after 150 memslot updates.

When separate memslot generation sequences were assigned to each address space, commit 00f034a1 ("KVM: do not bias the generation number in kvm_current_mmio_generation") moved the offset logic into the initialization of the memslot generation itself so that the per-address space bit(s) were not dropped/corrupted by the MMIO shenanigans.

Remove the offset hack for three reasons:

- While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply wrapping the generation doesn't actually test the interesting case of having stale MMIO sptes with the new generation number, e.g. old sptes with a generation number of 0.

- Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its performance rather important since the probability of invalidating MMIO sptes jumps from "effectively never" to "fairly likely". This limits what can be done in future patches, e.g. to simplify the invalidation code, as doing so without proper caution could lead to a noticeable performance regression.

- Forcing the memslots generation, which is a 64-bit number, to wrap prevents KVM from assuming the memslots generation will never wrap. This in turn prevents KVM from using an arbitrary bit for the "update in-progress" flag, e.g. using bit 63 would immediately collide with using a large value as the starting generation number. The "update in-progress" flag is effectively forced into bit 0 so that it's (subtly) taken into account when incrementing the generation.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> #openEuler_contributor
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Sean Christopherson

mainline inclusion
from mainline-v5.1-rc1
commit 361209e0
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4MKP4
CVE: NA

--------------------------------

KVM uses bit 0 of the memslots generation as an "update in-progress" flag, which is used by x86 to prevent caching MMIO access while the memslots are changing. Although the intended behavior is flag-like, e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid caching data from in-flux memslots, the implementation oftentimes treats the bit as part of the generation number itself, e.g. incrementing the generation increments twice, once to set the flag and once to clear it.

Prior to commit 4bd518f1 ("KVM: use separate generations for each address space"), incorporating the "update in-progress" bit into the generation number largely made sense, e.g. "real" generations are even, "bogus" generations are odd, most code doesn't need to be aware of the bit, etc...

Now that unique memslots generation numbers are assigned to each address space, stealthing the in-progress status into the generation number results in a wide variety of subtle code, e.g. kvm_create_vm() jumps over bit 0 when initializing the memslots generation without any hint as to why.

Explicitly define the flag and convert as much code as possible (which isn't much) to actually treat it like a flag. This paves the way for eventually using a different bit for "update in-progress" so that it can be a flag in truth instead of an awkward extension to the generation number.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> #openEuler_contributor
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
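A hedged, simplified sketch of the flag-like usage the three patches above move toward; this is not a verbatim extract of install_new_memslots(), and the surrounding kernel context is assumed:

    /* Bit 0 here; the bit-63 patch earlier in this series relocates it. */
    #define KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS BIT_ULL(0)

    u64 gen = old_memslots->generation & ~KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;

    /*
     * Publish the new memslots with the in-progress flag raised so that
     * consumers (e.g. x86's MMIO spte cache) refuse to cache state derived
     * from an in-flux memslot layout.
     */
    slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
    rcu_assign_pointer(kvm->memslots[as_id], slots);
    synchronize_srcu_expedited(&kvm->srcu);

    /* Drop the flag and advance the generation proper. */
    slots->generation = gen + 1;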
-
- 22 September 2021 (1 commit)
-
-
Committed by Nicholas Piggin

stable inclusion
from linux-4.19.199
commit 117777467bc015f0dc5fc079eeba0fa80c965149
CVE: CVE-2021-22543

--------------------------------

commit f8be156b upstream.

It's possible to create a region which maps valid but non-refcounted pages (e.g., tail pages of non-compound higher order allocations). These host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc., family of APIs, which take a reference to the page, taking its refcount from 0 to 1. When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on valid pages if it was non-zero, which indicates the page is participating in normal refcounting (and can be released with put_page).

This addresses CVE-2021-22543.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
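A hedged sketch of the core of the fix, modeled on the helper the upstream patch adds (kernel context assumed):

    /*
     * Take a reference only on pages that are already refcounted;
     * get_page_unless_zero() rejects the 0 -> 1 transition, so a page
     * that is not participating in refcounting is never "freed" later
     * by a mistaken put_page().
     */
    static int kvm_try_get_pfn(kvm_pfn_t pfn)
    {
            if (kvm_is_reserved_pfn(pfn))
                    return 1;

            return get_page_unless_zero(pfn_to_page(pfn));
    }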
-
- 01 June 2021 (1 commit)
-
-
Committed by Christian Borntraeger

mainline inclusion
from mainline-5.2
commit cdd6ad3a
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=34
CVE: NA

There are cases where halt polling is unwanted. For example, when running KVM on an overcommitted LPAR, we would rather give the CPU back to neighbour LPARs instead of polling. Let us provide a callback that allows architectures to disable polling.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Yubo Miao <miaoyubo@huawei.com>
Signed-off-by: Xiangyou Xie <xiexiangyou@huawei.com>
Reviewed-by: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jiajun Chen <chenjiajun8@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
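A hedged sketch of the callback, mirroring the upstream weak default and its use in the halt-polling path (simplified; kernel context assumed):

    /* Architectures may override this, e.g. s390 on an overcommitted LPAR. */
    bool __weak kvm_arch_no_poll(struct kvm_vcpu *vcpu)
    {
            return false;
    }

    void kvm_vcpu_block(struct kvm_vcpu *vcpu)
    {
            if (vcpu->halt_poll_ns && !kvm_arch_no_poll(vcpu)) {
                    /* ... poll for a wakeup before scheduling out ... */
            }
            /* ... */
    }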
-
- 22 April 2021 (3 commits)
-
-
Committed by Wanpeng Li

mainline inclusion
from mainline-v5.8-rc5
commit 046ddeed
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=35
CVE: NA

--------------------------------

preempted_in_kernel is updated in the preempt_notifier when involuntary preemption occurs; it can be stale when the voluntarily preempted vCPUs are taken into account by the kvm_vcpu_on_spin() loop. This patch lets it just check preempted_in_kernel for involuntary preemption.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Chaochao Xing <xingchaochao@huawei.com>
Reviewed-by: Zengruan Ye <yezengruan@huawei.com>
Reviewed-by: Xiangyou Xie <xiexiangyou@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Committed by Wanpeng Li

mainline inclusion
from mainline-v5.8-rc5
commit d73eb57b
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=35
CVE: NA

--------------------------------

Inspired by commit 9cac38dd (KVM/s390: Set preempted flag during vcpu wakeup and interrupt delivery), we want to boost not just lock holders but also vCPUs that are delivering interrupts. Most smp_call_function_many calls are synchronous, so the IPI target vCPUs are also good yield candidates.

This patch introduces vcpu->ready to boost vCPUs during wakeup and interrupt delivery time; unlike s390, we do not reuse vcpu->preempted, so that voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected (VT-d PI handles voluntary preemption separately, in pi_pre_block).

Testing on an 80 HT 2 socket Xeon Skylake server, with an 80 vCPUs VM, 80GB RAM:

    ebizzy -M
          vanilla    boosting    improved
    1VM   21443      23520       9%
    2VM   2800       8000        180%
    3VM   1800       3100        72%

Testing on my Haswell desktop 8 HT, with an 8 vCPUs VM, 8GB RAM, two VMs, one running ebizzy -M, the other running 'stress --cpu 2':

    w/ boosting + w/o pv sched yield(vanilla)
    vanilla    boosting    improved
    1570       4000        155%

    w/ boosting + w/ pv sched yield(vanilla)
    vanilla    boosting    improved
    1844       5157        179%

    w/o boosting, perf top in VM:
    72.33%  [kernel]       [k] smp_call_function_many
     4.22%  [kernel]       [k] call_function_i
     3.71%  [kernel]       [k] async_page_fault

    w/ boosting, perf top in VM:
    38.43%  [kernel]       [k] smp_call_function_many
     6.31%  [kernel]       [k] async_page_fault
     6.13%  libc-2.23.so   [.] __memcpy_avx_unaligned
     4.88%  [kernel]       [k] call_function_interrupt

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Chaochao Xing <xingchaochao@huawei.com>
Reviewed-by: Zengruan Ye <yezengruan@huawei.com>
Reviewed-by: Xiangyou Xie <xiexiangyou@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
Committed by chenjiajun

virt inclusion
category: feature
bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=33
CVE: NA

-------------------------------------------------

Export vcpu_stat via debugfs for x86, which contains x86 kvm exits items. The path of the vcpu_stat is /sys/kernel/debug/kvm/vcpu_stat, and each line of vcpu_stat is a collection of various kvm exits for a vcpu. Through vcpu_stat, we only need to open one file to tail the performance of a virtual machine, which is more convenient.

Signed-off-by: Feng Lin <linfeng23@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Reviewed-by: Xiangyou Xie <xiexiangyou@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
- 14 April 2021 (1 commit)
-
-
Committed by Rustam Kovhaev

stable inclusion
from linux-4.19.148
commit 19184bd06f488af62924ff1747614a8cb284ad63
CVE: CVE-2020-36312

--------------------------------

[ Upstream commit f6588660 ]

When kmalloc() fails in kvm_io_bus_unregister_dev(), before removing the bus we should iterate over all other devices linked to it and call kvm_iodevice_destructor() for them.

Fixes: 90db1043 ("KVM: kvm_io_bus_unregister_dev() should never fail")
Cc: stable@vger.kernel.org
Reported-and-tested-by: syzbot+f196caa45793d6374707@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=f196caa45793d6374707
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200907185535.233114-1-rkovhaev@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
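A hedged sketch of the failure path, simplified from the backported fix to kvm_io_bus_unregister_dev(); i is the index of the device being removed, and the rest of the function is assumed:

    new_bus = kmalloc(struct_size(bus, range, bus->dev_count - 1),
                      GFP_KERNEL);
    if (!new_bus) {
            pr_err("kvm: failed to shrink bus, removing it completely\n");
            /*
             * Shrinking the bus failed: it is going away entirely, so
             * destroy every remaining device instead of leaking them.
             */
            for (j = 0; j < bus->dev_count; j++) {
                    if (j == i)
                            continue;   /* the device already being removed */
                    kvm_iodevice_destructor(bus->range[j].dev);
            }
    }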
-
- 11 March 2021 (1 commit)
-
-
Committed by liubo

euleros inclusion
category: feature
feature: etmem
bugzilla: 49889

-------------------------------------------------

etmem, the memory vertical expansion technology, uses DRAM and new high-performance storage media to form multi-level memory storage. By grading the stored data, etmem migrates the classified cold storage data from the storage medium to the high-performance storage medium, so as to achieve the purpose of memory capacity expansion and memory cost reduction.

The etmem feature is mainly composed of two parts: etmem_scan and etmem_swap. This patch is mainly used to generate etmem_scan.ko. etmem_scan.ko is used to scan the virtual address of the target process and return the address access information to user mode for grading cold and hot pages.

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: yanxiaodan <yanxiaodan@huawei.com>
Signed-off-by: Feilong Lin <linfeilong@huawei.com>
Signed-off-by: geruijun <geruijun@huawei.com>
Signed-off-by: liubo <liubo254@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
-
- 29 October 2020 (2 commits)
-
-
Committed by chenjiajun

euleros inclusion
category: feature
bugzilla: NA
CVE: NA

This patch exports cpu time related items to vcpu_stat. It contains: steal, st_max, utime, stime, gtime.

The definitions of these items are:

steal:  cpu time the VCPU waits for the PCPU while it is servicing another VCPU
st_max: max scheduling delay
utime:  cpu time in userspace
stime:  cpu time in sys
gtime:  cpu time in guest

Through these items, users can derive much of the cpu usage info of a vcpu, such as:

CPU Usage of Guest   = gtime_delta / delta_cputime
CPU Usage of Hyp     = (utime_delta - gtime_delta + stime_delta) / delta_cputime
CPU Usage of Steal   = steal_delta / delta_cputime
Max Scheduling Delay = st_max

Signed-off-by: liangpeng <liangpeng10@huawei.com>
Signed-off-by: chenjiajun <chenjiajun8@huawei.com>
Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by chenjiajun

euleros inclusion
category: feature
bugzilla: NA
CVE: NA

This patch creates a debugfs entry for vcpu stat. The entry path is /sys/kernel/debug/kvm/vcpu_stat, and vcpu_stat contains partial kvm exits items of a vcpu, including: pid, hvc_exit_stat, wfe_exit_stat, wfi_exit_stat, mmio_exit_user, mmio_exit_kernel, exits.

Currently, the maximum vcpu limit is 1024.

From this vcpu_stat, users can get the number of these kvm exits items over a period of time, which is helpful to monitor the virtual machine.

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: chenjiajun <chenjiajun8@huawei.com>
Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
- 13 January 2020 (1 commit)
-
-
Committed by Marc Zyngier

mainline inclusion
from mainline-v5.4-rc1
commit 07ab0f8d
category: feature
feature: guest VLPI during the halt poll window can be immediately recognised

-------------------------------------------------

When a vcpu is about to block by calling kvm_vcpu_block, we call back into the arch code to allow any form of synchronization that may be required at this point (SVM stops the AVIC, ARM synchronises the VMCR and enables GICv4 doorbells). But this synchronization comes in quite late, as we've potentially waited for halt_poll_ns to expire.

Instead, let's move kvm_arch_vcpu_blocking() to the beginning of kvm_vcpu_block(), which on ARM has several benefits:

- VMCR gets synchronised early, meaning that any interrupt delivered during the polling window will be evaluated with the correct guest PMR
- GICv4 doorbells are enabled, which means that any guest interrupt directly injected during that window will be immediately recognised

Tang Nianyao ran some tests on a GICv4 machine to evaluate this change, and reported up to a 10% improvement for netperf:

<quote>
netperf result:
D06 as server, intel 8180 server as client

with change:
package 512 bytes - 5500 Mbits/s
package 64 bytes - 760 Mbits/s

without change:
package 512 bytes - 5000 Mbits/s
package 64 bytes - 710 Mbits/s
</quote>

Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
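A hedged sketch of the reordering, simplified from kvm_vcpu_block(); the polling loop and swait plumbing are elided:

    void kvm_vcpu_block(struct kvm_vcpu *vcpu)
    {
            /*
             * Moved up: arch synchronization (ARM syncs the VMCR and
             * enables GICv4 doorbells, x86/SVM stops the AVIC) now happens
             * before the halt_poll_ns polling window instead of after it.
             */
            kvm_arch_vcpu_blocking(vcpu);

            /* ... poll for up to halt_poll_ns, then schedule out ... */

            kvm_arch_vcpu_unblocking(vcpu);
    }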
-
- 27 December 2019 (14 commits)
-
-
Committed by Greg Kroah-Hartman

[ Upstream commit 8ed0579c ]

debugfs can now report an error code if something went wrong instead of just NULL. So if the return value is to be used as a "real" dentry, it needs to be checked if it is an error before dereferencing it.

This is now happening because of ff9fb72b ("debugfs: return error values, not NULL"). syzbot has found a way to trigger multiple debugfs files attempting to be created, which fails, and then the error code gets passed to dentry_path_raw(), which obviously does not like it.

Reported-by: Eric Biggers <ebiggers@kernel.org>
Reported-and-tested-by: syzbot+7857962b4d45e602b8ad@syzkaller.appspotmail.com
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
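A hedged sketch of the required pattern; the actual fix guards KVM's uses of kvm->debugfs_dentry, and buf is a placeholder:

    /*
     * Since ff9fb72b, debugfs creation helpers can return an ERR_PTR
     * rather than NULL, so check before treating the result as a dentry.
     */
    if (!IS_ERR_OR_NULL(kvm->debugfs_dentry)) {
            char *p = dentry_path_raw(kvm->debugfs_dentry, buf, sizeof(buf));
            /* ... only here is the path safe to use ... */
    }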
-
Committed by Sean Christopherson

commit a78986aa upstream.

Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and instead manually handle ZONE_DEVICE on a case-by-case basis. For things like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal pages, e.g. put pages grabbed via gup(). But for flows such as setting A/D bits or shifting refcounts for transparent huge pages, KVM needs to avoid processing ZONE_DEVICE pages as the flows in question lack the underlying machinery for proper handling of ZONE_DEVICE pages.

This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup() when running a KVM guest backed with /dev/dax memory, as KVM straight up doesn't put any references to ZONE_DEVICE pages acquired by gup().

Note, Dan Williams proposed an alternative solution of doing put_page() on ZONE_DEVICE pages immediately after gup() in order to simplify the auditing needed to ensure is_zone_device_page() is called if and only if the backing device is pinned (via gup()). But that approach would break kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til unmap() when accessing guest memory, unlike KVM's secondary MMU, which coordinates with mmu_notifier invalidations to avoid creating stale page references, i.e. doesn't rely on pages being pinned.

[*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

Reported-by: Adam Borowski <kilobyte@angband.pl>
Analyzed-by: David Hildenbrand <david@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: stable@vger.kernel.org
Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[sean: backport to 4.x; resolve conflict in mmu.c]
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
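A hedged sketch modeled on the helper this patch adds upstream (kernel context assumed; the upstream version also WARNs on a zero page_count):

    bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
    {
            /*
             * ZONE_DEVICE pages are never "reserved": they are refcounted
             * like normal pages, but lack the machinery for things such as
             * A/D bit tracking or THP refcount shifting.
             */
            if (WARN_ON_ONCE(!pfn_valid(pfn)))
                    return false;

            return is_zone_device_page(pfn_to_page(pfn));
    }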
-
Committed by Junaid Shahid

commit 1aa9b957 upstream.

The page table pages corresponding to broken down large pages are zapped in FIFO order, so that the large page can potentially be recovered, if it is no longer being used for execution. This removes the performance penalty for walking deeper EPT page tables.

By default, one large page will last about one hour once the guest reaches a steady state.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Junaid Shahid

commit c57c8046 upstream.

Add a function to create a kernel thread associated with a given VM. In particular, it ensures that the worker thread inherits the priority and cgroups of the calling thread.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Junaid Shahid

commit 0d9ce162 upstream.

It doesn't seem as if there is any particular need for kvm_lock to be a spinlock, so convert the lock to a mutex so that sleepable functions (in particular cond_resched()) can be called while holding it.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Paolo Bonzini

commit 833b45de upstream.

The largepages debugfs entry is incremented/decremented as shadow pages are created or destroyed. Clearing it will result in an underflow, which is harmless to KVM but ugly (and could be misinterpreted by tools that use debugfs information), so make this particular statistic read-only.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm-ppc@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Wanpeng Li

commit 17e433b5 upstream.

After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a five-year-old bug was exposed. Running the ebizzy benchmark in three 80-vCPU VMs on one 80-pCPU Skylake server, a lot of rcu_sched stall warnings splatted in the VMs after stress testing:

    INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
    Call Trace:
      flush_tlb_mm_range+0x68/0x140
      tlb_flush_mmu.part.75+0x37/0xe0
      tlb_finish_mmu+0x55/0x60
      zap_page_range+0x142/0x190
      SyS_madvise+0x3cd/0x9c0
      system_call_fastpath+0x1c/0x21

swait_active() remains true before finish_swait() is called in kvm_vcpu_block(). Because voluntarily preempted vCPUs are now taken into account by the kvm_vcpu_on_spin() loop, the condition kvm_arch_vcpu_runnable(vcpu) is checked far more often and can be true; when APICv is enabled, the yield-candidate vCPU's VMCS RVI field then leaks (via vmx_sync_pir_to_irr()) into the spinning-on-a-taken-lock vCPU's current VMCS.

This patch fixes it by conservatively checking a subset of events.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: stable@vger.kernel.org
Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Thomas Huth

commit a86cb413 upstream.

KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all architectures. However, on s390x, the amount of usable CPUs is determined during runtime - it depends on the features of the machine the code is running on. Since we are using the vcpu_id as an index into the SCA structures that are defined by the hardware (see e.g. the sca_add_vcpu() function), it is not only the amount of CPUs that is limited by the hardware, but also the range of IDs that we can use.

Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too. So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common code into the architecture specific code, and on s390x we have to return the same value here as for KVM_CAP_MAX_VCPUS.

This problem has been discovered with the kvm_create_max_vcpus selftest. With this change applied, the selftest now passes on s390x, too.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20190523164309.13345-9-thuth@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Paolo Bonzini

[ Upstream commit 1d487e9b ]

These were found with smatch, and then generalized when applicable.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Sean Christopherson

commit ddba9180 upstream.

KVM's API requires that ioctls be issued from the same process that created the VM. In other words, userspace can play games with a VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc., but only the creator can do anything useful. Explicitly reject device ioctls that are issued by a process other than the VM's creator, and update KVM's API documentation to extend its requirements to device ioctls.

Fixes: 852b6d57 ("kvm: add device control API")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
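A hedged sketch of the check, modeled on the upstream fix in kvm_device_ioctl() (dispatch details elided):

    static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
                                 unsigned long arg)
    {
            struct kvm_device *dev = filp->private_data;

            /* Only the process that created the VM may issue device ioctls. */
            if (dev->kvm->mm != current->mm)
                    return -EIO;

            /* ... dispatch SET/GET/HAS_DEVICE_ATTR as before ... */
    }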
-
Committed by Sean Christopherson

commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.

kvm_arch_memslots_updated() is at this point in time an x86-specific hook for handling MMIO generation wraparound. x86 stashes 19 bits of the memslots generation number in its MMIO sptes in order to avoid full page fault walks for repeat faults on emulated MMIO addresses. Because only 19 bits are used, wrapping the MMIO generation number is possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that the generation has changed so that it can invalidate all MMIO sptes in case the effective MMIO generation has wrapped, so as to avoid using a stale spte, e.g. a (very) old spte that was created with generation==0.

Given that the purpose of kvm_arch_memslots_updated() is to prevent consuming stale entries, it needs to be called before the new generation is propagated to memslots. Invalidating the MMIO sptes after updating memslots means that there is a window where a vCPU could dereference the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO spte that was created with (pre-wrap) generation==0.

Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Linus Torvalds

mainline inclusion
from mainline-5.0-rc1
commit 96d4f267
category: cleanup
bugzilla: 9284
CVE: NA

It's a cleanup patch that prepares for applying the CVE-2018-20669 patch 594cc251 ("make 'user_access_begin()' do 'access_ok()'").

-------------------------------------------------

Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range checking resulted in this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form. There were a couple of notable cases:

- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it)
- microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though.

Conflicts:
  drivers/media/v4l2-core/v4l2-compat-ioctl32.c
  drivers/infiniband/core/uverbs_main.c
  drivers/platform/goldfish/goldfish_pipe.c
  fs/namespace.c
  fs/select.c
  kernel/compat.c
  arch/powerpc/include/asm/uaccess.h
  arch/arm64/kernel/perf_callchain.c
  arch/arm64/include/asm/uaccess.h
  arch/ia64/kernel/signal.c
  arch/x86/entry/vsyscall/vsyscall_64.c

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[yyl: adjust context]
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
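A before/after illustration of the interface change; these are representative call sites, not taken verbatim from one file:

    /* Before: the first argument was a purely historical artifact. */
    if (!access_ok(VERIFY_WRITE, buf, len))
            return -EFAULT;

    /* After: the type argument is gone; only the range is checked. */
    if (!access_ok(buf, len))
            return -EFAULT;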
-
Committed by Jann Horn

commit cfa39381 upstream.

kvm_ioctl_create_device() does the following:

1. creates a device that holds a reference to the VM object (with a borrowed reference, the VM's refcount has not been bumped yet)
2. initializes the device
3. transfers the reference to the device to the caller's file descriptor table
4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real reference

The ownership transfer in step 3 must not happen before the reference to the VM becomes a proper, non-borrowed reference, which only happens in step 4. After step 3, an attacker can close the file descriptor and drop the borrowed reference, which can cause the refcount of the kvm object to drop to zero.

This means that we need to grab a reference for the device before anon_inode_getfd(); otherwise the VM can disappear from under us.

Fixes: 852b6d57 ("kvm: add device control API")
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
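A hedged sketch of the corrected ordering, simplified from kvm_ioctl_create_device() (list bookkeeping elided):

    /* Take the VM reference *before* the fd becomes visible to userspace. */
    kvm_get_kvm(kvm);
    ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev,
                           O_RDWR | O_CLOEXEC);
    if (ret < 0) {
            kvm_put_kvm(kvm);   /* undo on failure */
            ops->destroy(dev);
            return ret;
    }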
-
Committed by Jim Mattson

[ Upstream commit 7a86dab8 ]

Since the offset is added directly to the hva from the gfn_to_hva_cache, a negative offset could result in an out of bounds write. The existing BUG_ON only checks for addresses beyond the end of the gfn_to_hva_cache, not for addresses before the start of the gfn_to_hva_cache.

Note that all current call sites have non-negative offsets.

Fixes: 4ec6e863 ("kvm: Introduce kvm_write_guest_offset_cached()")
Reported-by: Cfir Cohen <cfir@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Cfir Cohen <cfir@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
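A hedged sketch of the kind of bounds check the fix strengthens; the exact upstream form may differ, and ghc is the gfn_to_hva_cache in question:

    /*
     * With an unsigned offset, a "negative" value becomes huge, so a
     * single upper-bound check on (len + offset) covers both ends of
     * the cached range.
     */
    if (WARN_ON_ONCE(len + offset > ghc->len))
            return -EINVAL;

    /* ... __copy_to_user((void __user *)ghc->hva + offset, data, len) ... */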
-
- 23 August 2018 (1 commit)
-
-
Committed by Michal Hocko

There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start, and that is a problem for the oom_reaper because it needs to guarantee forward progress, so it cannot depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet.

We can do much better, though. Even if mmu notifiers use sleepable locks, there is no reason to automatically assume those locks are held. Moreover, the majority of notifiers only care about a portion of the address space, and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW, which is harder to handle, and we have to bail out there.

This patch handles the low hanging fruit: __mmu_notifier_invalidate_range_start gets a blockable flag, and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continuing as long as we do not block down the call chain (see the sketch after this message).

I think we can improve that even further, because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem, and this is basically the same thing.

The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and then sets the hard limit to something really small. Then we are looking for a proper process tear down.

[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
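A hedged sketch of the new convention; this is a representative driver callback, not taken from one specific notifier, and my_ctx with its lock is a placeholder:

    static int my_invalidate_range_start(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end,
                                         bool blockable)
    {
            struct my_ctx *ctx = container_of(mn, struct my_ctx, mn);

            if (blockable)
                    mutex_lock(&ctx->lock);
            else if (!mutex_trylock(&ctx->lock))
                    return -EAGAIN;   /* tell the !blockable caller (oom_reaper) to retry */

            /* ... invalidate mappings overlapping [start, end) ... */

            mutex_unlock(&ctx->lock);
            return 0;
    }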
-
- 06 August 2018 (1 commit)
-
-
Committed by Paolo Bonzini

We are currently cutting hva_to_pfn_fast short if we do not want an immediate exit, which is represented by !async && !atomic. However, this is unnecessary, and __get_user_pages_fast is *much* faster because the regular get_user_pages takes pmd_lock/pte_lock. In fact, when many CPUs take a nested vmexit at the same time, the contention on those locks is visible, and this patch removes about 25% (compared to 4.18) from vmexit.flat on a 16 vCPU nested guest.

Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
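A hedged sketch of the resulting shape of the function, simplified from hva_to_pfn_fast(); the removed early return is noted in the comment:

    static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
                                bool *writable, kvm_pfn_t *pfn)
    {
            struct page *page[1];

            /*
             * The "if (!async && !atomic) return false;" early exit is gone:
             * fast GUP avoids pmd_lock/pte_lock, so it is worth trying
             * unconditionally. It still only works for writable mappings.
             */
            if (!(write_fault || writable))
                    return false;

            if (__get_user_pages_fast(addr, 1, 1, page) == 1) {
                    *pfn = page_to_pfn(page[0]);
                    if (writable)
                            *writable = true;
                    return true;
            }
            return false;
    }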
-