提交 · 0adf65f53aae86aa86d8dccada02890545de8938 · openeuler / Kernel

10 6月, 2020 1 次提交

mm: don't include asm/pgtable.h if linux/mm.h is already included · e31cf2f4

由 Mike Rapoport 提交于 6月 08, 2020

Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once.  For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

static inline unsigned long pmd_index(unsigned long address)
{
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g.  pte_alloc() and
pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.

The include statements in such cases are remove with a simple loop:

	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done
Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e31cf2f4

09 6月, 2020 1 次提交

mm/gup.c: convert to use get_user_{page|pages}_fast_only() · dadbb612

由 Souptick Joarder 提交于 6月 07, 2020

API __get_user_pages_fast() renamed to get_user_pages_fast_only() to
align with pin_user_pages_fast_only().

As part of this we will get rid of write parameter.  Instead caller will
pass FOLL_WRITE to get_user_pages_fast_only().  This will not change any
existing functionality of the API.

All the callers are changed to pass FOLL_WRITE.

Also introduce get_user_page_fast_only(), and use it in a few places
that hard-code nr_pages to 1.

Updated the documentation of the API.
Signed-off-by: NSouptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>		[arch/powerpc/kvm]
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Suchanek <msuchanek@suse.de>
Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dadbb612

01 6月, 2020 4 次提交

KVM: check userspace_addr for all memslots · 09d952c9

由 Paolo Bonzini 提交于 6月 01, 2020

The userspace_addr alignment and range checks are not performed for private
memory slots that are prepared by KVM itself. This is unnecessary and makes
it questionable to use __*_user functions to access memory later on. We also
rely on the userspace address being aligned since we have an entire family
of functions to map gfn to pfn.

Fortunately skipping the check is completely unnecessary. Only x86 uses
private memslots and their userspace_addr is obtained from vm_mmap,
therefore it must be below PAGE_OFFSET. In fact, any attempt to pass
an address above PAGE_OFFSET would have failed because such an address
would return true for kvm_is_error_hva.
Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

09d952c9

KVM: introduce kvm_read_guest_offset_cached() · 0958f0ce

由 Vitaly Kuznetsov 提交于 5月 25, 2020

We already have kvm_write_guest_offset_cached(), introduce read analogue.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-5-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0958f0ce

Revert "KVM: No need to retry for hva_to_pfn_remapped()" · a8387d0b

由 Paolo Bonzini 提交于 5月 29, 2020

This reverts commit 5b494aea.
If unlocked==true then the vma pointer could be invalidated, so the 2nd
follow_pfn() is potentially racy: we do need to get out and redo
find_vma_intersection().
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a8387d0b

KVM: check userspace_addr for all memslots · 45f08f4c

由 Paolo Bonzini 提交于 6月 01, 2020

45f08f4c

16 5月, 2020 4 次提交

KVM: Fix spelling in code comments · 656012c7

由 Fuad Tabba 提交于 4月 01, 2020

Fix spelling and typos (e.g., repeated words) in comments.
Signed-off-by: NFuad Tabba <tabba@google.com>
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200401140310.29701-1-tabba@google.com

656012c7

kvm: add halt-polling cpu usage stats · cb953129

由 David Matlack 提交于 5月 08, 2020

Two new stats for exposing halt-polling cpu usage:
halt_poll_success_ns
halt_poll_fail_ns

Thus sum of these 2 stats is the total cpu time spent polling. "success"
means the VCPU polled until a virtual interrupt was delivered. "fail"
means the VCPU had to schedule out (either because the maximum poll time
was reached or it needed to yield the CPU).

To avoid touching every arch's kvm_vcpu_stat struct, only update and
export halt-polling cpu usage stats if we're on x86.

Exporting cpu usage as a u64 and in nanoseconds means we will overflow at
~500 years, which seems reasonably large.
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Signed-off-by: NJon Cargille <jcargill@google.com>
Reviewed-by: NJim Mattson <jmattson@google.com>

Message-Id: <20200508182240.68440-1-jcargill@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cb953129

KVM: VMX: Optimize posted-interrupt delivery for timer fastpath · 379a3c8e

由 Wanpeng Li 提交于 4月 28, 2020

While optimizing posted-interrupt delivery especially for the timer
fastpath scenario, I measured kvm_x86_ops.deliver_posted_interrupt()
to introduce substantial latency because the processor has to perform
all vmentry tasks, ack the posted interrupt notification vector,
read the posted-interrupt descriptor etc.

This is not only slow, it is also unnecessary when delivering an
interrupt to the current CPU (as is the case for the LAPIC timer) because
PIR->IRR and IRR->RVI synchronization is already performed on vmentry
Therefore skip kvm_vcpu_trigger_posted_interrupt in this case, and
instead do vmx_sync_pir_to_irr() on the EXIT_FASTPATH_REENTER_GUEST
fastpath as well.
Tested-by: NHaiwei Li <lihaiwei@tencent.com>
Cc: Haiwei Li <lihaiwei@tencent.com>
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Message-Id: <1588055009-12677-6-git-send-email-wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

379a3c8e

KVM: No need to retry for hva_to_pfn_remapped() · 5b494aea

由 Peter Xu 提交于 4月 16, 2020

hva_to_pfn_remapped() calls fixup_user_fault(), which has already
handled the retry gracefully.  Even if "unlocked" is set to true, it
means that we've got a VM_FAULT_RETRY inside fixup_user_fault(),
however the page fault has already retried and we should have the pfn
set correctly.  No need to do that again.
Signed-off-by: NPeter Xu <peterx@redhat.com>
Message-Id: <20200416155906.267462-1-peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5b494aea

14 5月, 2020 1 次提交

kvm: Replace vcpu->swait with rcuwait · da4ad88c

由 Davidlohr Bueso 提交于 4月 23, 2020

The use of any sort of waitqueue (simple or regular) for
wait/waking vcpus has always been an overkill and semantically
wrong. Because this is per-vcpu (which is blocked) there is
only ever a single waiting vcpu, thus no need for any sort of
queue.

As such, make use of the rcuwait primitive, with the following
considerations:

  - rcuwait already provides the proper barriers that serialize
  concurrent waiter and waker.

  - Task wakeup is done in rcu read critical region, with a
  stable task pointer.

  - Because there is no concurrency among waiters, we need
  not worry about rcuwait_wait_event() calls corrupting
  the wait->task. As a consequence, this saves the locking
  done in swait when modifying the queue. This also applies
  to per-vcore wait for powerpc kvm-hv.

The x86 tscdeadline_latency test mentioned in 8577370f
("KVM: Use simple waitqueue for vcpu->wq") shows that, on avg,
latency is reduced by around 15-20% with this change.

Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: kvmarm@lists.cs.columbia.edu
Cc: linux-mips@vger.kernel.org
Reviewed-by: NMarc Zyngier <maz@kernel.org>
Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
Message-Id: <20200424054837.5138-6-dave@stgolabs.net>
[Avoid extra logic changes. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

da4ad88c

08 5月, 2020 1 次提交

KVM: Introduce kvm_make_all_cpus_request_except() · 54163a34

由 Suravee Suthikulpanit 提交于 5月 06, 2020

This allows making request to all other vcpus except the one
specified in the parameter.
Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Message-Id: <1588771076-73790-2-git-send-email-suravee.suthikulpanit@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

54163a34

25 4月, 2020 1 次提交

kvm: add capability for halt polling · acd05785

由 David Matlack 提交于 4月 17, 2020

KVM_CAP_HALT_POLL is a per-VM capability that lets userspace
control the halt-polling time, allowing halt-polling to be tuned or
disabled on particular VMs.

With dynamic halt-polling, a VM's VCPUs can poll from anywhere from
[0, halt_poll_ns] on each halt. KVM_CAP_HALT_POLL sets the
upper limit on the poll time.
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Signed-off-by: NJon Cargille <jcargill@google.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Message-Id: <20200417221446.108733-1-jcargill@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

acd05785

21 4月, 2020 4 次提交

KVM: SVM: avoid infinite loop on NPF from bad address · e72436bc

由 Paolo Bonzini 提交于 4月 17, 2020

When a nested page fault is taken from an address that does not have
a memslot associated to it, kvm_mmu_do_page_fault returns RET_PF_EMULATE
(via mmu_set_spte) and kvm_mmu_page_fault then invokes svm_need_emulation_on_page_fault.

The default answer there is to return false, but in this case this just
causes the page fault to be retried ad libitum.  Since this is not a
fast path, and the only other case where it is taken is an erratum,
just stick a kvm_vcpu_gfn_to_memslot check in there to detect the
common case where the erratum is not happening.

This fixes an infinite loop in the new set_memory_region_test.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e72436bc

KVM: Remove redundant argument to kvm_arch_vcpu_ioctl_run · 1b94f6f8

由 Tianjia Zhang 提交于 4月 16, 2020

In earlier versions of kvm, 'kvm_run' was an independent structure
and was not included in the vcpu structure. At present, 'kvm_run'
is already included in the vcpu structure, so the parameter
'kvm_run' is redundant.

This patch simplifies the function definition, removes the extra
'kvm_run' parameter, and extracts it from the 'kvm_vcpu' structure
if necessary.
Signed-off-by: NTianjia Zhang <tianjia.zhang@linux.alibaba.com>
Message-Id: <20200416051057.26526-1-tianjia.zhang@linux.alibaba.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1b94f6f8

KVM: x86/mmu: Avoid an extra memslot lookup in try_async_pf() for L2 · c36b7150

由 Paolo Bonzini 提交于 4月 16, 2020

Create a new function kvm_is_visible_memslot() and use it from
kvm_is_visible_gfn(); use the new function in try_async_pf() too,
to avoid an extra memslot lookup.

Opportunistically squish a multi-line comment into a single-line comment.

Note, the end result, KVM_PFN_NOSLOT, is unchanged.

Cc: Jim Mattson <jmattson@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c36b7150

KVM: x86: move kvm_create_vcpu_debugfs after last failure point · 63d04348

由 Paolo Bonzini 提交于 4月 01, 2020

The placement of kvm_create_vcpu_debugfs is more or less irrelevant, since
it cannot fail and userspace should not care about the debugfs entries until
it knows the vcpu has been created. Moving it after the last failure
point removes the need to remove the directory when unwinding the creation.
Reviewed-by: NEmanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20200331224222.393439-1-pbonzini@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

63d04348

16 4月, 2020 1 次提交

KVM: remove redundant assignment to variable r · 788109c1

由 Colin Ian King 提交于 4月 10, 2020

The variable r is being assigned  with a value that is never read
and it is being updated later with a new value.  The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Message-Id: <20200410113526.13822-1-colin.king@canonical.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

788109c1

31 3月, 2020 1 次提交

KVM: Pass kvm_init()'s opaque param to additional arch funcs · b9904085

由 Sean Christopherson 提交于 3月 21, 2020

Pass @opaque to kvm_arch_hardware_setup() and
kvm_arch_check_processor_compat() to allow architecture specific code to
reference @opaque without having to stash it away in a temporary global
variable.  This will enable x86 to separate its vendor specific callback
ops, which are passed via @opaque, into "init" and "runtime" ops without
having to stash away the "init" ops.

No functional change intended.
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Tested-by: Cornelia Huck <cohuck@redhat.com> #s390
Acked-by: NMarc Zyngier <maz@kernel.org>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200321202603.19355-2-sean.j.christopherson@intel.com>
Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b9904085

26 3月, 2020 1 次提交

KVM: Fix out of range accesses to memslots · 0774a964

由 Sean Christopherson 提交于 3月 20, 2020

Reset the LRU slot if it becomes invalid when deleting a memslot to fix
an out-of-bounds/use-after-free access when searching through memslots.

Explicitly check for there being no used slots in search_memslots(), and
in the caller of s390's approximation variant.

Fixes: 36947254 ("KVM: Dynamically size memslot array based on number of used slots")
Reported-by: NQian Cai <cai@lca.pw>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200320205546.2396-2-sean.j.christopherson@intel.com>
Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0774a964

17 3月, 2020 17 次提交

KVM: Drop largepages_enabled and its accessor/mutator · 600087b6

由 Sean Christopherson 提交于 3月 02, 2020

Drop largepages_enabled, kvm_largepages_enabled() and
kvm_disable_largepages() now that all users are gone.

Note, largepages_enabled was an x86-only flag that got left in common
KVM code when KVM gained support for multiple architectures.

No functional change intended.
Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

600087b6

KVM: Drop gfn_to_pfn_atomic() · 2bde08f9

由 Peter Xu 提交于 3月 04, 2020

It's never used anywhere now.
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2bde08f9

KVM: x86: enable dirty log gradually in small chunks · 3c9bd400

由 Jay Zhou 提交于 2月 27, 2020

It could take kvm->mmu_lock for an extended period of time when
enabling dirty log for the first time. The main cost is to clear
all the D-bits of last level SPTEs. This situation can benefit from
manual dirty log protect as well, which can reduce the mmu_lock
time taken. The sequence is like this:

1. Initialize all the bits of the dirty bitmap to 1 when enabling
   dirty log for the first time
2. Only write protect the huge pages
3. KVM_GET_DIRTY_LOG returns the dirty bitmap info
4. KVM_CLEAR_DIRTY_LOG will clear D-bit for each of the leaf level
   SPTEs gradually in small chunks

Under the Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz environment,
I did some tests with a 128G windows VM and counted the time taken
of memory_global_dirty_log_start, here is the numbers:

VM Size        Before    After optimization
128G           460ms     10ms
Signed-off-by: NJay Zhou <jianjay.zhou@huawei.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3c9bd400

KVM: Dynamically size memslot array based on number of used slots · 36947254

由 Sean Christopherson 提交于 2月 18, 2020

Now that the memslot logic doesn't assume memslots are always non-NULL,
dynamically size the array of memslots instead of unconditionally
allocating memory for the maximum number of memslots.

Note, because a to-be-deleted memslot must first be invalidated, the
array size cannot be immediately reduced when deleting a memslot.
However, consecutive deletions will realize the memory savings, i.e.
a second deletion will trim the entry.
Tested-by: NChristoffer Dall <christoffer.dall@arm.com>
Tested-by: NMarc Zyngier <maz@kernel.org>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

36947254

KVM: Terminate memslot walks via used_slots · 0577d1ab

由 Sean Christopherson 提交于 2月 18, 2020

Refactor memslot handling to treat the number of used slots as the de
facto size of the memslot array, e.g. return NULL from id_to_memslot()
when an invalid index is provided instead of relying on npages==0 to
detect an invalid memslot. Rework the sorting and walking of memslots
in advance of dynamically sizing memslots to aid bisection and debug,
e.g. with luck, a bug in the refactoring will bisect here and/or hit a
WARN instead of randomly corrupting memory.

Alternatively, a global null/invalid memslot could be returned, i.e. so
callers of id_to_memslot() don't have to explicitly check for a NULL
memslot, but that approach runs the risk of introducing difficult-to-
debug issues, e.g. if the global null slot is modified. Constifying
the return from id_to_memslot() to combat such issues is possible, but
would require a massive refactoring of arch specific code and would
still be susceptible to casting shenanigans.

Add function comments to update_memslots() and search_memslots() to
explicitly (and loudly) state how memslots are sorted.

Opportunistically stuff @hva with a non-canonical value when deleting a
private memslot on x86 to detect bogus usage of the freed slot.

No functional change intended.
Tested-by: NChristoffer Dall <christoffer.dall@arm.com>
Tested-by: NMarc Zyngier <maz@kernel.org>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0577d1ab

KVM: Ensure validity of memslot with respect to kvm_get_dirty_log() · 2a49f61d

由 Sean Christopherson 提交于 2月 18, 2020

Rework kvm_get_dirty_log() so that it "returns" the associated memslot
on success. A future patch will rework memslot handling such that
id_to_memslot() can return NULL, returning the memslot makes it more
obvious that the validity of the memslot has been verified, i.e.
precludes the need to add validity checks in the arch code that are
technically unnecessary.

To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
This is a nop for PPC, the only other arch that doesn't select
KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.

Ideally, moving the sync_dirty_log() call would be done in a separate
patch, but it can't be done in a follow-on patch because that would
temporarily break s390's ordering. Making the move in a preparatory
patch would be functionally correct, but would create an odd scenario
where the moved sync_dirty_log() would operate on a "different" memslot
due to consuming the result of a different id_to_memslot(). The
memslot couldn't actually be different as slots_lock is held, but the
code is confusing enough as it is, i.e. moving sync_dirty_log() in this
patch is the lesser of all evils.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2a49f61d

KVM: Provide common implementation for generic dirty log functions · 0dff0846

由 Sean Christopherson 提交于 2月 18, 2020

Move the implementations of KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG
for CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT into common KVM code.
The arch specific implemenations are extremely similar, differing
only in whether the dirty log needs to be sync'd from hardware (x86)
and how the TLBs are flushed. Add new arch hooks to handle sync
and TLB flush; the sync will also be used for non-generic dirty log
support in a future patch (s390).

The ulterior motive for providing a common implementation is to
eliminate the dependency between arch and common code with respect to
the memslot referenced by the dirty log, i.e. to make it obvious in the
code that the validity of the memslot is guaranteed, as a future patch
will rework memslot handling such that id_to_memslot() can return NULL.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0dff0846

KVM: Clean up local variable usage in __kvm_set_memory_region() · 163da372

由 Sean Christopherson 提交于 2月 18, 2020

Clean up __kvm_set_memory_region() to achieve several goals:

  - Remove local variables that serve no real purpose
  - Improve the readability of the code
  - Better show the relationship between the 'old' and 'new' memslot
  - Prepare for dynamically sizing memslots
  - Document subtle gotchas (via comments)

Note, using 'tmp' to hold the initial memslot is not strictly necessary
at this juncture, e.g. 'old' could be directly copied from
id_to_memslot(), but keep the pointer usage as id_to_memslot() will be
able to return a NULL pointer once memslots are dynamically sized.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

163da372

KVM: Simplify kvm_free_memslot() and all its descendents · e96c81ee

由 Sean Christopherson 提交于 2月 18, 2020

Now that all callers of kvm_free_memslot() pass NULL for @dont, remove
the param from the top-level routine and all arch's implementations.

No functional change intended.
Tested-by: NChristoffer Dall <christoffer.dall@arm.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e96c81ee

KVM: Move memslot deletion to helper function · 5c0b4f3d

由 Sean Christopherson 提交于 2月 18, 2020

Move memslot deletion into its own routine so that the success path for
other memslot updates does not need to use kvm_free_memslot(), i.e. can
explicitly destroy the dirty bitmap when necessary.  This paves the way
for dropping @dont from kvm_free_memslot(), i.e. all callers now pass
NULL for @dont.

Add a comment above the code to make a copy of the existing memslot
prior to deletion, it is not at all obvious that the pointer will become
stale during sorting and/or installation of new memslots.

Note, kvm_arch_commit_memory_region() allows an architecture to free
resources when moving a memslot or changing its flags, e.g. x86 frees
its arch specific memslot metadata during commit_memory_region().
Acked-by: NChristoffer Dall <christoffer.dall@arm.com>
Tested-by: NChristoffer Dall <christoffer.dall@arm.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5c0b4f3d

KVM: Drop "const" attribute from old memslot in commit_memory_region() · 9d4c197c

由 Sean Christopherson 提交于 2月 18, 2020

Drop the "const" attribute from @old in kvm_arch_commit_memory_region()
to allow arch specific code to free arch specific resources in the old
memslot without having to cast away the attribute. Freeing resources in
kvm_arch_commit_memory_region() paves the way for simplifying
kvm_free_memslot() by eliminating the last usage of its @dont param.
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9d4c197c

KVM: Move setting of memslot into helper routine · cf47f50b

由 Sean Christopherson 提交于 2月 18, 2020

Split out the core functionality of setting a memslot into a separate
helper in preparation for moving memslot deletion into its own routine.
Tested-by: NChristoffer Dall <christoffer.dall@arm.com>
Reviewed-by: NPhilippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cf47f50b

KVM: Refactor error handling for setting memory region · 71a4c30b

由 Sean Christopherson 提交于 2月 18, 2020

Replace a big pile o' gotos with returns to make it more obvious what
error code is being returned, and to prepare for refactoring the
functional, i.e. post-checks, portion of __kvm_set_memory_region().
Reviewed-by: NJanosch Frank <frankja@linux.ibm.com>
Reviewed-by: NPhilippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

71a4c30b

KVM: Explicitly free allocated-but-unused dirty bitmap · bd0e96fd

由 Sean Christopherson 提交于 2月 18, 2020

Explicitly free an allocated-but-unused dirty bitmap instead of relying
on kvm_free_memslot() if an error occurs in __kvm_set_memory_region().
There is no longer a need to abuse kvm_free_memslot() to free arch
specific resources as arch specific code is now called only after the
common flow is guaranteed to succeed. Arch code can still fail, but
it's responsible for its own cleanup in that case.

Eliminating the error path's abuse of kvm_free_memslot() paves the way
for simplifying kvm_free_memslot(), i.e. dropping its @dont param.
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bd0e96fd

KVM: Drop kvm_arch_create_memslot() · 414de7ab

由 Sean Christopherson 提交于 2月 18, 2020

Remove kvm_arch_create_memslot() now that all arch implementations are
effectively nops.  Removing kvm_arch_create_memslot() eliminates the
possibility for arch specific code to allocate memory prior to setting
a memslot, which sets the stage for simplifying kvm_free_memslot().

Cc: Janosch Frank <frankja@linux.ibm.com>
Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

414de7ab

KVM: Don't free new memslot if allocation of said memslot fails · 13f67889

由 Sean Christopherson 提交于 2月 18, 2020

The two implementations of kvm_arch_create_memslot() in x86 and PPC are
both good citizens and free up all local resources if creation fails.
Return immediately (via a superfluous goto) instead of calling
kvm_free_memslot().

Note, the call to kvm_free_memslot() is effectively an expensive nop in
this case as there are no resources to be freed.

No functional change intended.
Acked-by: NChristoffer Dall <christoffer.dall@arm.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

13f67889

KVM: Reinstall old memslots if arch preparation fails · 13ea5255

由 Sean Christopherson 提交于 2月 18, 2020

Reinstall the old memslots if preparing the new memory region fails
after invalidating a to-be-{re}moved memslot.

Remove the superfluous 'old_memslots' variable so that it's somewhat
clear that the error handling path needs to free the unused memslots,
not simply the 'old' memslots.

Fixes: bc6678a3 ("KVM: introduce kvm->srcu and convert kvm_set_memory_region to SRCU update")
Reviewed-by: NChristoffer Dall <christoffer.dall@arm.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

13ea5255

12 2月, 2020 1 次提交

KVM: Disable preemption in kvm_get_running_vcpu() · 1f03b2bc

由 Marc Zyngier 提交于 2月 07, 2020

Accessing a per-cpu variable only makes sense when preemption is
disabled (and the kernel does check this when the right debug options
are switched on).

For kvm_get_running_vcpu(), it is fine to return the value after
re-enabling preemption, as the preempt notifiers will make sure that
this is kept consistent across task migration (the comment above the
function hints at it, but lacks the crucial preemption management).

While we're at it, move the comment from the ARM code, which explains
why the whole thing works.

Fixes: 7495e22b ("KVM: Move running VCPU from ARM to common code").
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: NZenghui Yu <yuzenghui@huawei.com>
Tested-by: NZenghui Yu <yuzenghui@huawei.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NMarc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/318984f6-bc36-33a3-abc6-bf2295974b06@huawei.com
Message-id: <20200207163410.31276-1-maz@kernel.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1f03b2bc

05 2月, 2020 1 次提交

KVM: fix overflow of zero page refcount with ksm running · 7df003c8

由 Zhuang Yanying 提交于 10月 12, 2019

We are testing Virtual Machine with KSM on v5.4-rc2 kernel,
and found the zero_page refcount overflow.
The cause of refcount overflow is increased in try_async_pf
(get_user_page) without being decreased in mmu_set_spte()
while handling ept violation.
In kvm_release_pfn_clean(), only unreserved page will call
put_page. However, zero page is reserved.
So, as well as creating and destroy vm, the refcount of
zero page will continue to increase until it overflows.

step1:
echo 10000 > /sys/kernel/pages_to_scan/pages_to_scan
echo 1 > /sys/kernel/pages_to_scan/run
echo 1 > /sys/kernel/pages_to_scan/use_zero_pages

step2:
just create several normal qemu kvm vms.
And destroy it after 10s.
Repeat this action all the time.

After a long period of time, all domains hang because
of the refcount of zero page overflow.

Qemu print error log as follow:
 â€¦
 error: kvm run failed Bad address
 EAX=00006cdc EBX=00000008 ECX=80202001 EDX=078bfbfd
 ESI=ffffffff EDI=00000000 EBP=00000008 ESP=00006cc4
 EIP=000efd75 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
 CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
 SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
 DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
 FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
 GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
 LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
 TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
 GDT=     000f7070 00000037
 IDT=     000f70ae 00000000
 CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
 DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
 DR6=00000000ffff0ff0 DR7=0000000000000400
 EFER=0000000000000000
 Code=00 01 00 00 00 e9 e8 00 00 00 c7 05 4c 55 0f 00 01 00 00 00 <8b> 35 00 00 01 00 8b 3d 04 00 01 00 b8 d8 d3 00 00 c1 e0 08 0c ea a3 00 00 01 00 c7 05 04
 â€¦

Meanwhile, a kernel warning is departed.

 [40914.836375] WARNING: CPU: 3 PID: 82067 at ./include/linux/mm.h:987 try_get_page+0x1f/0x30
 [40914.836412] CPU: 3 PID: 82067 Comm: CPU 0/KVM Kdump: loaded Tainted: G           OE     5.2.0-rc2 #5
 [40914.836415] RIP: 0010:try_get_page+0x1f/0x30
 [40914.836417] Code: 40 00 c3 0f 1f 84 00 00 00 00 00 48 8b 47 08 a8 01 75 11 8b 47 34 85 c0 7e 10 f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb e9 <0f> 0b 31 c0 c3 66 90 66 2e 0f 1f 84 00 0
 0 00 00 00 48 8b 47 08 a8
 [40914.836418] RSP: 0018:ffffb4144e523988 EFLAGS: 00010286
 [40914.836419] RAX: 0000000080000000 RBX: 0000000000000326 RCX: 0000000000000000
 [40914.836420] RDX: 0000000000000000 RSI: 00004ffdeba10000 RDI: ffffdf07093f6440
 [40914.836421] RBP: ffffdf07093f6440 R08: 800000424fd91225 R09: 0000000000000000
 [40914.836421] R10: ffff9eb41bfeebb8 R11: 0000000000000000 R12: ffffdf06bbd1e8a8
 [40914.836422] R13: 0000000000000080 R14: 800000424fd91225 R15: ffffdf07093f6440
 [40914.836423] FS:  00007fb60ffff700(0000) GS:ffff9eb4802c0000(0000) knlGS:0000000000000000
 [40914.836425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [40914.836426] CR2: 0000000000000000 CR3: 0000002f220e6002 CR4: 00000000003626e0
 [40914.836427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [40914.836427] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [40914.836428] Call Trace:
 [40914.836433]  follow_page_pte+0x302/0x47b
 [40914.836437]  __get_user_pages+0xf1/0x7d0
 [40914.836441]  ? irq_work_queue+0x9/0x70
 [40914.836443]  get_user_pages_unlocked+0x13f/0x1e0
 [40914.836469]  __gfn_to_pfn_memslot+0x10e/0x400 [kvm]
 [40914.836486]  try_async_pf+0x87/0x240 [kvm]
 [40914.836503]  tdp_page_fault+0x139/0x270 [kvm]
 [40914.836523]  kvm_mmu_page_fault+0x76/0x5e0 [kvm]
 [40914.836588]  vcpu_enter_guest+0xb45/0x1570 [kvm]
 [40914.836632]  kvm_arch_vcpu_ioctl_run+0x35d/0x580 [kvm]
 [40914.836645]  kvm_vcpu_ioctl+0x26e/0x5d0 [kvm]
 [40914.836650]  do_vfs_ioctl+0xa9/0x620
 [40914.836653]  ksys_ioctl+0x60/0x90
 [40914.836654]  __x64_sys_ioctl+0x16/0x20
 [40914.836658]  do_syscall_64+0x5b/0x180
 [40914.836664]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 [40914.836666] RIP: 0033:0x7fb61cb6bfc7
Signed-off-by: NLinFeng <linfeng23@huawei.com>
Signed-off-by: NZhuang Yanying <ann.zhuangyanying@huawei.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7df003c8

31 1月, 2020 1 次提交

x86/kvm: Cache gfn to pfn translation · 91724814

由 Boris Ostrovsky 提交于 12月 05, 2019

__kvm_map_gfn()'s call to gfn_to_pfn_memslot() is
* relatively expensive
* in certain cases (such as when done from atomic context) cannot be called

Stashing gfn-to-pfn mapping should help with both cases.

This is part of CVE-2019-3016.
Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: NJoao Martins <joao.m.martins@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

91724814

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功