提交 · 4e15a073a168b62311db911a55c4d4f1500c2821 · gsplhtlxg / clone-Linux

27 10月, 2018 1 次提交

Revert "mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks" · 4e15a073

由 Michal Hocko 提交于 10月 26, 2018

Revert 5ff7091f ("mm, mmu_notifier: annotate mmu notifiers with
blockable invalidate callbacks").

MMU_INVALIDATE_DOES_NOT_BLOCK flags was the only one used and it is no
longer needed since 93065ac7 ("mm, oom: distinguish blockable mode for
mmu notifiers").  We now have a full support for per range !blocking
behavior so we can drop the stop gap workaround which the per notifier
flag was used for.

Link: http://lkml.kernel.org/r/20180827112623.8992-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4e15a073

17 10月, 2018 4 次提交

KVM: refine the comment of function gfn_to_hva_memslot_prot() · 970c0d4b

由 Wei Yang 提交于 10月 09, 2018

The original comment is little hard to understand.

No functional change, just amend the comment a little.
Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

970c0d4b

kvm/x86 : add coalesced pio support · 0804c849

由 Peng Hao 提交于 10月 14, 2018

Coalesced pio is based on coalesced mmio and can be used for some port
like rtc port, pci-host config port and so on.

Specially in case of rtc as coalesced pio, some versions of windows guest
access rtc frequently because of rtc as system tick. guest access rtc like
this: write register index to 0x70, then write or read data from 0x71.
writing 0x70 port is just as index and do nothing else. So we can use
coalesced pio to handle this scene to reduce VM-EXIT time.

When starting and closing a virtual machine, it will access pci-host config
port frequently. So setting these port as coalesced pio can reduce startup
and shutdown time.

without my patch, get the vm-exit time of accessing rtc 0x70 and piix 0xcf8
using perf tools: (guest OS : windows 7 64bit)
IO Port Access Samples Samples% Time% Min Time Max Time Avg time
0x70:POUT 86 30.99% 74.59% 9us 29us 10.75us (+- 3.41%)
0xcf8:POUT 1119 2.60% 2.12% 2.79us 56.83us 3.41us (+- 2.23%)

with my patch
IO Port Access Samples Samples% Time% Min Time Max Time Avg time
0x70:POUT 106 32.02% 29.47% 0us 10us 1.57us (+- 7.38%)
0xcf8:POUT 1065 1.67% 0.28% 0.41us 65.44us 0.66us (+- 10.55%)
Signed-off-by: NPeng Hao <peng.hao2@zte.com.cn>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0804c849

KVM: leverage change to adjust slots->used_slots in update_memslots() · 31fc4f95

由 Wei Yang 提交于 8月 22, 2018

update_memslots() is only called by __kvm_set_memory_region(), in which
"change" is calculated and indicates how to adjust slots->used_slots

  * increase by one if it is KVM_MR_CREATE
  * decrease by one if it is KVM_MR_DELETE
  * not change for others

This patch adjusts slots->used_slots in update_memslots() based on "change"
value instead of re-calculate those states again.
Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

31fc4f95

KVM: x86: hyperv: optimize 'all cpus' case in kvm_hv_flush_tlb() · a812297c

由 Vitaly Kuznetsov 提交于 8月 22, 2018

We can use 'NULL' to represent 'all cpus' case in
kvm_make_vcpus_request_mask() and avoid building vCPU mask with
all vCPUs.
Suggested-by: NRadim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a812297c

23 8月, 2018 1 次提交

mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7

由 Michal Hocko 提交于 8月 21, 2018

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep.  That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.

We can do much better though.  Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.  Moreover
majority of notifiers only care about a portion of the address space and
there is absolutely zero reason to fail when we are unmapping an unrelated
range.  Many notifiers do really block and wait for HW which is harder to
handle and we have to bail out though.

This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false.  This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.

I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that.  The first
part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode.  A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom.  This can be done e.g.  after the test faults in all
the mmu notifier managed memory and set the hard limit to something really
small.  Then we are looking for a proper process tear down.

[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: NDavid Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

93065ac7

06 8月, 2018 3 次提交

KVM: try __get_user_pages_fast even if not in atomic context · b9b33da2

由 Paolo Bonzini 提交于 7月 27, 2018

We are currently cutting hva_to_pfn_fast short if we do not want an
immediate exit, which is represented by !async && !atomic.  However,
this is unnecessary, and __get_user_pages_fast is *much* faster
because the regular get_user_pages takes pmd_lock/pte_lock.
In fact, when many CPUs take a nested vmexit at the same time
the contention on those locks is visible, and this patch removes
about 25% (compared to 4.18) from vmexit.flat on a 16 vCPU
nested guest.
Suggested-by: NAndrea Arcangeli <aarcange@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b9b33da2

KVM: x86: Add tlb remote flush callback in kvm_x86_ops. · b08660e5

由 Tianyu Lan 提交于 7月 19, 2018

This patch is to provide a way for platforms to register hv tlb remote
flush callback and this helps to optimize operation of tlb flush
among vcpus for nested virtualization case.
Signed-off-by: NLan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b08660e5

kvm: x86: Use fast CR3 switch for nested VMX · 50c28f21

由 Junaid Shahid 提交于 6月 27, 2018

Use the fast CR3 switch mechanism to locklessly change the MMU root
page when switching between L1 and L2. The switch from L2 to L1 should
always go through the fast path, while the switch from L1 to L2 should
go through the fast path if L1's CR3/EPTP for L2 hasn't changed
since the last time.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

50c28f21

21 7月, 2018 1 次提交
- E
  kvm: Don't open code task_pid in kvm_vcpu_ioctl · 71dbc8a9
  由 Eric W. Biederman 提交于 7月 16, 2017
```
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
```
  71dbc8a9
13 7月, 2018 1 次提交

KVM: s390: a utility function for migration · 03133347

由 Claudio Imbrenda 提交于 4月 30, 2018

Introduce a utility function that will be used later on for storage
attributes migration, and use it in kvm_main.c to replace existing code
that does the same thing.
Signed-off-by: NClaudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Message-Id: <1525106005-13931-2-git-send-email-imbrenda@linux.vnet.ibm.com>
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>

03133347

22 6月, 2018 1 次提交

KVM: Enforce error in ioctl for compat tasks when !KVM_COMPAT · 7ddfd3e0

由 Marc Zyngier 提交于 6月 17, 2018

The current behaviour of the compat ioctls is a bit odd.
We provide a compat_ioctl method when KVM_COMPAT is set, and NULL
otherwise. But NULL means that the normal, non-compat ioctl should
be used directly for compat tasks, and there is no way to actually
prevent a compat task from issueing KVM ioctls.

This patch changes this behaviour, by always registering a compat_ioctl
method, even if KVM_COMPAT is not selected. In that case, the callback
will always return -EINVAL.

Fixes: de8e5d74 ("KVM: Disable compat ioctl for s390")
Reported-by: NMark Rutland <mark.rutland@arm.com>
Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Acked-by: NRadim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>

7ddfd3e0

20 6月, 2018 1 次提交

sched/swait: Rename to exclusive · b3dae109

由 Peter Zijlstra 提交于 6月 12, 2018

Since swait basically implemented exclusive waits only, make sure
the API reflects that.

  $ git grep -l -e "\<swake_up\>"
		-e "\<swait_event[^ (]*"
		-e "\<prepare_to_swait\>" | while read file;
    do
	sed -i -e 's/\<swake_up\>/&_one/g'
	       -e 's/\<swait_event[^ (]*/&_exclusive/g'
	       -e 's/\<prepare_to_swait\>/&_exclusive/g' $file;
    done

With a few manual touch-ups.
Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: bigeasy@linutronix.de
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180612083909.261946548@infradead.org

b3dae109

13 6月, 2018 1 次提交

treewide: Use array_size() in vmalloc() · 42bc47b3

由 Kees Cook 提交于 6月 12, 2018

The vmalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vmalloc(a * b)

with:
        vmalloc(array_size(a, b))

as well as handling cases of:

        vmalloc(a * b * c)

with:

        vmalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vmalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vmalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vmalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vmalloc(C1 * C2 * C3, ...)
|
  vmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vmalloc(C1 * C2, ...)
|
  vmalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)
Signed-off-by: NKees Cook <keescook@chromium.org>

42bc47b3

02 6月, 2018 2 次提交

kvm: no need to check return value of debugfs_create functions · 929f45e3

由 Greg Kroah-Hartman 提交于 5月 29, 2018

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

This cleans up the error handling a lot, as this code will never get
hit.

Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrÄmÃ¡Å™" <rkrcmar@redhat.com>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: kvmarm@lists.cs.columbia.edu
Cc: kvm@vger.kernel.org
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

929f45e3

kvm: Change return type to vm_fault_t · 1499fa80

由 Souptick Joarder 提交于 4月 19, 2018

Use new return type vm_fault_t for fault handler. For
now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.

commit 1c8f4220 ("mm: change return type to vm_fault_t")
Signed-off-by: NSouptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1499fa80

26 5月, 2018 1 次提交

KVM: introduce kvm_make_vcpus_request_mask() API · 7053df4e

由 Vitaly Kuznetsov 提交于 5月 16, 2018

Hyper-V style PV TLB flush hypercalls inmplementation will use this API.
To avoid memory allocation in CONFIG_CPUMASK_OFFSTACK case add
cpumask_var_t argument.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

7053df4e

25 5月, 2018 1 次提交

KVM: arm/arm64: Introduce kvm_arch_vcpu_run_pid_change · bd2a6394

由 Christoffer Dall 提交于 2月 23, 2018

KVM/ARM differs from other architectures in having to maintain an
additional virtual address space from that of the host and the
guest, because we split the execution of KVM across both EL1 and
EL2.

This results in a need to explicitly map data structures into EL2
(hyp) which are accessed from the hyp code.  As we are about to be
more clever with our FPSIMD handling on arm64, which stores data in
the task struct and uses thread_info flags, we will have to map
parts of the currently executing task struct into the EL2 virtual
address space.

However, we don't want to do this on every KVM_RUN, because it is a
fairly expensive operation to walk the page tables, and the common
execution mode is to map a single thread to a VCPU.  By introducing
a hook that architectures can select with
HAVE_KVM_VCPU_RUN_PID_CHANGE, we do not introduce overhead for
other architectures, but have a simple way to only map the data we
need when required for arm64.

This patch introduces the framework only, and wires it up in the
arm/arm64 KVM common code.

No functional change.
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: NDave Martin <Dave.Martin@arm.com>
Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
Reviewed-by: NAlex Bennée <alex.bennee@linaro.org>
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>

bd2a6394

07 3月, 2018 1 次提交

kvm: use insert sort in kvm_io_bus_register_dev function · d4c67a7a

由 Gal Hammer 提交于 1月 16, 2018

The loading time of a VM is quite significant with a CPU usage
reaching 100% when loading a VM that its virtio devices use a
large amount of virt-queues (e.g. a virtio-serial device with
max_ports=511). Most of the time is spend in re-sorting the
kvm_io_bus kvm_io_range array when a new eventfd is registered.

The patch replaces the existing method with an insert sort.
Reviewed-by: NMarcel Apfelbaum <marcel@redhat.com>
Reviewed-by: NUri Lublin <ulublin@redhat.com>
Signed-off-by: NGal Hammer <ghammer@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

d4c67a7a

24 2月, 2018 1 次提交

KVM: mmu: Fix overlap between public and private memslots · b28676bb

由 Wanpeng Li 提交于 2月 13, 2018

Reported by syzkaller:

    pte_list_remove: ffff9714eb1f8078 0->BUG
    ------------[ cut here ]------------
    kernel BUG at arch/x86/kvm/mmu.c:1157!
    invalid opcode: 0000 [#1] SMP
    RIP: 0010:pte_list_remove+0x11b/0x120 [kvm]
    Call Trace:
     drop_spte+0x83/0xb0 [kvm]
     mmu_page_zap_pte+0xcc/0xe0 [kvm]
     kvm_mmu_prepare_zap_page+0x81/0x4a0 [kvm]
     kvm_mmu_invalidate_zap_all_pages+0x159/0x220 [kvm]
     kvm_arch_flush_shadow_all+0xe/0x10 [kvm]
     kvm_mmu_notifier_release+0x6c/0xa0 [kvm]
     ? kvm_mmu_notifier_release+0x5/0xa0 [kvm]
     __mmu_notifier_release+0x79/0x110
     ? __mmu_notifier_release+0x5/0x110
     exit_mmap+0x15a/0x170
     ? do_exit+0x281/0xcb0
     mmput+0x66/0x160
     do_exit+0x2c9/0xcb0
     ? __context_tracking_exit.part.5+0x4a/0x150
     do_group_exit+0x50/0xd0
     SyS_exit_group+0x14/0x20
     do_syscall_64+0x73/0x1f0
     entry_SYSCALL64_slow_path+0x25/0x25

The reason is that when creates new memslot, there is no guarantee for new
memslot not overlap with private memslots. This can be triggered by the
following program:

   #include <fcntl.h>
   #include <pthread.h>
   #include <setjmp.h>
   #include <signal.h>
   #include <stddef.h>
   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <sys/stat.h>
   #include <sys/syscall.h>
   #include <sys/types.h>
   #include <unistd.h>
   #include <linux/kvm.h>

   long r[16];

   int main()
   {
	void *p = valloc(0x4000);

	r[2] = open("/dev/kvm", 0);
	r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul);

	uint64_t addr = 0xf000;
	ioctl(r[3], KVM_SET_IDENTITY_MAP_ADDR, &addr);
	r[6] = ioctl(r[3], KVM_CREATE_VCPU, 0x0ul);
	ioctl(r[3], KVM_SET_TSS_ADDR, 0x0ul);
	ioctl(r[6], KVM_RUN, 0);
	ioctl(r[6], KVM_RUN, 0);

	struct kvm_userspace_memory_region mr = {
		.slot = 0,
		.flags = KVM_MEM_LOG_DIRTY_PAGES,
		.guest_phys_addr = 0xf000,
		.memory_size = 0x4000,
		.userspace_addr = (uintptr_t) p
	};
	ioctl(r[3], KVM_SET_USER_MEMORY_REGION, &mr);
	return 0;
   }

This patch fixes the bug by not adding a new memslot even if it
overlaps with private memslots.
Reported-by: NDmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
---
 virt/kvm/kvm_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

b28676bb

01 2月, 2018 3 次提交

mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks · 5ff7091f

由 David Rientjes 提交于 1月 31, 2018

Commit 4d4bbd85 ("mm, oom_reaper: skip mm structs with mmu
notifiers") prevented the oom reaper from unmapping private anonymous
memory with the oom reaper when the oom victim mm had mmu notifiers
registered.

The rationale is that doing mmu_notifier_invalidate_range_{start,end}()
around the unmap_page_range(), which is needed, can block and the oom
killer will stall forever waiting for the victim to exit, which may not
be possible without reaping.

That concern is real, but only true for mmu notifiers that have
blockable invalidate_range_{start,end}() callbacks.  This patch adds a
"flags" field to mmu notifier ops that can set a bit to indicate that
these callbacks do not block.

The implementation is steered toward an expensive slowpath, such as
after the oom reaper has grabbed mm->mmap_sem of a still alive oom
victim.

[rientjes@google.com: mmu_notifier_invalidate_range_end() can also call the invalidate_range() must not block, fix comment]
  Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801091339570.240101@chino.kir.corp.google.com
[akpm@linux-foundation.org: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141329500.74052@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Acked-by: NDimitri Sivanich <sivanich@hpe.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5ff7091f

kvm: embed vcpu id to dentry of vcpu anon inode · e46b4692

由 Masatake YAMATO 提交于 1月 20, 2018

All d-entries for vcpu have the same, "anon_inode:kvm-vcpu". That means
it is impossible to know the mapping between fds for vcpu and vcpu
from userland.

    # LC_ALL=C ls -l /proc/617/fd | grep vcpu
    lrwx------. 1 qemu qemu 64 Jan  7 16:50 18 -> anon_inode:kvm-vcpu
    lrwx------. 1 qemu qemu 64 Jan  7 16:50 19 -> anon_inode:kvm-vcpu

It is also impossible to know the mapping between vma for kvm_run
structure and vcpu from userland.

    # LC_ALL=C grep vcpu /proc/617/maps
    7f9d842d0000-7f9d842d3000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu
    7f9d842d3000-7f9d842d6000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu

This change adds vcpu id to d-entries for vcpu. With this change
you can get the following output:

    # LC_ALL=C ls -l /proc/617/fd | grep vcpu
    lrwx------. 1 qemu qemu 64 Jan  7 16:50 18 -> anon_inode:kvm-vcpu:0
    lrwx------. 1 qemu qemu 64 Jan  7 16:50 19 -> anon_inode:kvm-vcpu:1

    # LC_ALL=C grep vcpu /proc/617/maps
    7f9d842d0000-7f9d842d3000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu:0
    7f9d842d3000-7f9d842d6000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu:1

With the mappings known from the output, a tool like strace can report more details
of qemu-kvm process activities. Here is the strace output of my local prototype:

    # ./strace -KK -f -p 617 2>&1 | grep 'KVM_RUN\| K'
    ...
    [pid   664] ioctl(18, KVM_RUN, 0)       = 0 (KVM_EXIT_MMIO)
     K ready_for_interrupt_injection=1, if_flag=0, flags=0, cr8=0000000000000000, apic_base=0x000000fee00d00
     K phys_addr=0, len=1634035803, [33, 0, 0, 0, 0, 0, 0, 0], is_write=112
    [pid   664] ioctl(18, KVM_RUN, 0)       = 0 (KVM_EXIT_MMIO)
     K ready_for_interrupt_injection=1, if_flag=1, flags=0, cr8=0000000000000000, apic_base=0x000000fee00d00
     K phys_addr=0, len=1634035803, [33, 0, 0, 0, 0, 0, 0, 0], is_write=112
    ...
Signed-off-by: NMasatake YAMATO <yamato@redhat.com>
Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

e46b4692

kvm: Map PFN-type memory regions as writable (if possible) · a340b3e2

由 KarimAllah Ahmed 提交于 1月 17, 2018

For EPT-violations that are triggered by a read, the pages are also mapped with
write permissions (if their memory region is also writable). That would avoid
getting yet another fault on the same page when a write occurs.

This optimization only happens when you have a "struct page" backing the memory
region. So also enable it for memory regions that do not have a "struct page".

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NKarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

a340b3e2

16 1月, 2018 1 次提交

kvm: whitelist struct kvm_vcpu_arch · 46515736

由 Paolo Bonzini 提交于 10月 26, 2017

On x86, ARM and s390, struct kvm_vcpu_arch has a usercopy region
that is read and written by the KVM_GET/SET_CPUID2 ioctls (x86)
or KVM_GET/SET_ONE_REG (ARM/s390).  Without whitelisting the area,
KVM is completely broken on those architectures with usercopy hardening
enabled.

For now, allow writing to the entire struct on all architectures.
The KVM tree will not refine this to an architecture-specific
subset of struct kvm_vcpu_arch.

Cc: kernel-hardening@lists.openwall.com
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Borntraeger <borntraeger@redhat.com>
Cc: Christoffer Dall <cdall@linaro.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Acked-by: NChristoffer Dall <christoffer.dall@linaro.org>
Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NKees Cook <keescook@chromium.org>

46515736

14 12月, 2017 16 次提交

KVM: introduce kvm_arch_vcpu_async_ioctl · 5cb0944c

由 Paolo Bonzini 提交于 12月 12, 2017

After the vcpu_load/vcpu_put pushdown, the handling of asynchronous VCPU
ioctl is already much clearer in that it is obvious that they bypass
vcpu_load and vcpu_put.

However, it is still not perfect in that the different state of the VCPU
mutex is still hidden in the caller.  Separate those ioctls into a new
function kvm_arch_vcpu_async_ioctl that returns -ENOIOCTLCMD for more
"traditional" synchronous ioctls.

Cc: James Hogan <jhogan@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Suggested-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5cb0944c

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl · 9b062471

由 Christoffer Dall 提交于 12月 04, 2017

Move the calls to vcpu_load() and vcpu_put() in to the architecture
specific implementations of kvm_arch_vcpu_ioctl() which dispatches
further architecture-specific ioctls on to other functions.

Some architectures support asynchronous vcpu ioctls which cannot call
vcpu_load() or take the vcpu->mutex, because that would prevent
concurrent execution with a running VCPU, which is the intended purpose
of these ioctls, for example because they inject interrupts.

We repeat the separate checks for these specifics in the architecture
code for MIPS, S390 and PPC, and avoid taking the vcpu->mutex and
calling vcpu_load for these ioctls.
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9b062471

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_fpu · 6a96bc7f

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_fpu().
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6a96bc7f

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_fpu · 1393123e

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_fpu().
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1393123e

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_guest_debug · 66b56562

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_guest_debug().
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

66b56562

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_translate · 1da5b61d

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_translate().
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1da5b61d

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_mpstate · e83dff5e

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_mpstate().
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e83dff5e

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_mpstate · fd232561

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_mpstate().
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fd232561

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_sregs · b4ef9d4e

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_sregs().
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b4ef9d4e

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_sregs · bcdec41c

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_sregs().
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bcdec41c

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_regs · 875656fe

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_regs().
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

875656fe

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_regs · 1fc9b76b

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_regs().
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1fc9b76b

KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_run · accb757d

由 Christoffer Dall 提交于 12月 04, 2017

Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_run().
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390 parts
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
[Rebased. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

accb757d

KVM: Prepare for moving vcpu_load/vcpu_put into arch specific code · 8a32dd60

由 Christoffer Dall 提交于 12月 04, 2017

In preparation for moving calls to vcpu_load() and vcpu_put() into the
architecture specific implementations of the KVM vcpu ioctls, move the
calls in the main kvm_vcpu_ioctl() dispatcher function to each case
of the ioctl select statement.  This allows us to move the vcpu_load()
and vcpu_put() calls into architecture specific implementations of vcpu
ioctls, one by one.
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8a32dd60

KVM: Take vcpu->mutex outside vcpu_load · ec7660cc

由 Christoffer Dall 提交于 12月 04, 2017

As we're about to call vcpu_load() from architecture-specific
implementations of the KVM vcpu ioctls, but yet we access data
structures protected by the vcpu->mutex in the generic code, factor
this logic out from vcpu_load().

x86 is the only architecture which calls vcpu_load() outside of the main
vcpu ioctl function, and these calls will no longer take the vcpu mutex
following this patch.  However, with the exception of
kvm_arch_vcpu_postcreate (see below), the callers are either in the
creation or destruction path of the VCPU, which means there cannot be
any concurrent access to the data structure, because the file descriptor
is not yet accessible, or is already gone.

kvm_arch_vcpu_postcreate makes the newly created vcpu potentially
accessible by other in-kernel threads through the kvm->vcpus array, and
we therefore take the vcpu mutex in this case directly.
Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ec7660cc

kvm_main: Use common error handling code in kvm_dev_ioctl_create_vm() · 78588335

由 Markus Elfring 提交于 11月 21, 2017

Add a jump target so that a bit of exception handling can be better reused
at the end of this function.

This issue was detected by using the Coccinelle software.
Signed-off-by: NMarkus Elfring <elfring@users.sourceforge.net>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

78588335