Commits · 80b63faf028fba79e630d3643b0e615bddf4067b · openeuler / raspberrypi-kernel

24 Oct, 2010 9 commits

KVM: MMU: fix regression from rework mmu_shrink() code · 80b63faf

Authored by Xiaotian Feng on Aug 24, 2010

The latest rework of the kvm mmu_shrink code makes the kernel change
kvm->arch.n_used_mmu_pages/kvm->arch.n_max_mmu_pages in
kvm_mmu_free_page/kvm_mmu_alloc_page; kvm_mmu_free_page is called by
kvm_mmu_commit_zap_page. So kvm->arch.n_used_mmu_pages and
kvm_mmu_available_pages(vcpu->kvm) are unchanged after
kvm_mmu_prepare_zap_page(). This caused
kvm_mmu_change_mmu_pages/__kvm_mmu_free_some_pages to loop forever.
Moving kvm_mmu_commit_zap_page makes the while loop behave normally.
Reported-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Tested-by: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Tim Pepper <lnxninja@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

80b63faf
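
The accounting problem in this commit can be illustrated with a small user-space model. This is only a sketch: `struct toy_kvm` and the `toy_*` functions are hypothetical stand-ins, not the kernel's types or functions, and it assumes the fix places the commit step inside the loop so that the loop condition sees the counter drop.

```c
#include <assert.h>

/* Toy stand-in for per-VM MMU accounting; not the kernel's struct kvm. */
struct toy_kvm {
    unsigned int n_used_mmu_pages;  /* pages currently accounted as in use */
    unsigned int n_prepared;        /* pages queued for zapping, not yet freed */
};

/* Queue one page for zapping. The used counter is NOT touched here. */
static int toy_prepare_zap_page(struct toy_kvm *kvm)
{
    if (kvm->n_prepared == kvm->n_used_mmu_pages)
        return 0;                   /* nothing left to queue */
    kvm->n_prepared++;
    return 1;
}

/* Free everything queued so far; only now does the used counter drop. */
static void toy_commit_zap_page(struct toy_kvm *kvm)
{
    assert(kvm->n_prepared <= kvm->n_used_mmu_pages);
    kvm->n_used_mmu_pages -= kvm->n_prepared;
    kvm->n_prepared = 0;
}

/*
 * Shrink until at most `goal` pages are in use. Because the commit step runs
 * inside the loop, the loop condition observes progress on every iteration.
 * Returns the number of iterations performed.
 */
static unsigned int toy_shrink_to(struct toy_kvm *kvm, unsigned int goal)
{
    unsigned int iterations = 0;

    while (kvm->n_used_mmu_pages > goal) {
        if (!toy_prepare_zap_page(kvm))
            break;
        toy_commit_zap_page(kvm);   /* assumed placement: inside the loop */
        iterations++;
    }
    return iterations;
}
```

If the commit step ran only after the loop, a condition that tests `n_used_mmu_pages` would never see the value change, which is the behaviour the commit message describes.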

KVM: create aggregate kvm_total_used_mmu_pages value · 45221ab6

Authored by Dave Hansen on Aug 19, 2010

Of slab shrinkers, the VM code says:

 * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
 * querying the cache size, so a fastpath for that case is appropriate.

and it *means* it.  Look at how it calls the shrinkers:

    nr_before = (*shrinker->shrink)(0, gfp_mask);
    shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);

So, if you do anything stupid in your shrinker, the VM will doubly
punish you.

The mmu_shrink() function takes the global kvm_lock, then acquires
every VM's kvm->mmu_lock in sequence.  If we have 100 VMs, then
we're going to take 101 locks.  We do it twice, so each call takes
202 locks.  If we're under memory pressure, we can have each cpu
trying to do this.  It can get really hairy, and we've seen lock
spinning in mmu_shrink() be the dominant entry in profiles.

This is guaranteed to optimize at least half of those lock
acquisitions away.  It removes the need to take any of the locks
when simply trying to count objects.

A 'percpu_counter' can be a large object, but we only have one
of these for the entire system.  There are not any better
alternatives at the moment, especially ones that handle CPU
hotplug.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

45221ab6
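
The fast path described above can be sketched in user-space C. This is an illustration under assumptions, not the kernel code: a single C11 atomic stands in for the `percpu_counter`, the `toy_*` names are hypothetical, and `toy_locks_taken` exists only to show how many locks the slow path would have acquired.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical system-wide count of used MMU pages (stand-in for a percpu_counter). */
static atomic_long toy_total_used_mmu_pages;

/* Illustration only: number of locks the slow path would have taken. */
static unsigned long toy_locks_taken;

/* Called whenever any VM allocates (+1) or frees (-1) a shadow page. */
static void toy_mod_used_mmu_pages(long delta)
{
    atomic_fetch_add(&toy_total_used_mmu_pages, delta);
}

/*
 * Shrinker callback sketch. nr_to_scan == 0 means "report the cache size",
 * so that query is answered from the aggregate counter with no locking.
 */
static long toy_mmu_shrink(int nr_to_scan, unsigned int nr_vms)
{
    assert(nr_to_scan >= 0);

    if (nr_to_scan == 0)
        return atomic_load(&toy_total_used_mmu_pages);

    /* Slow path: one global lock plus one lock per VM. */
    toy_locks_taken += 1 + nr_vms;
    /* ... reclaim of up to nr_to_scan pages would happen here ... */
    return atomic_load(&toy_total_used_mmu_pages);
}
```

With 100 VMs, the size query takes no locks in this model, while the scanning call still takes 101; that matches the "at least half" claim in the commit message.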

KVM: replace x86 kvm n_free_mmu_pages with n_used_mmu_pages · 49d5ca26

Authored by Dave Hansen on Aug 19, 2010

Doing this makes the code much more readable.  That's
borne out by the fact that this patch removes code.  "used"
also happens to be the number that we need to return back to
the slab code when our shrinker gets called.  Keeping this
value as opposed to free makes the next patch simpler.

So, 'struct kvm' is kzalloc()'d.  'struct kvm_arch' is a
structure member (and not a pointer) of 'struct kvm'.  That
means they start out zeroed.  I _think_ they get initialized
properly by kvm_mmu_change_mmu_pages().  But, that only happens
via kvm ioctls.

Another benefit of storing 'used' instead of 'free' is
that the values are consistent from the moment the structure is
allocated: no negative "used" value.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

49d5ca26
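
A minimal sketch of why tracking "used" is consistent from allocation time: with a zeroed structure, "used" is 0 and the available count follows from the limit. The `toy_*` names are hypothetical, and clamping at zero is a choice made for this example rather than something the commit message states.

```c
#include <assert.h>

/* Toy stand-in for the relevant fields of 'struct kvm_arch'. */
struct toy_kvm_arch {
    unsigned int n_used_mmu_pages;  /* 0 right after a zeroing allocation */
    unsigned int n_max_mmu_pages;   /* limit, assumed to be set later via an ioctl */
};

/* Pages this VM may still allocate, derived from "used" and the limit. */
static unsigned int toy_mmu_available_pages(const struct toy_kvm_arch *arch)
{
    if (arch->n_used_mmu_pages >= arch->n_max_mmu_pages)
        return 0;                   /* clamp instead of wrapping around */
    return arch->n_max_mmu_pages - arch->n_used_mmu_pages;
}

/* Account for one newly allocated shadow page. */
static void toy_account_alloc(struct toy_kvm_arch *arch)
{
    assert(toy_mmu_available_pages(arch) > 0);
    arch->n_used_mmu_pages++;
}
```

A zero-initialized `struct toy_kvm_arch` reports zero used and zero available pages, so there is no window in which "used" would appear negative.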

KVM: rename x86 kvm->arch.n_alloc_mmu_pages · 39de71ec

Authored by Dave Hansen on Aug 19, 2010

arch.n_alloc_mmu_pages is a poor choice of name. This value truly
means, "the number of pages which _may_ be allocated".  But,
reading the name, "n_alloc_mmu_pages" implies "the number of allocated
mmu pages", which is dead wrong.

It's really the high watermark, so let's give it a name to match:
n_max_mmu_pages.  This change will make the next few patches
much more obvious and easy to read.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

39de71ec

KVM: abstract kvm x86 mmu->n_free_mmu_pages · e0df7b9f

Authored by Dave Hansen on Aug 19, 2010

"free" is a poor name for this value.  In this context, it means,
"the number of mmu pages which this kvm instance should be able to
allocate."  But "free" implies much more strongly that the objects are there
and ready for use.  "available" is a much better description, especially
when you see how it is calculated.

In this patch, we abstract its use into a function.  We'll soon
replace the function's contents by calculating the value in a
different way.

All of the reads of n_free_mmu_pages are taken care of in this
patch.  The modification sites will be handled in a patch
later in the series.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Tim Pepper <lnxninja@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

e0df7b9f

KVM: MMU: mark page dirty only when page is really written · 4132779b

Authored by Xiao Guangrong on Aug 02, 2010

Mark a page dirty only when the page is really written. This is more exact,
and it also fixes dirty page marking in the speculative path.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

4132779b

KVM: MMU: move bits lost judgement into a separate function · 8672b721

Authored by Xiao Guangrong on Aug 02, 2010

Introduce the spte_has_volatile_bits() function to judge whether spte
bits can be lost. It is more readable and helps us clean up code
later.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

8672b721
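
As a rough illustration of the kind of predicate this commit introduces, the sketch below combines the conditions mentioned in other commits in this log: no accessed tracking when the accessed mask is zero (the ept case), nothing to lose for a non-present spte, and nothing to lose when the accessed bit is already set. The `toy_*` names are hypothetical, the bit positions follow the x86 page-table layout, and the kernel's actual function may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PT_PRESENT_MASK  (1ULL << 0)   /* x86 present bit */
#define TOY_PT_ACCESSED_MASK (1ULL << 5)   /* x86 accessed bit */

/*
 * Returns true when hardware could still set a bit in this spte behind the
 * MMU's back, meaning an update has to be atomic to avoid losing it.
 * accessed_mask is 0 when the hardware format has no accessed bit.
 */
static bool toy_spte_has_volatile_bits(uint64_t spte, uint64_t accessed_mask)
{
    /* Sanity check for this sketch: the two masks must not overlap. */
    assert((accessed_mask & TOY_PT_PRESENT_MASK) == 0);

    if (!accessed_mask)
        return false;               /* no accessed bit to lose */
    if (!(spte & TOY_PT_PRESENT_MASK))
        return false;               /* hardware never touches a non-present entry */
    if (spte & accessed_mask)
        return false;               /* bit already set, nothing more to lose */
    return true;
}
```

Only a present spte whose accessed bit is still clear needs the atomic path in this model.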

KVM: MMU: using kvm_set_pfn_accessed() instead of mark_page_accessed() · 251464c4

Authored by Xiao Guangrong on Aug 02, 2010

A small cleanup: use kvm_set_pfn_accessed() instead
of mark_page_accessed().
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

251464c4

KVM: MMU: remove valueless output message · 19ada5c4

Authored by Xiao Guangrong on Jul 27, 2010

After commit 53383eaad08d, '*spte' has already been updated before
rmap_remove() is called (in most cases it is 'shadow_trap_nonpresent_pte'),
so remove this information from the error message.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

19ada5c4

07 Aug, 2010 1 commit

x86, kvm: Remove cast obsoleted by set_64bit() prototype cleanup · 7645e432

Authored by H. Peter Anvin on Aug 06, 2010

KVM ended up having to put a pretty ugly wrapper around set_64bit()
in order to get the type right.  Now set_64bit() takes the expected
u64 type, and this wrapper can be cleaned up.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Avi Kivity <avi@redhat.com>
LKML-Reference: <4C5C4E7A.8040603@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7645e432

02 Aug, 2010 17 commits

KVM: MMU: using __xchg_spte more smarter · 9a3aad70

Authored by Xiao Guangrong on Jul 16, 2010

Sometimes setting an spte atomically is not needed; this patch calls
__xchg_spte() more selectively.

Note: if the old mapping's accessed bit is already set, no atomic operation
is needed, since the accessed bit cannot be lost.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

9a3aad70

KVM: MMU: cleanup spte set and accssed/dirty tracking · e4b502ea

Authored by Xiao Guangrong on Jul 16, 2010

Introduce set_spte_track_bits() to clean up the current code.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

e4b502ea

KVM: MMU: don't atomicly set spte if it's not present · be233d49

Authored by Xiao Guangrong on Jul 16, 2010

If the old mapping is not present, spte.a cannot be lost, so no atomic
operation is needed to set it.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

be233d49

KVM: MMU: fix page dirty tracking lost while sync page · 9ed5520d

Authored by Xiao Guangrong on Jul 16, 2010

In the sync-page path, if spte.writable is changed, page dirty tracking
is lost. For example:

Assume spte.writable = 0 in an unsync page. When it is synced, the spte is
mapped writable (that is, spte.writable = 1). Later the guest writes to
spte.gfn, which means spte.gfn is dirty. Then the guest changes this mapping
to read-only, and after it is synced, spte.writable = 0.

So, when the host releases the spte, it detects spte.writable = 0 and does
not mark the page dirty.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

9ed5520d

KVM: MMU: fix broken page accessed tracking with ept enabled · daa3db69

Authored by Xiao Guangrong on Jul 16, 2010

In the current code, if ept is enabled (shadow_accessed_mask = 0), page
accessed tracking is lost.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

daa3db69

KVM: MMU: add missing reserved bits check in speculative path · fa1de2bf

Authored by Xiao Guangrong on Jul 16, 2010

In the speculative path, we should check the guest pte's reserved bits just
as the real processor does.
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

fa1de2bf

KVM: MMU: fix mmu notifier invalidate handler for huge spte · 6e3e243c

Authored by Andrea Arcangeli on Jul 16, 2010

The index wasn't calculated correctly (off by one) for huge sptes, so KVM
guests were unstable with transparent hugepages.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

6e3e243c

KVM: MMU: Add validate_direct_spte() helper · a357bd22

Authored by Avi Kivity on Jul 13, 2010

Add a helper to verify that a direct shadow page is valid wrt the required
access permissions; drop the page if it is not valid.
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

a357bd22

KVM: MMU: Add drop_large_spte() helper · a3aa51cf

Authored by Avi Kivity on Jul 13, 2010

To clarify spte fetching code, move large spte handling into a helper.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

a3aa51cf

KVM: MMU: Use __set_spte to link shadow pages · 121eee97

Authored by Avi Kivity on Jul 13, 2010

To avoid split accesses to 64 bit sptes on i386, use __set_spte() to link
shadow pages together.

(not technically required since shadow pages are GFP_KERNEL, so upper 32
bits are always clear)
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

121eee97

KVM: MMU: Add link_shadow_page() helper · 32ef26a3

Authored by Avi Kivity on Jul 13, 2010

To simplify the process of fetching an spte, add a helper that links
a shadow page to an spte.
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

32ef26a3

KVM: Return EFAULT from kvm ioctl when guest accesses bad area · edba23e5

Authored by Gleb Natapov on Jul 07, 2010

Currently, if a guest accesses an address that belongs to a memory slot but
is not backed by a page, or the page is read-only, KVM treats it like an MMIO
access. Remove that capability. It was never part of the interface and should
not be relied upon.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

edba23e5

KVM: MMU: Don't drop accessed bit while updating an spte · b79b93f9

Authored by Avi Kivity on Jun 06, 2010

__set_spte() will happily replace an spte with the accessed bit set with
one that has the accessed bit clear.  Add a helper update_spte() which checks
for this condition and updates the page flag if needed.
Signed-off-by: Avi Kivity <avi@redhat.com>

b79b93f9

KVM: MMU: Atomically check for accessed bit when dropping an spte · a9221dd5

Authored by Avi Kivity on Jun 06, 2010

Currently, in the window between the check for the accessed bit, and actually
dropping the spte, a vcpu can access the page through the spte and set the bit,
which will be ignored by the mmu.

Fix by using an exchange operation to atomically fetch the spte and drop it.
Signed-off-by: Avi Kivity <avi@redhat.com>

a9221dd5
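
The idea of closing the window with an exchange can be sketched with C11 atomics. This is a user-space model under assumptions: `toy_drop_spte` and `TOY_NONPRESENT_SPTE` are hypothetical names, the accessed bit position follows the x86 page-table layout, and the kernel uses its own primitives rather than `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PT_ACCESSED_MASK (1ULL << 5)   /* x86 accessed bit */
#define TOY_NONPRESENT_SPTE  0ULL          /* hypothetical "not present" value */

/*
 * Drop an spte and report whether the old entry had been accessed.
 * The fetch and the store are one atomic step, so an accessed bit set by a
 * concurrent walker just before the drop is still visible in the old value.
 */
static bool toy_drop_spte(_Atomic uint64_t *sptep)
{
    assert(sptep != NULL);

    uint64_t old = atomic_exchange(sptep, TOY_NONPRESENT_SPTE);

    return (old & TOY_PT_ACCESSED_MASK) != 0;
}
```

A separate "read, test, then write" sequence would leave a gap between the read and the write; the exchange removes that gap because the value tested is exactly the value that was replaced.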

KVM: MMU: Move accessed/dirty bit checks from rmap_remove() to drop_spte() · ce061867

Authored by Avi Kivity on Jun 06, 2010

Since we need to make the check atomic, move it to the place that will
set the new spte.
Signed-off-by: Avi Kivity <avi@redhat.com>

ce061867

KVM: MMU: Introduce drop_spte() · be38d276

Authored by Avi Kivity on Jun 06, 2010

When we call rmap_remove(), we (almost) always immediately follow it by
an __set_spte() to a nonpresent pte.  Since we need to perform the two
operations atomically, to avoid losing the dirty and accessed bits, introduce
a helper drop_spte() and convert all call sites.

The operation is still nonatomic at this point.
Signed-off-by: Avi Kivity <avi@redhat.com>

be38d276

KVM: VMX: fix tlb flush with invalid root · dd180b3e

Authored by Xiao Guangrong on Jul 03, 2010

Commit 341d9b535b6c simplified the reload logic on guest entry; it
avoids an unnecessary sync-root if KVM_REQ_MMU_RELOAD and
KVM_REQ_MMU_SYNC are both set.

But it causes an issue: when we handle 'KVM_REQ_TLB_FLUSH', the
root may be invalid. This was triggered during my test:

Kernel BUG at ffffffffa00212b8 [verbose debug info unavailable]
......

Fixed by returning directly if the root is not ready.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

dd180b3e

01 Aug, 2010 13 commits

KVM: Remove unnecessary divide operations · 82855413

Authored by Joerg Roedel on Jul 01, 2010

This patch converts unnecessary divide and modulo operations
in the KVM large page related code into logical operations.
This allows converting gfn_t to u64 without breaking 32-bit
builds.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

82855413
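
The conversion from divide and modulo to shifts and masks can be sketched as follows. The macro and function names are hypothetical, and the example assumes 9 index bits per paging level, as on x86-64, so that the number of small pages per large page is a power of two.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_gfn_t;

/* Assumption: 9 index bits per level, so level 2 covers 512 small pages. */
#define TOY_HPAGE_GFN_SHIFT(level) (((level) - 1) * 9)
#define TOY_PAGES_PER_HPAGE(level) (1ULL << TOY_HPAGE_GFN_SHIFT(level))

/* Large-page index of gfn within a slot, using shifts instead of divisions. */
static uint64_t toy_hpage_index(toy_gfn_t gfn, toy_gfn_t base_gfn, int level)
{
    assert(level >= 1 && level <= 3);
    return (gfn >> TOY_HPAGE_GFN_SHIFT(level)) -
           (base_gfn >> TOY_HPAGE_GFN_SHIFT(level));
}

/* Alignment test using a mask instead of a modulo operation. */
static int toy_gfn_is_hpage_aligned(toy_gfn_t gfn, int level)
{
    assert(level >= 1 && level <= 3);
    return (gfn & (TOY_PAGES_PER_HPAGE(level) - 1)) == 0;
}
```

Because the divisor is a power of two, `gfn >> shift` equals `gfn / pages` and `gfn & (pages - 1)` equals `gfn % pages`. Shifts and masks on 64-bit values need no library helper, whereas a 64-bit division on a 32-bit target typically does.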

KVM: MMU: fix writable sync sp mapping · 36a2e677

Authored by Xiao Guangrong on Jun 30, 2010

When we sync many unsync sps at one time (in mmu_sync_children()),
we may map the spte writable. This is dangerous if the gfn mapped by
one unsync sp is another unsync page's gfn.

For example:

SP1.pte[0] = P
SP2.gfn's pfn = P
[SP1.pte[0] = SP2.gfn's pfn]

First, we write-protect SP1 and SP2, but SP1 and SP2 are still
unsync sps.

Then we sync SP1 first. It detects that SP1.pte[0].gfn has only one
unsync sp, namely SP2, so it maps it writable. But we plan to sync SP2 soon;
at this point SP2->unsync is not reliable, since we later sync SP2 while
SP2->gfn is already writable.

So the final result is: SP2 is a sync page but SP2.gfn is writable.

This bug corrupts the guest's page table. Fix it by marking the mapping
read-only if the mapped gfn has shadow pages.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

36a2e677

KVM: Add mini-API for vcpu->requests · a8eeb04a

Authored by Avi Kivity on May 10, 2010

Makes it a little more readable and hackable.
Signed-off-by: NAvi Kivity <avi@redhat.com>

a8eeb04a
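
A request mini-API of this kind might look like the sketch below: one helper sets a request bit, another tests and clears it in a single atomic step. This is a user-space model using C11 atomics; the `toy_*` names and the request numbers are hypothetical and are not taken from the kernel headers.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request numbers, for illustration only. */
#define TOY_REQ_TLB_FLUSH 0
#define TOY_REQ_MMU_SYNC  1

/* Toy stand-in for the vcpu structure: one bit per pending request. */
struct toy_vcpu {
    atomic_ulong requests;
};

/* Post a request; safe to call from any thread. */
static void toy_make_request(int req, struct toy_vcpu *vcpu)
{
    assert(req >= 0 && req < (int)(sizeof(unsigned long) * CHAR_BIT));
    atomic_fetch_or(&vcpu->requests, 1UL << req);
}

/* Test and clear a request in one atomic step; true if it was pending. */
static bool toy_check_request(int req, struct toy_vcpu *vcpu)
{
    assert(req >= 0 && req < (int)(sizeof(unsigned long) * CHAR_BIT));

    unsigned long bit = 1UL << req;
    unsigned long old = atomic_fetch_and(&vcpu->requests, ~bit);

    return (old & bit) != 0;
}
```

Wrapping the bit operations this way keeps callers from touching the `requests` word directly, which is the readability gain the commit message mentions.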

KVM: Remove memory alias support · a1f4d395

Authored by Avi Kivity on Jun 21, 2010

As advertised in feature-removal-schedule.txt.  Equivalent support is provided
by overlapping memory regions.
Signed-off-by: NAvi Kivity <avi@redhat.com>

a1f4d395

KVM: MMU: don't walk every parent pages while mark unsync · 1047df1f

Authored by Xiao Guangrong on Jun 11, 2010

While marking the parent's unsync_child_bitmap, if the parent is already
unsynced, there is no need to walk its parents. This avoids some
unnecessary work.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

1047df1f

KVM: MMU: clear unsync_child_bitmap completely · 7a8f1a74

Authored by Xiao Guangrong on Jun 11, 2010

In the current code, some pages' unsync_child_bitmap is not cleared
completely in mmu_sync_children(). For example, if two PDPEs share one PDT,
one of the PDPEs' unsync_child_bitmap is not cleared.

Currently this does no harm beyond a little overhead, but it is preparatory
work for a later patch.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7a8f1a74

KVM: MMU: cleanup for __mmu_unsync_walk() · ebdea638

Authored by Xiao Guangrong on Jun 11, 2010

Decrease sp->unsync_children after clearing the unsync_child_bitmap bit.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

ebdea638

KVM: MMU: don't mark pte notrap if it's just sync transient · be71e061

Authored by Xiao Guangrong on Jun 11, 2010

If the sp is only transiently synced, don't mark its pte notrap.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

be71e061

KVM: MMU: avoid double write protected in sync page path · f918b443

Authored by Xiao Guangrong on Jun 11, 2010

The sync page is already write-protected in mmu_sync_children(); don't
write-protect it again.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

f918b443

KVM: Fix mov cr3 #GP at wrong instruction · 2390218b

Authored by Avi Kivity on Jun 10, 2010

On Intel, we call skip_emulated_instruction() even if we injected a #GP,
resulting in the #GP pointing at the wrong address.

Fix by injecting the exception and skipping the instruction at the same place,
so we can do just one or the other.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

2390218b

KVM: MMU: delay local tlb flush · 3b5d1321

Authored by Xiao Guangrong on Jun 08, 2010

Delay the local tlb flush until entering guest mode. This reduces the vpid
flush frequency and the number of remote tlb flush IPIs (if the
KVM_REQ_TLB_FLUSH bit is already set, the IPI is not sent).
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

3b5d1321

KVM: MMU: use wrapper function to flush local tlb · 5304efde

Authored by Xiao Guangrong on Jun 08, 2010

Use kvm_mmu_flush_tlb() function instead of calling
kvm_x86_ops->tlb_flush(vcpu) directly.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

5304efde

KVM: MMU: remove unnecessary remote tlb flush · 4f78fd08

Authored by Xiao Guangrong on Jun 08, 2010

This remote tlb flush is not necessary, since we have already synced when
the sp was zapped.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

4f78fd08