Commits · 7edd0ce05892831c77fa4cebe24a6056d33336d5 · openanolis / cloud-kernel

15 Oct, 2008 13 commits

KVM: Consolidate PIC isr clearing into a function · 7edd0ce0

Authored by Avi Kivity on Jul 07, 2008

Signed-off-by: Avi Kivity <avi@qumranet.com>

7edd0ce0

KVM: VMX: Remove redundant check in handle_rmode_exception · 60bd83a1

Authored by Mohammed Gamal on Jul 12, 2008

Since checking for vcpu->arch.rmode.active is already done whenever we
call handle_rmode_exception(), checking it inside the function is redundant.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

60bd83a1

KVM: VMX: Move interrupt post-processing to vmx_complete_interrupts() · f7d9238f

Authored by Avi Kivity on Jul 03, 2008

Instead of looking at failed injections in the vm entry path, move
processing to the exit path in vmx_complete_interrupts(). This simplifies
the logic and removes any state that is hidden in vmx registers.
Signed-off-by: Avi Kivity <avi@qumranet.com>

f7d9238f

KVM: Add a pending interrupt queue · 937a7eae

Authored by Avi Kivity on Jul 03, 2008

Similar to the exception queue, this holds interrupts that have been
accepted by the virtual processor core but not yet injected.

Not yet used.
Signed-off-by: Avi Kivity <avi@qumranet.com>

937a7eae

KVM: VMX: Fix pending exception processing · 35920a35

Authored by Avi Kivity on Jul 03, 2008

The vmx code assumes that IDT-Vectoring can only be set when an exception
is injected due to the exception in question.  That's not true, however:
if the exception is injected correctly, and later another exception occurs
but its delivery is blocked due to a fault, then we will incorrectly assume
the first exception was not delivered.

Fix by unconditionally dequeuing the pending exception, and requeuing it
(or the second exception) if we see it in the IDT-Vectoring field.
Signed-off-by: Avi Kivity <avi@qumranet.com>

35920a35
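
The dequeue-then-requeue rule described in this commit can be sketched as follows. This is a minimal illustration, not the kernel's code: the struct and function names are hypothetical, and only the bit layout of the IDT-vectoring information field (vector in bits 7:0, type in bits 10:8, valid in bit 31) is taken from the Intel SDM.

```c
#include <stdbool.h>
#include <stdint.h>

/* IDT-vectoring information field layout, per the Intel SDM. */
#define DEMO_VECTORING_INFO_VECTOR_MASK 0x000000ffu
#define DEMO_VECTORING_INFO_TYPE_MASK   0x00000700u
#define DEMO_VECTORING_INFO_VALID       0x80000000u
#define DEMO_INTR_TYPE_HARD_EXCEPTION   (3u << 8)

/* Hypothetical one-slot exception queue; the kernel's layout differs. */
struct demo_exception_queue {
    bool pending;
    uint8_t nr;
};

/* Run on the vm exit path. */
static inline void demo_complete_exceptions(struct demo_exception_queue *q,
                                            uint32_t idt_vectoring_info)
{
    /* Unconditionally dequeue: by default assume delivery succeeded. */
    q->pending = false;

    if (!(idt_vectoring_info & DEMO_VECTORING_INFO_VALID))
        return;
    if ((idt_vectoring_info & DEMO_VECTORING_INFO_TYPE_MASK) !=
        DEMO_INTR_TYPE_HARD_EXCEPTION)
        return;

    /* Delivery was interrupted: requeue whatever the field reports,
     * which may be a later exception rather than the one queued. */
    q->pending = true;
    q->nr = (uint8_t)(idt_vectoring_info & DEMO_VECTORING_INFO_VECTOR_MASK);
}
```

With this shape, the queue after an exit reflects only what the hardware reports as undelivered, so a stale first exception is never re-injected.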

KVM: Clear exception queue before emulating an instruction · 26eef70c

Authored by Avi Kivity on Jul 03, 2008

If we're emulating an instruction, either it will succeed, in which case
any previously queued exception will be spurious, or we will requeue the
same exception.
Signed-off-by: Avi Kivity <avi@qumranet.com>

26eef70c

KVM: VMX: Move nmi injection failure processing to vm exit path · 668f612f

Authored by Avi Kivity on Jul 02, 2008

Instead of processing nmi injection failure in the vm entry path, move
it to the vm exit path (vmx_complete_interrupts()).  This separates nmi
injection from nmi post-processing, and moves the nmi state from the VT
state into vcpu state (new variable nmi_injected specifying an injection
in progress).
Signed-off-by: Avi Kivity <avi@qumranet.com>

668f612f

KVM: Move NMI IRET fault processing to new vmx_complete_interrupts() · cf393f75

Authored by Avi Kivity on Jul 01, 2008

Currently most interrupt exit processing is handled on the entry path,
which is confusing. Move the NMI IRET fault processing to a new function,
vmx_complete_interrupts(), which is called on the vmexit path.
Signed-off-by: Avi Kivity <avi@qumranet.com>

cf393f75

KVM: MMU: Simplify kvm_mmu_zap_page() · 5b5c6a5a

Authored by Avi Kivity on Jul 11, 2008

The twisty maze of conditionals can be reduced.

[joerg: fix tlb flushing]
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

5b5c6a5a

KVM: MMU: Separate the code for unlinking a shadow page from its parents · 31aa2b44

Authored by Avi Kivity on Jul 11, 2008

Place into own function, in preparation for further cleanups.
Signed-off-by: Avi Kivity <avi@qumranet.com>

31aa2b44

KVM: Introduce kvm_set_irq to inject interrupts in guests · 867767a3

Authored by Amit Shah on Jun 27, 2008

This function injects an interrupt into the guest given the kvm struct,
the (guest) irq number and the interrupt level.
Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

867767a3

KVM: x86: accessors for guest registers · 5fdbf976

Authored by Marcelo Tosatti on Jun 27, 2008

As suggested by Avi, introduce accessors to read/write guest registers.
This simplifies the ->cache_regs/->decache_regs interface, and improves
register caching, which is important for VMX, where the cost of
vmcs_read/vmcs_write is significant.

[avi: fix warnings]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

5fdbf976
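
To illustrate why accessors help register caching, here is a minimal sketch of a lazily filled register cache. All names (`demo_vcpu`, `demo_register_read`, the `regs_avail`/`regs_dirty` bitmaps, the `hw` array standing in for VMCS fields) are assumptions made for illustration and are not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register indices; not the kernel's enum. */
enum demo_reg { DEMO_RAX, DEMO_RSP, DEMO_RIP, DEMO_NR_REGS };

struct demo_vcpu {
    uint64_t regs[DEMO_NR_REGS]; /* software cache of guest registers */
    uint32_t regs_avail;         /* bit set: cached value is valid */
    uint32_t regs_dirty;         /* bit set: cache must be written back */
    uint64_t hw[DEMO_NR_REGS];   /* stands in for expensive VMCS fields */
    unsigned hw_reads;           /* counts simulated vmcs_read calls */
    unsigned hw_writes;          /* counts simulated vmcs_write calls */
};

/* Read through the cache; touch "hardware" only on the first access. */
static inline uint64_t demo_register_read(struct demo_vcpu *v, enum demo_reg r)
{
    assert(r < DEMO_NR_REGS);
    if (!(v->regs_avail & (1u << r))) {
        v->regs[r] = v->hw[r];
        v->hw_reads++;
        v->regs_avail |= 1u << r;
    }
    return v->regs[r];
}

/* Write to the cache only; the write-back is deferred. */
static inline void demo_register_write(struct demo_vcpu *v, enum demo_reg r,
                                       uint64_t val)
{
    assert(r < DEMO_NR_REGS);
    v->regs[r] = val;
    v->regs_avail |= 1u << r;
    v->regs_dirty |= 1u << r;
}

/* Before re-entering the guest: write back dirty registers, then
 * invalidate the cache because the guest may change any of them. */
static inline void demo_flush_before_entry(struct demo_vcpu *v)
{
    for (int r = 0; r < DEMO_NR_REGS; r++) {
        if (v->regs_dirty & (1u << r)) {
            v->hw[r] = v->regs[r];
            v->hw_writes++;
        }
    }
    v->regs_dirty = 0;
    v->regs_avail = 0;
}
```

Because callers go through the accessors, a register that is never touched during an exit costs no hardware access at all, which a bulk cache/decache interface cannot guarantee.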

KVM: VMX: Rename misnamed msr bits · ca60dfbb

Authored by Sheng Yang on Jun 24, 2008

MSR_IA32_FEATURE_LOCKED is in fact just a bit, not an MSR, so it shouldn't
be prefixed with MSR_.  The same goes for MSR_IA32_FEATURE_VMXON_ENABLED.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

ca60dfbb
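
The distinction this commit draws, between an MSR index and a bit inside that MSR, can be sketched as below. The `DEMO_` names are hypothetical and are not the names the patch introduces; the MSR index 0x3a and the bit positions 0 (lock) and 2 (VMXON enable outside SMX) follow the Intel SDM.

```c
#include <stdbool.h>
#include <stdint.h>

/* An MSR index: this one legitimately names an MSR. */
#define DEMO_MSR_IA32_FEATURE_CONTROL      0x3a

/* Bits within that MSR: these are not MSRs, so no MSR_ prefix. */
#define DEMO_FEATURE_CONTROL_LOCKED        (1ULL << 0)
#define DEMO_FEATURE_CONTROL_VMXON_ENABLED (1ULL << 2)

/* VMX is unusable when firmware locked the MSR without enabling VMXON. */
static inline bool demo_vmx_disabled_by_bios(uint64_t feature_control)
{
    uint64_t mask = DEMO_FEATURE_CONTROL_LOCKED |
                    DEMO_FEATURE_CONTROL_VMXON_ENABLED;

    return (feature_control & mask) == DEMO_FEATURE_CONTROL_LOCKED;
}
```

The argument is assumed to be the value already read from the feature control MSR; reading the MSR itself requires kernel privileges and is left out of the sketch.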

11 Sep, 2008 3 commits

KVM: VMX: Always return old for clear_flush_young() when using EPT · 534e38b4

Authored by Sheng Yang on Sep 08, 2008

Also discard the fake accessed and dirty bits of EPT.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

534e38b4

KVM: SVM: fix guest global tlb flushes with NPT · e5eab0ce

Authored by Joerg Roedel on Sep 09, 2008

Accesses to CR4 are intercepted even with Nested Paging enabled. But the code
does not check if the guest wants to do a global TLB flush. So this flush gets
lost. This patch adds the check and the flush to svm_set_cr4.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

e5eab0ce

KVM: SVM: fix random segfaults with NPT enabled · 44874f84

Authored by Joerg Roedel on Aug 27, 2008

This patch introduces a guest TLB flush on every NPF exit in KVM. This fixes
random segfaults and #UD exceptions in the guest seen under some workloads
(e.g. long running compile workloads or tbench). A kernbench run with and
without that fix showed that it has a slowdown lower than 0.5%.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>

44874f84

10 Sep, 2008 1 commit

x86: move VMX MSRs to msr-index.h · 315a6558

Authored by Sheng Yang on Sep 09, 2008

They are hardware specific MSRs, and we would use them in virtualization
feature detection later.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

315a6558

25 Aug, 2008 1 commit

KVM: MMU: Fix torn shadow pte · cd5998eb

Authored by Avi Kivity on Aug 22, 2008

The shadow code assigns a pte directly in one place, which is nonatomic on
i386 and can cause random memory references. Fix by using an atomic setter.
Signed-off-by: Avi Kivity <avi@qumranet.com>

cd5998eb

29 Jul, 2008 5 commits

KVM: Advertise synchronized mmu support to userspace · ed848624

Authored by Avi Kivity on Jul 29, 2008

Signed-off-by: Avi Kivity <avi@qumranet.com>

ed848624

KVM: Synchronize guest physical memory map to host virtual memory map · e930bffe

Authored by Andrea Arcangeli on Jul 25, 2008

Synchronize changes to host virtual addresses which are part of
a KVM memory slot to the KVM shadow mmu.  This allows pte operations
like swapping, page migration, and madvise() to transparently work
with KVM.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

e930bffe

KVM: Allow browsing memslots with mmu_lock · 604b38ac

Authored by Andrea Arcangeli on Jul 25, 2008

This allows reading memslots with only the mmu_lock held, for mmu
notifiers that run in atomic context and with mmu_lock held.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

604b38ac

KVM: Allow reading aliases with mmu_lock · a1708ce8

Authored by Andrea Arcangeli on Jul 25, 2008

This allows the mmu notifier code to run unalias_gfn with only the
mmu_lock held.  Only alias writes need the mmu_lock held. Readers will
either take the slots_lock in read mode or the mmu_lock.
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

a1708ce8

mmu-notifiers: core · cddb8a5c

Authored by Andrea Arcangeli on Jul 28, 2008

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte".  In the GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case for KVM, where a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present).  In the same way that
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.

Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set.  Furthermore a spte unmap
event can immediately lead to a page to be freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space.  Furthermore it spares the code that tears down the
mappings of the secondary MMU from implementing logic like tlb_gather in
zap_page_range, which would require many IPIs to flush other cpu tlbs for
each fixed number of sptes unmapped.

To give an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb-mapping on the copied page.  Or it will set up a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0.  This is just an example.

This allows mapping any page pointed to by any pte (and in turn visible in
the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or a
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence the need to
schedule in XPMEM code to send the invalidate to the remote node, while
there is no need to schedule in kvm/gru as it's an immediate event like
invalidating a primary-mmu pte).

At least for KVM without this patch it's impossible to swap guests
reliably.  And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) mm_take_all_locks() to register the mmu notifier when the whole VM
   isn't doing anything with "mm".  This allows mmu notifier users to keep
   track of whether the VM is in the middle of the
   invalidate_range_begin/end critical section with an atomic counter
   increased in range_begin and decreased in range_end.  No secondary MMU
   page fault is allowed to map any spte or secondary tlb reference while
   the VM is in the middle of range_begin/end, as any page returned by
   get_user_pages in that critical section could later immediately be
   freed without any further ->invalidate_page notification
   (invalidate_range_begin/end works on ranges and ->invalidate_page isn't
   called immediately before freeing the page).  To stop all page freeing
   and pagetable overwrites the mmap_sem must be taken in write mode and
   all other anon_vma/i_mmap locks must be taken too.

2) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled
   if CONFIG_KVM=m/y.  In the current kernel kvm won't yet take advantage
   of mmu notifiers, but this already allows compiling a KVM external
   module against a kernel with mmu notifiers enabled and from the next
   pull from kvm.git we'll start using them.  And GRU/XPMEM will also be
   able to continue the development by enabling KVM=m in their config,
   until they submit all GRU/XPMEM GPLv2 code to the mainline kernel.
   Then they can also enable MMU_NOTIFIERS in the same way KVM does it
   (even if KVM=n).  This guarantees nobody selects MMU_NOTIFIER=y if KVM
   and GRU and XPMEM are all =n.

The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR.  Because mmu_notifier_register
is used when a driver starts up, a failure can be gracefully handled.  Here
is an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver starts up other allocations are required anyway and
-ENOMEM failure paths exist already.

 struct  kvm *kvm_arch_create_vm(void)
 {
        struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+       int err;

        if (!kvm)
                return ERR_PTR(-ENOMEM);

        INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+       if (err) {
+               kfree(kvm);
+               return ERR_PTR(err);
+       }
+
        return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

The patch also adds a few needed but missing includes that would prevent
the kernel from compiling after these changes on non-x86 archs (x86 didn't
need them by luck).

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cddb8a5c
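
The atomic counter from dependency 1 of the mmu-notifiers commit above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the struct and function names are hypothetical, C11 atomics stand in for kernel primitives, and a real user would likely also need a lock or sequence count to close the window between the check and the installation of the mapping.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-secondary-MMU state. */
struct demo_secondary_mmu {
    atomic_int invalidate_count; /* > 0 while inside range_begin/end */
};

/* Called when the primary MMU starts invalidating a range. */
static inline void demo_invalidate_range_begin(struct demo_secondary_mmu *m)
{
    atomic_fetch_add(&m->invalidate_count, 1);
    /* A real user would also drop the sptes covering the range here. */
}

/* Called when the primary MMU has finished with the range. */
static inline void demo_invalidate_range_end(struct demo_secondary_mmu *m)
{
    atomic_fetch_sub(&m->invalidate_count, 1);
}

/* Secondary-MMU page fault path: refuse to install a mapping while any
 * invalidation is in flight, since a page obtained now could be freed
 * without a further notification. The caller is expected to retry. */
static inline bool demo_may_map_spte(struct demo_secondary_mmu *m)
{
    return atomic_load(&m->invalidate_count) == 0;
}
```

A counter rather than a flag is used so that overlapping invalidations nest correctly: mapping is allowed again only after the last range_end.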

27 Jul, 2008 7 commits

KVM: VMX: Fix undefined behaviour of EPT after reload kvm-intel.ko · 5fdbcb9d

Authored by Sheng Yang on Jul 16, 2008

Also move setting of the base/mask ptes to vmx_init().
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

5fdbcb9d

KVM: VMX: Fix bypass_guest_pf enabling when disable EPT in module parameter · 5ec5726a

Authored by Sheng Yang on Jul 16, 2008

Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

5ec5726a

KVM: task switch: translate guest segment limit to virt-extension byte granular field · c93cd3a5

Authored by Marcelo Tosatti on Jul 19, 2008

If 'g' is one then limit is 4kb granular.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

c93cd3a5
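
The granularity translation mentioned in this commit can be sketched as below: a 20-bit descriptor limit with the 'g' bit set counts 4 KiB pages, so the byte-granular limit is the page count shifted left by 12 with the low 12 bits filled in. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a 20-bit descriptor limit to a byte-granular limit. */
static inline uint32_t demo_byte_granular_limit(uint32_t limit20, bool g)
{
    uint32_t limit = limit20 & 0xfffffu;

    if (g)
        limit = (limit << 12) | 0xfffu; /* last byte of the last page */
    return limit;
}
```

For example, a flat 4 GiB segment has a raw limit of 0xfffff with g=1, which translates to 0xffffffff.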

KVM: Avoid instruction emulation when event delivery is pending · 577bdc49

Authored by Avi Kivity on Jul 19, 2008

When an event (such as an interrupt) is injected, and the stack is
shadowed (and therefore write protected), the guest will exit.  The
current code will see that the stack is shadowed and emulate a few
instructions, each time postponing the injection.  Eventually the
injection may succeed, but at that time the guest may be unwilling
to accept the interrupt (for example, the TPR may have changed).

This occurs every once in a while during a Windows 2008 boot.

Fix by unshadowing the fault address if the fault was due to an event
injection.
Signed-off-by: Avi Kivity <avi@qumranet.com>

577bdc49

KVM: task switch: use seg regs provided by subarch instead of reading from GDT · 34198bf8

Authored by Marcelo Tosatti on Jul 16, 2008

There is no guarantee that the old TSS descriptor in the GDT contains
the proper base address. This is the case for Windows installation's
reboot-via-triplefault.

Use guest registers instead. Also translate the address properly.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

34198bf8

KVM: task switch: segment base is linear address · 98899aa0

Authored by Marcelo Tosatti on Jul 16, 2008

The segment base is always a linear address, so translate before
accessing guest memory.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

98899aa0

KVM: SVM: allow enabling/disabling NPT by reloading only the architecture module · 5f4cb662

Authored by Joerg Roedel on Jul 14, 2008

If NPT is enabled after loading both KVM modules on AMD and it should be
disabled, both KVM modules must be reloaded. If only the architecture module is
reloaded the behavior is undefined. With this patch it is possible to disable
NPT only by reloading the kvm_amd module.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

5f4cb662

20 Jul, 2008 10 commits

KVM: MMU: Fix potential race setting upper shadow ptes on nonpae hosts · 722c05f2

Authored by Avi Kivity on Jul 13, 2008

The direct mapped shadow code (used for real mode and two dimensional paging)
sets upper-level ptes using direct assignment rather than calling
set_shadow_pte().  A nonpae host will split this into two writes, which opens
up a race if another vcpu accesses the same memory area.

Fix by calling set_shadow_pte() instead of assigning directly.

Noticed by Izik Eidus.
Signed-off-by: Avi Kivity <avi@qumranet.com>

722c05f2
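
The race described in this commit comes from a 64-bit pte being written as two 32-bit halves. The sketch below uses C11 atomics as a stand-in for the kernel's setter; the names are hypothetical and this is not the body of set_shadow_pte(). `demo_torn_value()` simply computes what a concurrent reader could observe midway through a split write that stores the low half first.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef _Atomic uint64_t demo_spte_t;

/* Store all 64 bits in one indivisible operation. */
static inline void demo_set_shadow_pte(demo_spte_t *sptep, uint64_t spte)
{
    atomic_store_explicit(sptep, spte, memory_order_relaxed);
}

static inline uint64_t demo_read_shadow_pte(demo_spte_t *sptep)
{
    return atomic_load_explicit(sptep, memory_order_relaxed);
}

/* The value visible between the two halves of a non-atomic write:
 * new low 32 bits combined with the stale high 32 bits. */
static inline uint64_t demo_torn_value(uint64_t old_spte, uint64_t new_spte)
{
    return (old_spte & 0xffffffff00000000ULL) |
           (new_spte & 0x00000000ffffffffULL);
}
```

A torn value mixes the flags of the new entry with the frame bits of the old one (or the reverse), which is how a split write can lead to references to unrelated memory.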

KVM: x86 emulator: emulate clflush · 2a7c5b8b

Authored by Glauber Costa on Jul 10, 2008

If the guest issues a clflush on an mmio address, the instruction
can trap into the hypervisor. Currently, we do not decode clflush
properly, causing the guest to hang. This patch fixes this by emulating
clflush (opcode 0f ae).
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

2a7c5b8b

KVM: MMU: improve invalid shadow root page handling · 376c53c2

Authored by Marcelo Tosatti on Jul 10, 2008

Harden kvm_mmu_zap_page() against invalid root pages that
had been shadowed from memslots that are gone.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

376c53c2

KVM: MMU: nuke shadowed pgtable pages and ptes on memslot destruction · 34d4cb8f

Authored by Marcelo Tosatti on Jul 10, 2008

Flush the shadow mmu before removing regions to avoid stale entries.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

34d4cb8f

KVM: Prefix some x86 low level function with kvm_, to avoid namespace issues · d6e88aec

Authored by Avi Kivity on Jul 10, 2008

Fixes compilation with CONFIG_VMI enabled.
Signed-off-by: Avi Kivity <avi@qumranet.com>

d6e88aec

KVM: check injected pic irq within valid pic irqs · c65bbfa1

Authored by Ben-Ami Yassour on Jul 06, 2008

Check that an injected pic irq is between 0 and 15.
Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

c65bbfa1
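
A range check of the kind this commit describes might look like the sketch below. The struct and function are hypothetical stand-ins, not the kernel's PIC code; the bound of 16 pins assumes the usual pair of cascaded 8259A controllers with 8 inputs each.

```c
#include <stdbool.h>

#define DEMO_PIC_NUM_PINS 16 /* two cascaded 8259A chips, 8 pins each */

/* Hypothetical PIC state: one level per input pin. */
struct demo_pic {
    unsigned char level[DEMO_PIC_NUM_PINS];
};

/* Returns false and leaves the state untouched for an invalid irq. */
static inline bool demo_pic_set_irq(struct demo_pic *s, int irq, int level)
{
    if (irq < 0 || irq >= DEMO_PIC_NUM_PINS)
        return false;
    s->level[irq] = level ? 1 : 0;
    return true;
}
```

Rejecting the value before indexing matters because the irq number can originate from userspace, so an unchecked index would read or write outside the array.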

KVM: x86 emulator: Fix HLT instruction · 19fdfa0d

Authored by Mohammed Gamal on Jul 06, 2008

This patch fixes an issue encountered with the HLT instruction
under FreeDOS's HIMEM XMS Driver.

The HLT instruction jumped directly to the done label and
skipped updating the EIP value, therefore causing the guest
to spin endlessly on the same instruction.

The patch changes the instruction so that it writes back
the updated EIP value.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

19fdfa0d
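
The control-flow issue behind this fix can be sketched with a toy emulator step. Everything here is hypothetical (struct, function, the single-byte decode); only the HLT opcode value 0xf4 is a fact about x86. The point is that the HLT case must reach the EIP writeback rather than leave the function early.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_OP_HLT 0xf4 /* x86 HLT opcode */

/* Hypothetical emulation context. */
struct demo_ctxt {
    uint64_t guest_eip; /* where the guest resumes */
    bool halted;
};

/* Emulate one single-byte instruction at insn[0]. */
static inline void demo_emulate_one(struct demo_ctxt *c, const uint8_t *insn)
{
    uint64_t next_eip = c->guest_eip + 1; /* single-byte opcode assumed */

    switch (insn[0]) {
    case DEMO_OP_HLT:
        c->halted = true;
        break;  /* fall out to the writeback; returning here would
                 * leave guest_eip pointing at HLT again */
    default:
        return; /* not handled by this sketch; leave EIP alone */
    }

    c->guest_eip = next_eip; /* write back the updated EIP */
}
```

If the HLT case skipped the final assignment, the guest would resume at the same address after waking and execute HLT again, matching the endless spin described above.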

KVM: Apply the kernel sigmask to vcpus blocked due to being uninitialized · ac9f6dc0

Authored by Avi Kivity on Jul 06, 2008

Signed-off-by: Avi Kivity <avi@qumranet.com>

ac9f6dc0

KVM: VMX: Add ept_sync_context in flush_tlb · 4e1096d2

Authored by Sheng Yang on Jul 06, 2008

Fix a potential issue caused by kvm_mmu_slot_remove_write_access(). The
old behavior doesn't sync the EPT TLB with modified EPT entries, which
results in inconsistent content between the EPT TLB and the EPT table.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>

4e1096d2

KVM: mmu_shrink: kvm_mmu_zap_page requires slots_lock to be held · 5a4c9288

Authored by Marcelo Tosatti on Jul 03, 2008

kvm_mmu_zap_page() needs slots lock held (rmap_remove->gfn_to_memslot,
for example).

Since kvm_lock spinlock is held in mmu_shrink(), do a non-blocking
down_read_trylock().

Untested.
Signed-off-by: Avi Kivity <avi@qumranet.com>

5a4c9288
