Commits · d79095eee26f3e2c812f1c92763d5edcb1edae60 · openanolis / cloud-kernel

02 Aug, 2012 1 commit

KVM: VMX: Fix ds/es corruption on i386 with preemption · aa67f609

Authored by Avi Kivity on Aug 01, 2012

Commit b2da15ac ("KVM: VMX: Optimize %ds, %es reload") broke i386
in the following scenario:

  vcpu_load
  ...
  vmx_save_host_state
  vmx_vcpu_run
  (ds.rpl, es.rpl cleared by hardware)

  interrupt
    push ds, es  # pushes bad ds, es
    schedule
      vmx_vcpu_put
        vmx_load_host_state
          reload ds, es (with __USER_DS)
    pop ds, es  # of other thread's stack
    iret
  # other thread runs
  interrupt
    push ds, es
    schedule  # back in vcpu thread
    pop ds, es  # now with rpl=0
    iret
  ...
  vcpu_put
  resume_userspace
  iret  # clears ds, es due to mismatched rpl

(instead of resume_userspace, we might return with SYSEXIT and then
take an exception; when the exception IRETs we end up with cleared
ds, es)

Fix by avoiding the optimization on i386 and reloading ds, es on the
lightweight exit path.
Reported-by: Chris Clayron <chris2553@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

aa67f609
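For readers less familiar with selector layout, the `rpl` mentioned in the scenario above is the low two bits of a segment selector. The following is a minimal sketch, not code from the patch: the helper names are hypothetical, and `0x7b` is assumed here as the i386 value of `__USER_DS`.

```c
#include <stdint.h>

/* Assumption: 0x7b stands in for the i386 __USER_DS selector value. */
#define SKETCH_USER_DS 0x7bu

/* Requested privilege level: bits 1:0 of a selector. */
static inline unsigned selector_rpl(uint16_t sel)
{
    return sel & 0x3u;
}

/* Table indicator: bit 2 (0 = GDT, 1 = LDT). */
static inline unsigned selector_ti(uint16_t sel)
{
    return (sel >> 2) & 0x1u;
}

/* Descriptor table index: bits 15:3. */
static inline unsigned selector_index(uint16_t sel)
{
    return sel >> 3;
}

/* Models "rpl cleared by hardware": same index and table, RPL forced to 0. */
static inline uint16_t selector_clear_rpl(uint16_t sel)
{
    return (uint16_t)(sel & ~0x3u);
}
```

If the host selector was `__USER_DS`, the value pushed by the first interrupt in the scenario would correspond to `selector_clear_rpl(SKETCH_USER_DS)`, which names the same descriptor but carries RPL 0.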

12 Jul, 2012 1 commit

KVM: VMX: Implement PCID/INVPCID for guests with EPT · ad756a16

Authored by Mao, Junjie on Jul 02, 2012

This patch handles PCID/INVPCID for guests.

Process-context identifiers (PCIDs) are a facility by which a logical processor
may cache information for multiple linear-address spaces so that the processor
may retain cached information when software switches to a different linear
address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual
Volume 3A for details.

For guests with EPT, the PCID feature is enabled and INVPCID behaves as running
natively.
For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.
Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

ad756a16

11 Jul, 2012 1 commit

KVM: VMX: export PFEC.P bit on ept · 4f5982a5

Authored by Xiao Guangrong on Jun 20, 2012

Export the present bit of the page fault error code; a later patch
will use it.
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

4f5982a5

09 Jul, 2012 8 commits

KVM: VMX: Emulate invalid guest state by default · a27685c3

Authored by Avi Kivity on Jun 12, 2012

Our emulation should be complete enough that we can emulate guests
while they are in big real mode, or in a mode transition that is not
virtualizable without unrestricted guest support.
Signed-off-by: Avi Kivity <avi@redhat.com>

a27685c3

KVM: VMX: Improve error reporting during invalid guest state emulation · de5f70e0

Authored by Avi Kivity on Jun 12, 2012

If instruction emulation fails, report it properly to userspace.
Signed-off-by: Avi Kivity <avi@redhat.com>

de5f70e0

KVM: VMX: Stop invalid guest state emulation on pending event · de87dcdd

Authored by Avi Kivity on Jun 12, 2012

Process the event, possibly injecting an interrupt, before continuing.
Signed-off-by: Avi Kivity <avi@redhat.com>

de87dcdd

KVM: VMX: Continue emulating after batch exhausted · 7c068e45

Authored by Avi Kivity on Jun 10, 2012

If we return early from an invalid guest state emulation loop, make
sure we return to it later if the guest state is still invalid.
Signed-off-by: Avi Kivity <avi@redhat.com>

7c068e45

KVM: VMX: Fix interrupt exit condition during emulation · bdea48e3

Authored by Avi Kivity on Jun 10, 2012

Checking EFLAGS.IF is incorrect as we might be in interrupt shadow.  If
that is the case, the main loop will notice that and not inject the interrupt,
causing an endless loop.

Fix by using vmx_interrupt_allowed() to check if we can inject an interrupt
instead.
Signed-off-by: Avi Kivity <avi@redhat.com>

bdea48e3

KVM: VMX: Limit iterations with emulator_invalid_guest_state · b8405c18

Authored by Avi Kivity on Jun 07, 2012

Otherwise, if the guest ends up looping, we never exit the srcu critical
section, which causes synchronize_srcu() to hang.
Signed-off-by: Avi Kivity <avi@redhat.com>

b8405c18
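The idea of capping the emulation loop can be sketched in plain C. This is only an illustration under stated assumptions: `struct sketch_vcpu`, the helpers, and the budget of 100 are hypothetical, and the log does not state the value the patch uses.

```c
#include <stdbool.h>

/* Hypothetical per-call budget; not taken from the patch. */
#define EMULATION_BUDGET 100u

struct sketch_vcpu {
    unsigned invalid_steps_left; /* emulated steps until guest state is valid */
    unsigned emulated;           /* instructions emulated so far */
};

static bool guest_state_valid(const struct sketch_vcpu *v)
{
    return v->invalid_steps_left == 0;
}

static void emulate_one(struct sketch_vcpu *v)
{
    v->invalid_steps_left--;
    v->emulated++;
}

/*
 * Emulate until the guest state is valid or the budget runs out.
 * Returns true if the state is valid; false means the caller should
 * leave its critical section and come back later.
 */
static bool emulate_invalid_state_bounded(struct sketch_vcpu *v)
{
    unsigned budget = EMULATION_BUDGET;

    while (!guest_state_valid(v) && budget-- != 0)
        emulate_one(v);

    return guest_state_valid(v);
}
```

Because each call performs at most `EMULATION_BUDGET` steps, a looping guest can no longer keep the caller inside the SRCU read-side section indefinitely.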

KVM: VMX: Relax check on unusable segment · f0495f9b

Authored by Avi Kivity on Jun 07, 2012

Some userspace (e.g. QEMU 1.1) munge the d and g bits of segment
descriptors, causing us not to recognize them as unusable segments
with emulate_invalid_guest_state=1.  Relax the check by testing for
segment not present (a non-present segment cannot be usable).
Signed-off-by: Avi Kivity <avi@redhat.com>

f0495f9b

KVM: VMX: Return correct CPL during transition to protected mode · d881e6f6

Authored by Avi Kivity on Jun 06, 2012

In protected mode, the CPL is defined as the lower two bits of CS, as set by
the last far jump. But during the transition to protected mode, there is no
last far jump, so we need to return zero (the inherited real mode CPL).

Fix by reading CPL from the cache during the transition. This isn't 100%
correct since we don't set the CPL cache on a far jump, but since protected
mode transition will always jump to a segment with RPL=0, it will always
work.
Signed-off-by: Avi Kivity <avi@redhat.com>

d881e6f6

04 Jul, 2012 1 commit

KVM: VMX: code clean for vmx_init() · 2106a548

Authored by Guo Chao on Jun 15, 2012

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

2106a548

06 Jun, 2012 1 commit

KVM: Cleanup the kvm_print functions and introduce pr_XX wrappers · a737f256

Authored by Christoffer Dall on Jun 03, 2012

Introduces a couple of print functions, which are essentially wrappers
around standard printk functions, with a KVM: prefix.

Functions introduced or modified are:
 - kvm_err(fmt, ...)
 - kvm_info(fmt, ...)
 - kvm_debug(fmt, ...)
 - kvm_pr_unimpl(fmt, ...)
 - pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...)
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

a737f256
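The wrapper pattern described in this commit can be sketched in user space. This is a minimal sketch under stated assumptions: it formats into a caller-supplied buffer with `vsnprintf` instead of calling `printk`, the `kvm_log` helper and the bracketed level tag are hypothetical, and the macro signatures differ from the kernel ones listed in the commit message.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Write "KVM: [<level>] <message>" into out.
 * Returns the number of characters the full message needs (snprintf-style),
 * or a negative value on a formatting error.
 */
static int kvm_log(char *out, size_t len, const char *level, const char *fmt, ...)
{
    va_list ap;
    int used = snprintf(out, len, "KVM: [%s] ", level);

    /* Stop early if the prefix failed or already filled the buffer. */
    if (used < 0 || (size_t)used >= len)
        return used;

    va_start(ap, fmt);
    int rest = vsnprintf(out + used, len - (size_t)used, fmt, ap);
    va_end(ap);

    return rest < 0 ? rest : used + rest;
}

/* Thin level-specific wrappers; the format string travels in __VA_ARGS__. */
#define kvm_err(out, len, ...)   kvm_log(out, len, "err", __VA_ARGS__)
#define kvm_info(out, len, ...)  kvm_log(out, len, "info", __VA_ARGS__)
#define kvm_debug(out, len, ...) kvm_log(out, len, "debug", __VA_ARGS__)
```

Keeping the prefix in one helper means every call site gets the same tag without repeating it in each format string, which is the point of the cleanup.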

05 Jun, 2012 4 commits

KVM: VMX: Fix KVM_SET_SREGS with big real mode segments · b246dd5d

Authored by Orit Wasserman on May 31, 2012

For example, a migration between Westmere and Nehalem hosts that is caught in big real mode.

The code that fixes the segments for real mode guest was moved from enter_rmode
to vmx_set_segments. enter_rmode calls vmx_set_segments for each segment.
Signed-off-by: Orit Wasserman <owasserm@rehdat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

b246dd5d

KVM: VMX: Use EPT Access bit in response to memory notifiers · 3f6d8c8a

Authored by Xudong Hao on May 22, 2012

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

3f6d8c8a

KVM: VMX: Enable EPT A/D bits if supported by turning on relevant bit in EPTP · b38f9934

Authored by Xudong Hao on May 28, 2012

Enable the EPT A/D bits in EPT paging-structure entries if the processor supports them.
Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

b38f9934

KVM: VMX: Add parameter to control A/D bits support, default is on · 83c3a331

Authored by Xudong Hao on May 28, 2012

Add a kernel parameter to control A/D bits support; it is on by default.
Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

83c3a331

17 May, 2012 2 commits

KVM: VMX: Optimize %ds, %es reload · b2da15ac

Authored by Avi Kivity on May 13, 2012

On x86_64, we can defer %ds and %es reload to the heavyweight context switch,
since nothing in the lightweight paths uses the host %ds or %es (they are
ignored by the processor). Furthermore we can avoid the load if the segments
are null, by letting the hardware load the null segments for us. This is the
expected case.

On i386, we could avoid the reload entirely, since the entry.S paths take care
of reload, except for the SYSEXIT path which leaves %ds and %es set to __USER_DS.
So we set them to the same values as well.

Saves about 70 cycles out of 1600 (around 4%; noisy measurements).
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

b2da15ac

KVM: VMX: Fix %ds/%es clobber · 512d5649

Authored by Avi Kivity on May 13, 2012

The vmx exit code unconditionally restores %ds and %es to __USER_DS. This
can override the user's values, since %ds and %es are not saved and restored
in x86_64 syscalls. In practice, this isn't dangerous since nobody uses
segment registers in long mode, least of all programs that use KVM.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

512d5649

14 May, 2012 1 commit

KVM: VMX: unlike vmcs on fail path · 5f3fbc34

Authored by Xiao Guangrong on May 14, 2012

fix:

[ 1529.577273] Call Trace:
[ 1529.577289]  [<ffffffffa060d58f>] kvm_arch_hardware_disable+0x13/0x30 [kvm]
[ 1529.577302]  [<ffffffffa05fa2d4>] hardware_disable_nolock+0x35/0x39 [kvm]
[ 1529.577311]  [<ffffffffa05fa29f>] ? cpumask_clear_cpu.constprop.31+0x13/0x13 [kvm]
[ 1529.577315]  [<ffffffff81096ba8>] on_each_cpu+0x44/0x84
[ 1529.577326]  [<ffffffffa05f98b5>] hardware_disable_all_nolock+0x34/0x36 [kvm]
[ 1529.577335]  [<ffffffffa05f98e2>] hardware_disable_all+0x2b/0x39 [kvm]
[ 1529.577349]  [<ffffffffa05fafe5>] kvm_put_kvm+0xed/0x10f [kvm]
[ 1529.577358]  [<ffffffffa05fb3d7>] kvm_vm_release+0x22/0x28 [kvm]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

5f3fbc34

19 Apr, 2012 1 commit

KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context · 2225fd56

Authored by Avi Kivity on Apr 18, 2012

kvm_set_shared_msr() may not be called in preemptible context,
but vmx_set_msr() does so:

  BUG: using smp_processor_id() in preemptible [00000000] code: qemu-kvm/22713
  caller is kvm_set_shared_msr+0x32/0xa0 [kvm]
  Pid: 22713, comm: qemu-kvm Not tainted 3.4.0-rc3+ #39
  Call Trace:
   [<ffffffff8131fa82>] debug_smp_processor_id+0xe2/0x100
   [<ffffffffa0328ae2>] kvm_set_shared_msr+0x32/0xa0 [kvm]
   [<ffffffffa03a103b>] vmx_set_msr+0x28b/0x2d0 [kvm_intel]
   ...

Making kvm_set_shared_msr() work in preemptible is cleaner, but
it's used in the fast path.  Making two variants is overkill, so
this patch just disables preemption around the call.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

2225fd56

08 Apr, 2012 1 commit

KVM: VMX: Auto-load on CPUs with VMX · e9bda3b3

Authored by Josh Triplett on Mar 20, 2012

Enable x86 feature-based autoloading for the kvm-intel module on CPUs
with X86_FEATURE_VMX.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-By: Kay Sievers <kay@vrfy.org>
Signed-off-by: Avi Kivity <avi@redhat.com>

e9bda3b3

06 Apr, 2012 1 commit

KVM: VMX: vmx_set_cr0 expects kvm->srcu locked · 7a4f5ad0

Authored by Marcelo Tosatti on Mar 27, 2012

vmx_set_cr0 is called from vcpu run context, therefore it expects
kvm->srcu to be held (for setting up the real-mode TSS).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

7a4f5ad0

08 Mar, 2012 6 commits

KVM: nVMX: Fix erroneous exception bitmap check · 95871901

Authored by Nadav Har'El on Mar 06, 2012

The code which checks whether to inject a pagefault to L1 or L2 (in
nested VMX) was wrong, incorrect in how it checked the PF_VECTOR bit.
Thanks to Dan Carpenter for spotting this.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

95871901
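This log does not include the diff, so the following is only an illustration of one way a check of the `PF_VECTOR` bit in an exception bitmap can go wrong, not necessarily the exact code that was fixed. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PF_VECTOR 14u /* page-fault exception vector on x86 */

/* Mistaken form: ANDs with the vector number (binary 1110), so it tests bits 1..3. */
static bool pf_intercepted_buggy(uint32_t exception_bitmap)
{
    return (exception_bitmap & PF_VECTOR) != 0;
}

/* Intended form: tests bit 14, the bit that corresponds to the page fault. */
static bool pf_intercepted(uint32_t exception_bitmap)
{
    return (exception_bitmap & (1u << PF_VECTOR)) != 0;
}
```

With a bitmap that has only bit 14 set, the mistaken form reports "not intercepted"; with only bit 1 set, it reports "intercepted" even though page faults are not selected.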

KVM: VMX: Fix delayed load of shared MSRs · 9ee73970

Authored by Avi Kivity on Mar 06, 2012

Shared MSRs (MSR_*STAR and related) are stored in both vmx->guest_msrs
and in the CPU registers, but vmx_set_msr() only updated memory. Prior
to 46199f33, this didn't matter, since we called vmx_load_host_state(),
which scheduled a vmx_save_host_state(), which re-synchronized the CPU
state, but now we don't, so the CPU state will not be synchronized until
the next exit to host userspace.  This mostly affects nested vmx workloads,
which play with these MSRs a lot.

Fix by loading the MSR eagerly.
Signed-off-by: Avi Kivity <avi@redhat.com>

9ee73970

KVM: x86 emulator: Fix task switch privilege checks · 7f3d35fd

Authored by Kevin Wolf on Feb 08, 2012

Currently, all task switches check privileges against the DPL of the
TSS. This is only correct for jmp/call to a TSS. If a task gate is used,
the DPL of this task gate is used for the check instead. Exceptions,
external interrupts and iret shouldn't perform any check.

[avi: kill kvm-kmod remnants]
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

7f3d35fd
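The three cases in this commit message can be sketched as a small decision function. This is a hedged illustration, not the emulator's code: the enum and function names are hypothetical, and the comparison itself (both CPL and the selector's RPL must not exceed the consulted DPL) is assumed from the usual x86 rule, since the message only says which DPL is consulted.

```c
#include <stdbool.h>

/* Hypothetical names for the ways a task switch can be triggered. */
enum task_switch_reason {
    SWITCH_JMP_CALL_TSS,     /* jmp/call directly to a TSS */
    SWITCH_TASK_GATE,        /* jmp/call/int through a task gate */
    SWITCH_EXCEPTION_OR_IRQ, /* exception or external interrupt */
    SWITCH_IRET,             /* iret with a nested task */
};

/*
 * Returns true if the switch passes the privilege check.
 * cpl: current privilege level; rpl: low two bits of the selector used.
 */
static bool task_switch_allowed(enum task_switch_reason reason,
                                unsigned cpl, unsigned rpl,
                                unsigned tss_dpl, unsigned gate_dpl)
{
    unsigned dpl;

    switch (reason) {
    case SWITCH_JMP_CALL_TSS:
        dpl = tss_dpl;   /* check against the TSS descriptor */
        break;
    case SWITCH_TASK_GATE:
        dpl = gate_dpl;  /* check against the task gate instead */
        break;
    default:
        return true;     /* exceptions, interrupts, iret: no check */
    }

    return rpl <= dpl && cpl <= dpl;
}
```

The design point is that the DPL consulted depends on how the switch was reached, which is what the pre-fix behaviour (always using the TSS DPL) got wrong.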

KVM: VMX: remove yield_on_hlt · 10166744

Authored by Raghavendra K T on Feb 07, 2012

yield_on_hlt was introduced for CPU bandwidth capping. Now it is
redundant with CFS hardlimit.

yield_on_hlt also complicates paravirtual environments that need to
trap halt, e.g. paravirtualized ticket spinlocks.
Acked-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

10166744

KVM: Allow adjust_tsc_offset to be in host or guest cycles · f1e2b260

Authored by Marcelo Tosatti on Feb 03, 2012

Redefine the API to take a parameter indicating whether an
adjustment is in host or guest cycles.
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

f1e2b260

KVM: Infrastructure for software and hardware based TSC rate scaling · cc578287

Authored by Zachary Amsden on Feb 03, 2012

This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate.  Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code.  This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.

There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed.  Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used.  The default is 250ppm, which is half the
threshold for NTP adjustment, allowing for some hardware variation.

In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.

[avi: fix 64-bit division on i386]
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

cc578287
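The tolerance rule described in this commit (scale only when the requested rate falls outside a ppm window around the host rate) can be sketched as follows. The helper names and parameter layout are illustrative assumptions, not the kernel's interface; only the 250 ppm default comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Scale khz by (1,000,000 + ppm) / 1,000,000; ppm may be negative. */
static uint32_t adjust_khz(uint32_t khz, int32_t ppm)
{
    uint64_t scaled = (uint64_t)khz * (uint64_t)(1000000 + ppm);

    return (uint32_t)(scaled / 1000000u);
}

/*
 * True if the user-requested rate is far enough from the host rate
 * that rate scaling should be used.
 */
static bool tsc_scaling_needed(uint32_t host_khz, uint32_t user_khz,
                               int32_t tolerance_ppm)
{
    uint32_t lo = adjust_khz(host_khz, -tolerance_ppm);
    uint32_t hi = adjust_khz(host_khz, tolerance_ppm);

    return user_khz < lo || user_khz > hi;
}
```

For a 2,000,000 kHz host and the 250 ppm default, the window is 1,999,500 to 2,000,500 kHz, so small measurement noise in the requested rate does not trigger scaling.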

22 Feb, 2012 1 commit

i387: Split up <asm/i387.h> into exported and internal interfaces · 1361b83a

Authored by Linus Torvalds on Feb 21, 2012

While various modules include <asm/i387.h> to get access to things we
actually *intend* for them to use, most of that header file was really
pretty low-level internal stuff that we really don't want to expose to
others.

So split the header file into two: the small exported interfaces remain
in <asm/i387.h>, while the internal definitions that are only used by
core architecture code are now in <asm/fpu-internal.h>.

The guiding principle for this was to expose functions that we export to
modules, and leave them in <asm/i387.h>, while stuff that is used by
task switching or was marked GPL-only is in <asm/fpu-internal.h>.

The fpu-internal.h file could be further split up too, especially since
arch/x86/kvm/ uses some of the remaining stuff for its module. But that
kvm usage should probably be abstracted out a bit, and at least now the
internal FPU accessor functions are much more contained. Even if it
isn't perhaps as contained as it _could_ be.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

1361b83a

19 Feb, 2012 1 commit

i387: move TS_USEDFPU flag from thread_info to task_struct · f94edacf

Authored by Linus Torvalds on Feb 17, 2012

This moves the bit that indicates whether a thread has ownership of the
FPU from the TS_USEDFPU bit in thread_info->status to a word of its own
(called 'has_fpu') in task_struct->thread.has_fpu.

This fixes two independent bugs at the same time:

 - changing 'thread_info->status' from the scheduler causes nasty
   problems for the other users of that variable, since it is defined to
   be thread-synchronous (that's what the "TS_" part of the naming was
   supposed to indicate).

   So perfectly valid code could (and did) do

	ti->status |= TS_RESTORE_SIGMASK;

   and the compiler was free to do that as separate load, or and store
   instructions.  Which can cause problems with preemption, since a task
   switch could happen in between, and change the TS_USEDFPU bit. The
   change to TS_USEDFPU would be overwritten by the final store.

   In practice, this seldom happened, though, because the 'status' field
   was seldom used more than once, so gcc would generally tend to
   generate code that used a read-modify-write instruction and thus
   happened to avoid this problem - RMW instructions are naturally low
   fat and preemption-safe.

 - On x86-32, the current_thread_info() pointer would, during interrupts
   and softirqs, point to a *copy* of the real thread_info, because
   x86-32 uses %esp to calculate the thread_info address, and thus the
   separate irq (and softirq) stacks would cause these kinds of odd
   thread_info copy aliases.

   This is normally not a problem, since interrupts aren't supposed to
   look at thread information anyway (what thread is running at
   interrupt time really isn't very well-defined), but it confused the
   heck out of irq_fpu_usable() and the code that tried to squirrel
   away the FPU state.

   (It also caused untold confusion for us poor kernel developers).

It also turns out that using 'task_struct' is actually much more natural
for most of the call sites that care about the FPU state, since they
tend to work with the task struct for other reasons anyway (ie
scheduling).  And the FPU data that we are going to save/restore is
found there too.

Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to
the %esp issue.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Reported-and-tested-by: Raphael Prevost <raphael@buro.asia>
Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f94edacf
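The first bug described in this commit is a classic lost update. The sketch below replays the interleaving deterministically in ordinary C; it is an illustration only, the flag values are placeholders rather than the kernel's definitions, and the function names are hypothetical.

```c
/* Placeholder flag values; the real definitions live in the kernel headers. */
#define TS_USEDFPU         0x0001u
#define TS_RESTORE_SIGMASK 0x0008u

/*
 * Replays "status |= TS_RESTORE_SIGMASK" compiled as separate load, or and
 * store steps, with a task switch that clears TS_USEDFPU in between.
 */
static unsigned split_update_with_preemption(unsigned status)
{
    unsigned tmp = status;      /* 1. load */
    status &= ~TS_USEDFPU;      /* task switch: scheduler clears the bit */
    tmp |= TS_RESTORE_SIGMASK;  /* 2. or, on the stale copy */
    status = tmp;               /* 3. store overwrites the scheduler's change */
    return status;
}

/*
 * Same two updates when the |= is one read-modify-write step: the task
 * switch can only happen before or after it, so nothing is lost.
 */
static unsigned rmw_update_then_preemption(unsigned status)
{
    status |= TS_RESTORE_SIGMASK; /* single indivisible step */
    status &= ~TS_USEDFPU;        /* task switch afterwards */
    return status;
}
```

Starting from `TS_USEDFPU`, the split version ends with `TS_USEDFPU` set again, which is the overwritten change the message refers to; the single-step version ends with only `TS_RESTORE_SIGMASK`.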

17 Feb, 2012 1 commit

i387: don't ever touch TS_USEDFPU directly, use helper functions · 6d59d7a9

Authored by Linus Torvalds on Feb 16, 2012

This creates three helper functions that do the TS_USEDFPU accesses, and
makes everybody that used to do it by hand use those helpers instead.

In addition, there's a couple of helper functions for the "change both
CR0.TS and TS_USEDFPU at the same time" case, and the places that do
that together have been changed to use those. That means that we have
fewer random places that open-code this situation.

The intent is partly to clarify the code without actually changing any
semantics yet (since we clearly still have some hard to reproduce bug in
this area), but also to make it much easier to use another approach
entirely to caching the CR0.TS bit for software accesses.

Right now we use a bit in the thread-info 'status' variable (this patch
does not change that), but we might want to make it a full field of its
own or even make it a per-cpu variable.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6d59d7a9

13 Jan, 2012 1 commit

module_param: make bool parameters really bool (arch) · 476bc001

Authored by Rusty Russell on Jan 13, 2012

module_param(bool) used to counter-intuitively take an int.  In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.

It's time to remove the int/unsigned int option.  For this version
it'll simply give a warning, but it'll break next kernel version.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

476bc001

27 Dec, 2011 6 commits

KVM: VMX: Intercept RDPMC · fee84b07

Authored by Avi Kivity on Nov 10, 2011

Intercept RDPMC and forward it to the PMU emulation code.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

fee84b07

KVM: Move cpuid code to new file · 00b27a3e

Authored by Avi Kivity on Nov 23, 2011

The cpuid code has grown; put it into a separate file.
Signed-off-by: Avi Kivity <avi@redhat.com>

00b27a3e

KVM: introduce id_to_memslot function · 28a37544

Authored by Xiao Guangrong on Nov 24, 2011

Introduce id_to_memslot to get memslot by slot id
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

28a37544

KVM: VMX: remove unneeded vmx_load_host_state() calls. · 46199f33

Authored by Gleb Natapov on Nov 17, 2011

vmx_load_host_state() does not handle MSR switching (except
MSR_KERNEL_GS_BASE) since commit 26bb0981. Remove calls to it
where they no longer make sense.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

46199f33

KVM: nVMX: Fix warning-causing idt-vectoring-info behavior · 51cfe38e

Authored by Nadav Har'El on Sep 22, 2011

When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".

Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so
it returns the original interrupt vector in idt_vectoring_info. The problem
is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
like we wished to, because the VMX spec guarantees that idt_vectoring_info
and exit_reason_external_interrupt can never happen together. This is not
just specified in the spec - a KVM L1 actually prints a kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.

In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: First, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
Most of this is already in the code - the only change we need is to remain
in L2 (and not exit to L1) in this case.

Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that
although we do enter L2 first, it will exit immediately after processing its
injection, allowing us to promptly inject to L1.

Note how we test vmcs12->idt_vectoring_info_field; This isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.

Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
to Abel Gordon for helping me figure out the solution, and to Avi Kivity
for helping to improve it.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

51cfe38e

KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT · d6185f20

Authored by Nadav Har'El on Sep 22, 2011

This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.

We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.

Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.

On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

d6185f20

17 Nov, 2011 1 commit

KVM: VMX: Check for automatic switch msr table overflow · e7fc6f93

Authored by Gleb Natapov on Oct 05, 2011

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

e7fc6f93

openanolis / cloud-kernel synced successfully 11 months ago
