1. 20 5月, 2015 4 次提交
    • X
      KVM: MMU: fix SMAP virtualization · edc90b7d
      Xiao Guangrong 提交于
      KVM may turn a user page to a kernel page when kernel writes a readonly
      user page if CR0.WP = 1. This shadow page entry will be reused after
      SMAP is enabled so that kernel is allowed to access this user page
      
      Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
      once CR4.SMAP is updated
      Signed-off-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      edc90b7d
    • N
      KVM: x86: Fix zero iterations REP-string · 428e3d08
      Nadav Amit 提交于
      When a REP-string is executed in 64-bit mode with an address-size prefix,
      ECX/EDI/ESI are used as counter and pointers. When ECX is initially zero, Intel
      CPUs clear the high 32-bits of RCX, and recent Intel CPUs update the high bits
      of the pointers in MOVS/STOS. This behavior is specific to Intel according to
      few experiments.
      
      As one may guess, this is an undocumented behavior. Yet, it is observable in
      the guest, since at least VMX traps REP-INS/OUTS even when ECX=0. Note that
      VMware appears to get it right.  The behavior can be observed using the
      following code:
      
       #include <stdio.h>
      
       #define LOW_MASK	(0xffffffff00000000ull)
       #define ALL_MASK	(0xffffffffffffffffull)
       #define TEST(opcode)							\
      	do {								\
      	asm volatile(".byte 0xf2 \n\t .byte 0x67 \n\t .byte " opcode "\n\t" \
      			: "=S"(s), "=c"(c), "=D"(d) 			\
      			: "S"(ALL_MASK), "c"(LOW_MASK), "D"(ALL_MASK));	\
      	printf("opcode %s rcx=%llx rsi=%llx rdi=%llx\n",		\
      		opcode, c, s, d);					\
      	} while(0)
      
      void main()
      {
      	unsigned long long s, d, c;
      	iopl(3);
      	TEST("0x6c");
      	TEST("0x6d");
      	TEST("0x6e");
      	TEST("0x6f");
      	TEST("0xa4");
      	TEST("0xa5");
      	TEST("0xa6");
      	TEST("0xa7");
      	TEST("0xaa");
      	TEST("0xab");
      	TEST("0xae");
      	TEST("0xaf");
      }
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      428e3d08
    • N
      KVM: x86: Fix update RCX/RDI/RSI on REP-string · ee122a71
      Nadav Amit 提交于
      When REP-string instruction is preceded with an address-size prefix,
      ECX/EDI/ESI are used as the operation counter and pointers.  When they are
      updated, the high 32-bits of RCX/RDI/RSI are cleared, similarly to the way they
      are updated on every 32-bit register operation.  Fix it.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ee122a71
    • N
      KVM: x86: Fix DR7 mask on task-switch while debugging · 3db176d5
      Nadav Amit 提交于
      If the host sets hardware breakpoints to debug the guest, and a task-switch
      occurs in the guest, the architectural DR7 will not be updated. The effective
      DR7 would be updated instead.
      
      This fix puts the DR7 update during task-switch emulation, so it now uses the
      standard DR setting mechanism instead of the one that was previously used. As a
      bonus, the update of DR7 will now be effective for AMD as well.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3db176d5
  2. 08 5月, 2015 4 次提交
  3. 07 5月, 2015 10 次提交
  4. 27 4月, 2015 1 次提交
    • R
      kvm: x86: fix kvmclock update protocol · 5dca0d91
      Radim Krčmář 提交于
      The kvmclock spec says that the host will increment a version field to
      an odd number, then update stuff, then increment it to an even number.
      The host is buggy and doesn't do this, and the result is observable
      when one vcpu reads another vcpu's kvmclock data.
      
      There's no good way for a guest kernel to keep its vdso from reading
      a different vcpu's kvmclock data, but we don't need to care about
      changing VCPUs as long as we read a consistent data from kvmclock.
      (VCPU can change outside of this loop too, so it doesn't matter if we
      return a value not fit for this VCPU.)
      
      Based on a patch by Radim Krčmář.
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Acked-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5dca0d91
  5. 22 4月, 2015 1 次提交
    • B
      KVM: VMX: Preserve host CR4.MCE value while in guest mode. · 085e68ee
      Ben Serebrin 提交于
      The host's decision to enable machine check exceptions should remain
      in force during non-root mode.  KVM was writing 0 to cr4 on VCPU reset
      and passed a slightly-modified 0 to the vmcs.guest_cr4 value.
      
      Tested: Built.
      On earlier version, tested by injecting machine check
      while a guest is spinning.
      
      Before the change, if guest CR4.MCE==0, then the machine check is
      escalated to Catastrophic Error (CATERR) and the machine dies.
      If guest CR4.MCE==1, then the machine check causes VMEXIT and is
      handled normally by host Linux. After the change, injecting a machine
      check causes normal Linux machine check handling.
      Signed-off-by: NBen Serebrin <serebrin@google.com>
      Reviewed-by: NVenkatesh Srinivas <venkateshs@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      085e68ee
  6. 16 4月, 2015 1 次提交
  7. 15 4月, 2015 4 次提交
  8. 08 4月, 2015 15 次提交
    • W
      kvm: mmu: lazy collapse small sptes into large sptes · 3ea3b7fa
      Wanpeng Li 提交于
      Dirty logging tracks sptes in 4k granularity, meaning that large sptes
      have to be split.  If live migration is successful, the guest in the
      source machine will be destroyed and large sptes will be created in the
      destination. However, the guest continues to run in the source machine
      (for example if live migration fails), small sptes will remain around
      and cause bad performance.
      
      This patch introduce lazy collapsing of small sptes into large sptes.
      The rmap will be scanned in ioctl context when dirty logging is stopped,
      dropping those sptes which can be collapsed into a single large-page spte.
      Later page faults will create the large-page sptes.
      Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Message-Id: <1428046825-6905-1-git-send-email-wanpeng.li@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3ea3b7fa
    • N
      KVM: x86: Clear CR2 on VCPU reset · 1119022c
      Nadav Amit 提交于
      CR2 is not cleared as it should after reset.  See Intel SDM table named "IA-32
      Processor States Following Power-up, Reset, or INIT".
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1427933438-12782-5-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1119022c
    • N
      KVM: x86: DR0-DR3 are not clear on reset · ae561ede
      Nadav Amit 提交于
      DR0-DR3 are not cleared as they should during reset and when they are set from
      userspace.  It appears to be caused by c77fb5fe ("KVM: x86: Allow the guest
      to run with dirty debug registers").
      
      Force their reload on these situations.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1427933438-12782-4-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ae561ede
    • N
      KVM: x86: BSP in MSR_IA32_APICBASE is writable · 58d269d8
      Nadav Amit 提交于
      After reset, the CPU can change the BSP, which will be used upon INIT.  Reset
      should return the BSP which QEMU asked for, and therefore handled accordingly.
      
      To quote: "If the MP protocol has completed and a BSP is chosen, subsequent
      INITs (either to a specific processor or system wide) do not cause the MP
      protocol to be repeated."
      [Intel SDM 8.4.2: MP Initialization Protocol Requirements and Restrictions]
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1427933438-12782-3-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      58d269d8
    • R
      KVM: x86: simplify kvm_apic_map · 3b5a5ffa
      Radim Krčmář 提交于
      recalculate_apic_map() uses two passes over all VCPUs.  This is a relic
      from time when we selected a global mode in the first pass and set up
      the optimized table in the second pass (to have a consistent mode).
      
      Recent changes made mixed mode unoptimized and we can do it in one pass.
      Format of logical MDA is a function of the mode, so we encode it in
      apic_logical_id() and drop obsoleted variables from the struct.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-5-git-send-email-rkrcmar@redhat.com>
      [Add lid_bits temporary in apic_logical_id. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3b5a5ffa
    • R
      KVM: x86: avoid logical_map when it is invalid · 3548a259
      Radim Krčmář 提交于
      We want to support mixed modes and the easiest solution is to avoid
      optimizing those weird and unlikely scenarios.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-4-git-send-email-rkrcmar@redhat.com>
      [Add comment above KVM_APIC_MODE_* defines. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3548a259
    • R
      KVM: x86: fix mixed APIC mode broadcast · 9ea369b0
      Radim Krčmář 提交于
      Broadcast allowed only one global APIC mode, but mixed modes are
      theoretically possible.  x2APIC IPI doesn't mean 0xff as broadcast,
      the rest does.
      
      x2APIC broadcasts are accepted by xAPIC.  If we take SDM to be logical,
      even addreses beginning with 0xff should be accepted, but real hardware
      disagrees.  This patch aims for simple code by considering most of real
      behavior as undefined.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-3-git-send-email-rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9ea369b0
    • R
      KVM: x86: use MDA for interrupt matching · 03d2249e
      Radim Krčmář 提交于
      In mixed modes, we musn't deliver xAPIC IPIs like x2APIC and vice versa.
      Instead of preserving the information in apic_send_ipi(), we regain it
      by converting all destinations into correct MDA in the slow path.
      This allows easier reasoning about subsequent matching.
      
      Our kvm_apic_broadcast() had an interesting design decision: it didn't
      consider IOxAPIC 0xff as broadcast in x2APIC mode ...
      everything worked because IOxAPIC can't set that in physical mode and
      logical mode considered it as a message for first 8 VCPUs.
      This patch interprets IOxAPIC 0xff as x2APIC broadcast.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-2-git-send-email-rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      03d2249e
    • E
      KVM: nVMX: remove unnecessary double caching of MAXPHYADDR · 92d71bc6
      Eugene Korenevsky 提交于
      After speed-up of cpuid_maxphyaddr() it can be called frequently:
      instead of heavyweight enumeration of CPUID entries it returns a cached
      pre-computed value. It is also inlined now. So caching its result became
      unnecessary and can be removed.
      Signed-off-by: NEugene Korenevsky <ekorenevsky@gmail.com>
      Message-Id: <20150329205644.GA1258@gnote>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      92d71bc6
    • E
      KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry · 9090422f
      Eugene Korenevsky 提交于
      On each VM-entry CPU should check the following VMCS fields for zero bits
      beyond physical address width:
      -  APIC-access address
      -  virtual-APIC address
      -  posted-interrupt descriptor address
      This patch adds these checks required by Intel SDM.
      Signed-off-by: NEugene Korenevsky <ekorenevsky@gmail.com>
      Message-Id: <20150329205627.GA1244@gnote>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9090422f
    • E
      KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu · 5a4f55cd
      Eugene Korenevsky 提交于
      cpuid_maxphyaddr(), which performs lot of memory accesses is called
      extensively across KVM, especially in nVMX code.
      
      This patch adds a cached value of maxphyaddr to vcpu.arch to reduce the
      pressure onto CPU cache and simplify the code of cpuid_maxphyaddr()
      callers. The cached value is initialized in kvm_arch_vcpu_init() and
      reloaded every time CPUID is updated by usermode. It is obvious that
      these reloads occur infrequently.
      Signed-off-by: NEugene Korenevsky <ekorenevsky@gmail.com>
      Message-Id: <20150329205612.GA1223@gnote>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5a4f55cd
    • R
      KVM: vmx: pass error code with internal error #2 · 80f0e95d
      Radim Krčmář 提交于
      Exposing the on-stack error code with internal error is cheap and
      potentially useful.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1428001865-32280-1-git-send-email-rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      80f0e95d
    • P
      KVM: x86: optimize delivery of TSC deadline timer interrupt · 9c8fd1ba
      Paolo Bonzini 提交于
      The newly-added tracepoint shows the following results on
      the tscdeadline_latency test:
      
              qemu-kvm-8387  [002]  6425.558974: kvm_vcpu_wakeup:      poll time 10407 ns
              qemu-kvm-8387  [002]  6425.558984: kvm_vcpu_wakeup:      poll time 0 ns
              qemu-kvm-8387  [002]  6425.561242: kvm_vcpu_wakeup:      poll time 10477 ns
              qemu-kvm-8387  [002]  6425.561251: kvm_vcpu_wakeup:      poll time 0 ns
      
      and so on.  This is because we need to go through kvm_vcpu_block again
      after the timer IRQ is injected.  Avoid it by polling once before
      entering kvm_vcpu_block.
      
      On my machine (Xeon E5 Sandy Bridge) this removes about 500 cycles (7%)
      from the latency of the TSC deadline timer.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9c8fd1ba
    • P
      KVM: x86: extract blocking logic from __vcpu_run · 362c698f
      Paolo Bonzini 提交于
      Rename the old __vcpu_run to vcpu_run, and extract part of it to a new
      function vcpu_block.
      
      The next patch will add a new condition in vcpu_block, avoid extra
      indentation.
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      362c698f
    • W
      kvm: x86: fix x86 eflags fixed bit · 35fd68a3
      Wanpeng Li 提交于
      Guest can't be booted w/ ept=0, there is a message dumped as below:
      
      If you're running a guest on an Intel machine without unrestricted mode
      support, the failure can be most likely due to the guest entering an invalid
      state for Intel VT. For example, the guest maybe running in big real mode
      which is not supported on less recent Intel processors.
      
      EAX=00000011 EBX=f000d2f6 ECX=00006cac EDX=000f8956
      ESI=bffbdf62 EDI=00000000 EBP=00006c68 ESP=00006c68
      EIP=0000d187 EFL=00000004 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
      ES =e000 000e0000 ffffffff 00809300 DPL=0 DS16 [-WA]
      CS =f000 000f0000 ffffffff 00809b00 DPL=0 CS16 [-RA]
      SS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
      DS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
      FS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
      GS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
      LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
      TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
      GDT=     000f6a80 00000037
      IDT=     000f6abe 00000000
      CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
      DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
      DR6=00000000ffff0ff0 DR7=0000000000000400
      EFER=0000000000000000
      Code=01 1e b8 6a 2e 0f 01 16 74 6a 0f 20 c0 66 83 c8 01 0f 22 c0 <66> ea 8f d1 0f 00 08 00 b8 10 00 00 00 8e d8 8e c0 8e d0 8e e0 8e e8 89 c8 ff e2 89 c1 b8X
      
      X86 eflags bit 1 is fixed set, which means that 1 << 1 is set instead of 1,
      this patch fix it.
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Message-Id: <1428473294-6633-1-git-send-email-wanpeng.li@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      35fd68a3