1. 11 May, 2015 · 1 commit
  2. 08 April, 2015 · 6 commits
    • kvm: mmu: lazy collapse small sptes into large sptes · 3ea3b7fa
      Committed by Wanpeng Li
      Dirty logging tracks sptes in 4k granularity, meaning that large sptes
      have to be split.  If live migration is successful, the guest in the
      source machine will be destroyed and large sptes will be created in the
      destination. However, if the guest continues to run on the source machine
      (for example if live migration fails), small sptes will remain around
      and cause bad performance.
      
      This patch introduces lazy collapsing of small sptes into large sptes.
      The rmap will be scanned in ioctl context when dirty logging is stopped,
      dropping those sptes which can be collapsed into a single large-page spte.
      Later page faults will create the large-page sptes.
      Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Message-Id: <1428046825-6905-1-git-send-email-wanpeng.li@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: DR0-DR3 are not clear on reset · ae561ede
      Committed by Nadav Amit
      DR0-DR3 are not cleared as they should be during reset and when they are set
      from userspace.  It appears to be caused by c77fb5fe ("KVM: x86: Allow the
      guest to run with dirty debug registers").
      
      Force their reload in these situations.
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1427933438-12782-4-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: simplify kvm_apic_map · 3b5a5ffa
      Committed by Radim Krčmář
      recalculate_apic_map() uses two passes over all VCPUs.  This is a relic
      from the time when we selected a global mode in the first pass and set up
      the optimized table in the second pass (to have a consistent mode).
      
      Recent changes made mixed mode unoptimized and we can do it in one pass.
      The format of the logical MDA is a function of the mode, so we encode it in
      apic_logical_id() and drop obsolete variables from the struct.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-5-git-send-email-rkrcmar@redhat.com>
      [Add lid_bits temporary in apic_logical_id. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: avoid logical_map when it is invalid · 3548a259
      Committed by Radim Krčmář
      We want to support mixed modes and the easiest solution is to avoid
      optimizing those weird and unlikely scenarios.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-4-git-send-email-rkrcmar@redhat.com>
      [Add comment above KVM_APIC_MODE_* defines. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: fix mixed APIC mode broadcast · 9ea369b0
      Committed by Radim Krčmář
      Broadcast allowed only one global APIC mode, but mixed modes are
      theoretically possible.  An x2APIC IPI doesn't treat 0xff as broadcast;
      the rest do.
      
      x2APIC broadcasts are accepted by xAPIC.  If we take the SDM to be logical,
      even addresses beginning with 0xff should be accepted, but real hardware
      disagrees.  This patch aims for simple code by considering most of the real
      behavior as undefined.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-3-git-send-email-rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu · 5a4f55cd
      Committed by Eugene Korenevsky
      cpuid_maxphyaddr(), which performs a lot of memory accesses, is called
      extensively across KVM, especially in nVMX code.
      
      This patch adds a cached value of maxphyaddr to vcpu.arch to reduce the
      pressure on the CPU cache and simplify the code of cpuid_maxphyaddr()
      callers. The cached value is initialized in kvm_arch_vcpu_init() and
      reloaded every time CPUID is updated by usermode. These reloads occur
      infrequently.
      Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
      Message-Id: <20150329205612.GA1223@gnote>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
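The caching pattern this commit describes can be sketched in plain C. This is only an illustration, not the KVM code: `cpuid_entry_sketch`, `vcpu_sketch`, `query_maxphyaddr`, `update_cpuid` and `cached_maxphyaddr` are hypothetical names, and the fallback width of 36 bits when leaf 0x80000008 is missing is an assumption made for the example. The `lookups` counter exists only to show that the slow scan runs once per CPUID update.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for a CPUID entry and per-vCPU state. */
struct cpuid_entry_sketch {
    uint32_t function;  /* CPUID leaf number */
    uint32_t eax;       /* EAX value reported for that leaf */
};

struct vcpu_sketch {
    const struct cpuid_entry_sketch *entries;
    size_t nent;
    int maxphyaddr;         /* cached physical address width */
    unsigned long lookups;  /* number of slow-path scans, for illustration */
};

/* Slow path: scan the table for leaf 0x80000008; EAX[7:0] holds the width. */
static int query_maxphyaddr(struct vcpu_sketch *v)
{
    v->lookups++;
    for (size_t i = 0; i < v->nent; i++)
        if (v->entries[i].function == 0x80000008u)
            return (int)(v->entries[i].eax & 0xff);
    return 36; /* assumed fallback when the leaf is absent */
}

/* Called when userspace installs a new CPUID table: refresh the cache. */
static void update_cpuid(struct vcpu_sketch *v,
                         const struct cpuid_entry_sketch *entries, size_t nent)
{
    v->entries = entries;
    v->nent = nent;
    v->maxphyaddr = query_maxphyaddr(v);
}

/* Fast path used by callers: a single field read, no table scan. */
static int cached_maxphyaddr(const struct vcpu_sketch *v)
{
    return v->maxphyaddr;
}
```

Callers that previously paid for a table scan on every query would read the cached field instead; the scan cost moves to the rare CPUID update.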
  3. 30 March, 2015 · 2 commits
  4. 11 March, 2015 · 1 commit
  5. 06 February, 2015 · 1 commit
    • kvm: add halt_poll_ns module parameter · f7819512
      Committed by Paolo Bonzini
      This patch introduces a new module parameter for the KVM module; when it
      is present, KVM attempts a bit of polling on every HLT before scheduling
      itself out via kvm_vcpu_block.
      
      This parameter helps a lot for latency-bound workloads---in particular
      I tested it with O_DSYNC writes with a battery-backed disk in the host.
      In this case, writes are fast (because the data doesn't have to go all
      the way to the platters) but they cannot be merged by either the host or
      the guest.  KVM's performance here is usually around 30% of bare metal,
      or 50% if you use cache=directsync or cache=writethrough (these
      parameters keep the guest from sending pointless flush requests, and
      at the same time they are not slow because of the battery-backed cache).
      The bad performance happens because on every halt the host CPU decides
      to halt itself too.  When the interrupt comes, the vCPU thread is then
      migrated to a new physical CPU, and in general the latency is horrible
      because the vCPU thread has to be scheduled back in.
      
      With this patch performance reaches 60-65% of bare metal and, more
      important, 99% of what you get if you use idle=poll in the guest.  This
      means that the tunable gets rid of this particular bottleneck, and more
      work can be done to improve performance in the kernel or QEMU.
      
      Of course there is some price to pay; every time an otherwise idle vCPU
      is interrupted, it will poll unnecessarily and thus
      impose a little load on the host.  The above results were obtained with
      a mostly random value of the parameter (500000), and the load was around
      1.5-2.5% CPU usage on one of the host's cores for each idle guest vCPU.
      
      The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
      that can be used to tune the parameter.  It counts how many HLT
      instructions received an interrupt during the polling period; each
      successful poll keeps Linux from scheduling the VCPU thread out and back
      in, and may also avoid a likely trip to C1 and back for the physical CPU.
      
      While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
      Of these halts, almost all are failed polls.  During the benchmark,
      instead, basically all halts end within the polling period, except a more
      or less constant stream of 50 per second coming from vCPUs that are not
      running the benchmark.  The wasted time is thus very low.  Things may
      be slightly different for Windows VMs, which have a ~10 ms timer tick.
      
      The effect is also visible on Marcelo's recently-introduced latency
      test for the TSC deadline timer.  Though of course a non-RT kernel has
      awful latency bounds, the latency of the timer is around 8000-10000 clock
      cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
      deadline timer, thus, the effect is both a smaller average latency and
      a smaller variance.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
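The poll-before-block idea from this commit can be sketched outside the kernel as follows. This is a simplified illustration under stated assumptions, not the patch itself: `halt_poll`, `now_ns_fn`, `pending_fn` and the simulated clock `struct sim` are hypothetical, the clock and the wakeup check are passed in as callbacks so the sketch stays deterministic, and kernel-side conditions (such as whether other tasks are runnable on the CPU) are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callbacks: a monotonic clock and a "wakeup pending?" check. */
typedef uint64_t (*now_ns_fn)(void *ctx);
typedef bool (*pending_fn)(void *ctx);

enum halt_result { HALT_POLL_SUCCESS, HALT_MUST_BLOCK };

/*
 * On HLT, spin for at most halt_poll_ns nanoseconds waiting for a wakeup.
 * A successful poll avoids scheduling the thread out and back in.
 * With halt_poll_ns == 0 polling is disabled and the caller blocks at once.
 */
static enum halt_result halt_poll(uint64_t halt_poll_ns, now_ns_fn now,
                                  pending_fn pending, void *ctx)
{
    if (halt_poll_ns) {
        uint64_t start = now(ctx);
        do {
            if (pending(ctx))
                return HALT_POLL_SUCCESS;
        } while (now(ctx) - start < halt_poll_ns);
    }
    return HALT_MUST_BLOCK;
}

/* Simulated environment: time advances by step_ns on every clock read. */
struct sim {
    uint64_t t_ns;       /* current simulated time */
    uint64_t step_ns;    /* advance per clock read */
    uint64_t irq_at_ns;  /* time at which the interrupt arrives */
};

static uint64_t sim_now(void *ctx)
{
    struct sim *s = ctx;
    s->t_ns += s->step_ns;
    return s->t_ns;
}

static bool sim_pending(void *ctx)
{
    const struct sim *s = ctx;
    return s->t_ns >= s->irq_at_ns;
}
```

With the 500000 ns value quoted in the message, an interrupt arriving after 200 µs is caught by the poll, while one arriving after 2 ms falls through to the blocking path once the polling window has been spent.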
  6. 05 February, 2015 · 1 commit
  7. 03 February, 2015 · 1 commit
  8. 29 January, 2015 · 3 commits
  9. 21 January, 2015 · 2 commits
    • kvm: Fix CR3_PCID_INVD type on 32-bit · cfaa790a
      Committed by Borislav Petkov
      arch/x86/kvm/emulate.c: In function ‘check_cr_write’:
      arch/x86/kvm/emulate.c:3552:4: warning: left shift count >= width of type
          rsvd = CR3_L_MODE_RESERVED_BITS & ~CR3_PCID_INVD;
      
      happens because unsigned long (the UL suffix) is 4 bytes on 32-bit but we
      shift it 63 bits to the left.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
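The warning goes away once the bit-63 flag is built from a type that is 64 bits wide on every target. The sketch below shows that idea with `<stdint.h>` constants; the macro and function names are hypothetical, and the reserved-bits mask value is an assumption chosen only to make the arithmetic concrete, so it should not be read as the kernel's definition.

```c
#include <stdint.h>

/*
 * Hypothetical names. UINT64_C(1) is 64 bits wide on both 32-bit and 64-bit
 * targets, so shifting it by 63 is well defined, whereas 1UL << 63 shifts
 * past the width of a 32-bit unsigned long.
 */
#define CR3_PCID_INVD_SKETCH       (UINT64_C(1) << 63)

/* Assumed illustrative mask of reserved high CR3 bits. */
#define CR3_RESERVED_BITS_SKETCH   UINT64_C(0xFFFFFF0000000000)

/* Reserved bits with the invalidate flag carved out, as in the warned line. */
static uint64_t reserved_without_invd(void)
{
    return CR3_RESERVED_BITS_SKETCH & ~CR3_PCID_INVD_SKETCH;
}
```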
    • KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue · 54750f2c
      Committed by Marcelo Tosatti
      SuSE's 2.6.16 kernel fails to boot if the delta between tsc_timestamp
      and rdtsc is larger than a given threshold:
      
      /*
       * If we get more than the below threshold into the future, we rerequest
       * the real time from the host again which has only little offset then
       * that we need to adjust using the TSC.
       *
       * For now that threshold is 1/5th of a jiffie. That should be good
       * enough accuracy for completely broken systems, but also give us swing
       * to not call out to the host all the time.
       */
      #define PVCLOCK_DELTA_MAX ((1000000000ULL / HZ) / 5)
      
      Disable masterclock support (which increases said delta) in case the
      boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.
      
      Upstream kernels which support pvclock vsyscalls (and therefore make
      use of PVCLOCK_STABLE_BIT) use MSR_KVM_SYSTEM_TIME_NEW.
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
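To get a feel for how small the threshold in the quoted macro is, the same formula can be evaluated for a few tick rates. The function below is a sketch that takes HZ as a parameter instead of a compile-time constant; `pvclock_delta_max_ns` is a hypothetical name and the HZ values are examples only, since the commit message does not say which HZ the affected guest kernel uses.

```c
#include <stdint.h>

/*
 * Same arithmetic as PVCLOCK_DELTA_MAX: one fifth of a jiffy, in nanoseconds.
 * hz is a parameter here (assumed nonzero); in the guest it is the HZ constant.
 */
static uint64_t pvclock_delta_max_ns(uint64_t hz)
{
    return (UINT64_C(1000000000) / hz) / 5;
}
```

For HZ values of 100, 250 and 1000 this gives 2,000,000 ns, 800,000 ns and 200,000 ns respectively, so a delta of a couple of milliseconds is already enough to exceed the threshold.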
  10. 16 January, 2015 · 1 commit
  11. 09 January, 2015 · 2 commits
  12. 18 December, 2014 · 1 commit
  13. 05 December, 2014 · 2 commits
  14. 24 November, 2014 · 3 commits
  15. 22 November, 2014 · 1 commit
  16. 14 November, 2014 · 1 commit
  17. 07 November, 2014 · 2 commits
  18. 03 November, 2014 · 2 commits
    • KVM: vmx: Unavailable DR4/5 is checked before CPL · 16f8a6f9
      Committed by Nadav Amit
      If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD
      should be generated even if CPL>0. This is according to Intel SDM Table 6-2:
      "Priority Among Simultaneous Exceptions and Interrupts".
      
      Note that this may happen on the first DR access, even if the host does not
      set debug breakpoints. Obviously, it occurs when the host debugs the guest.
      
      This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr.
      The emulator already checks DR4/5 availability in check_dr_read. Nested
      virtualization related calls to kvm_set_dr/kvm_get_dr would not want to inject
      exceptions into the guest.
      
      As for SVM, the patch follows the previous logic as much as possible. Anyhow,
      it appears the DR interception code might be buggy - even if the DR access
      may cause an exception, the instruction is skipped.
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
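The ordering this commit enforces can be written down as a small decision function: the DR4/5 availability check (#UD) comes first, and only then the privilege check (#GP). This is a sketch of the priority rule as the message states it, not the handle_dr code; `dr_access_fault` and the enum are hypothetical names.

```c
#include <stdbool.h>

enum dr_fault { DR_FAULT_NONE, DR_FAULT_UD, DR_FAULT_GP };

/*
 * Decide which exception, if any, a debug-register access raises.
 * dr:     debug register number (0-7)
 * cr4_de: guest CR4.DE; when set, DR4 and DR5 are unavailable
 * cpl:    current privilege level (0-3)
 */
static enum dr_fault dr_access_fault(int dr, bool cr4_de, int cpl)
{
    /* Unavailable DR4/5 wins, even for CPL > 0. */
    if ((dr == 4 || dr == 5) && cr4_de)
        return DR_FAULT_UD;

    /* Otherwise a non-privileged access faults with #GP. */
    if (cpl > 0)
        return DR_FAULT_GP;

    return DR_FAULT_NONE;
}
```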
    • KVM: x86: some apic broadcast modes does not work · 394457a9
      Committed by Nadav Amit
      KVM does not deliver x2APIC broadcast messages with physical mode.  Intel SDM
      (10.12.9 ICR Operation in x2APIC Mode) states: "A destination ID value of
      FFFF_FFFFH is used for broadcast of interrupts in both logical destination and
      physical destination modes."
      
      In addition, the local-apic enables cluster mode broadcast. As Intel SDM
      10.6.2.2 says: "Broadcast to all local APICs is achieved by setting all
      destination bits to one." This patch enables cluster mode broadcast.
      
      The fix tries to combine broadcast in different modes through unified code.
      
      One rare case occurs when the source of IPI has its APIC disabled.  In such
      a case, the source can still issue IPIs, but since the source is not obliged
      to have the same LAPIC mode as the enabled ones, we cannot rely on it.
      Since it is a rare case, it is unoptimized and done on the slow-path.
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      [As per Radim's review, use unsigned int for X2APIC_BROADCAST, return bool from
       kvm_apic_broadcast. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
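The two broadcast encodings quoted from the SDM differ only in width, which makes a unified check short. The sketch below assumes the mode of the IPI is known to the caller; `is_broadcast` and the `_SKETCH` constants are hypothetical names, and the rare disabled-source case described in the message is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* All destination bits set: 8 bits in xAPIC mode, 32 bits in x2APIC mode. */
#define APIC_BROADCAST_SKETCH    0xFFu
#define X2APIC_BROADCAST_SKETCH  0xFFFFFFFFu

/*
 * Return true if dest addresses every local APIC, for both physical and
 * logical destination modes. The mode is passed in by the caller (assumed).
 */
static bool is_broadcast(uint32_t dest, bool x2apic_mode)
{
    return dest == (x2apic_mode ? X2APIC_BROADCAST_SKETCH
                                : APIC_BROADCAST_SKETCH);
}
```

Note that 0xFF is an ordinary destination ID in x2APIC mode, which is why the mode has to be part of the check.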
  19. 24 October, 2014 · 2 commits
    • KVM: x86: Prevent host from panicking on shared MSR writes. · 8b3c3104
      Committed by Andy Honig
      The previous patch blocked invalid writes directly when the MSR
      is written.  As a precaution, prevent future similar mistakes by
      gracefully handling #GPs caused by writes to shared MSRs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrew Honig <ahonig@google.com>
      [Remove parts obsoleted by Nadav's patch. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Check non-canonical addresses upon WRMSR · 854e8bb1
      Committed by Nadav Amit
      Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
      written to certain MSRs. The behavior is "almost" identical for AMD and Intel
      (ignoring MSRs that are not implemented in either architecture since they would
      anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
      a non-canonical address is written on Intel but not on AMD (which ignores the
      top 32 bits).
      
      Accordingly, this patch injects a #GP on the MSRs which behave identically on
      Intel and AMD.  To eliminate the differences between the architectures, the
      value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is converted
      to a canonical value before writing instead of injecting a #GP.
      
      Some references from Intel and AMD manuals:
      
      According to the Intel SDM description of the WRMSR instruction, #GP is expected on
      WRMSR "If the source register contains a non-canonical address and ECX
      specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
      IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."
      
      According to the AMD instruction manual:
      LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
      LSTAR and CSTAR registers.  If an RIP written by WRMSR is not in canonical
      form, a general-protection exception (#GP) occurs."
      IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
      base field must be in canonical form or a #GP fault will occur."
      IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
      be in canonical form."
      
      This patch fixes CVE-2014-3610.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
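Both halves of this fix, rejecting non-canonical values and forcing a value into canonical form, come down to one rule: bits 63 through 47 must all equal bit 47. The sketch below assumes 48-bit virtual addresses; `make_canonical`, `is_noncanonical` and `VA_BITS_SKETCH` are hypothetical names, and the code uses masks rather than signed shifts to stay within well-defined C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed virtual address width for this sketch. */
#define VA_BITS_SKETCH 48

/* Sign-extend bit 47 into bits 63..48, yielding the canonical form of la. */
static uint64_t make_canonical(uint64_t la)
{
    const uint64_t sign_bit = UINT64_C(1) << (VA_BITS_SKETCH - 1);
    const uint64_t upper = ~UINT64_C(0) << (VA_BITS_SKETCH - 1); /* bits 63..47 */

    return (la & sign_bit) ? (la | upper) : (la & ~upper);
}

/* An address is non-canonical if sign extension would change it. */
static bool is_noncanonical(uint64_t la)
{
    return make_canonical(la) != la;
}
```

Following the message, a write to an MSR such as IA32_LSTAR would be refused with #GP when `is_noncanonical()` is true, while a value destined for IA32_SYSENTER_ESP or IA32_SYSENTER_EIP would be passed through `make_canonical()` instead.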
  20. 24 September, 2014 · 5 commits