1. 08 4月, 2015 21 次提交
    • W
      kvm: mmu: lazy collapse small sptes into large sptes · 3ea3b7fa
      Wanpeng Li 提交于
      Dirty logging tracks sptes in 4k granularity, meaning that large sptes
      have to be split.  If live migration is successful, the guest in the
      source machine will be destroyed and large sptes will be created in the
      destination. However, the guest continues to run in the source machine
      (for example if live migration fails), small sptes will remain around
      and cause bad performance.
      
      This patch introduce lazy collapsing of small sptes into large sptes.
      The rmap will be scanned in ioctl context when dirty logging is stopped,
      dropping those sptes which can be collapsed into a single large-page spte.
      Later page faults will create the large-page sptes.
      Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Message-Id: <1428046825-6905-1-git-send-email-wanpeng.li@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3ea3b7fa
    • N
      KVM: x86: Clear CR2 on VCPU reset · 1119022c
      Nadav Amit 提交于
      CR2 is not cleared as it should after reset.  See Intel SDM table named "IA-32
      Processor States Following Power-up, Reset, or INIT".
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1427933438-12782-5-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1119022c
    • N
      KVM: x86: DR0-DR3 are not clear on reset · ae561ede
      Nadav Amit 提交于
      DR0-DR3 are not cleared as they should during reset and when they are set from
      userspace.  It appears to be caused by c77fb5fe ("KVM: x86: Allow the guest
      to run with dirty debug registers").
      
      Force their reload on these situations.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1427933438-12782-4-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ae561ede
    • N
      KVM: x86: BSP in MSR_IA32_APICBASE is writable · 58d269d8
      Nadav Amit 提交于
      After reset, the CPU can change the BSP, which will be used upon INIT.  Reset
      should return the BSP which QEMU asked for, and therefore handled accordingly.
      
      To quote: "If the MP protocol has completed and a BSP is chosen, subsequent
      INITs (either to a specific processor or system wide) do not cause the MP
      protocol to be repeated."
      [Intel SDM 8.4.2: MP Initialization Protocol Requirements and Restrictions]
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1427933438-12782-3-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      58d269d8
    • R
      KVM: x86: simplify kvm_apic_map · 3b5a5ffa
      Radim Krčmář 提交于
      recalculate_apic_map() uses two passes over all VCPUs.  This is a relic
      from time when we selected a global mode in the first pass and set up
      the optimized table in the second pass (to have a consistent mode).
      
      Recent changes made mixed mode unoptimized and we can do it in one pass.
      Format of logical MDA is a function of the mode, so we encode it in
      apic_logical_id() and drop obsoleted variables from the struct.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-5-git-send-email-rkrcmar@redhat.com>
      [Add lid_bits temporary in apic_logical_id. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3b5a5ffa
    • R
      KVM: x86: avoid logical_map when it is invalid · 3548a259
      Radim Krčmář 提交于
      We want to support mixed modes and the easiest solution is to avoid
      optimizing those weird and unlikely scenarios.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-4-git-send-email-rkrcmar@redhat.com>
      [Add comment above KVM_APIC_MODE_* defines. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3548a259
    • R
      KVM: x86: fix mixed APIC mode broadcast · 9ea369b0
      Radim Krčmář 提交于
      Broadcast allowed only one global APIC mode, but mixed modes are
      theoretically possible.  x2APIC IPI doesn't mean 0xff as broadcast,
      the rest does.
      
      x2APIC broadcasts are accepted by xAPIC.  If we take SDM to be logical,
      even addreses beginning with 0xff should be accepted, but real hardware
      disagrees.  This patch aims for simple code by considering most of real
      behavior as undefined.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-3-git-send-email-rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9ea369b0
    • R
      KVM: x86: use MDA for interrupt matching · 03d2249e
      Radim Krčmář 提交于
      In mixed modes, we musn't deliver xAPIC IPIs like x2APIC and vice versa.
      Instead of preserving the information in apic_send_ipi(), we regain it
      by converting all destinations into correct MDA in the slow path.
      This allows easier reasoning about subsequent matching.
      
      Our kvm_apic_broadcast() had an interesting design decision: it didn't
      consider IOxAPIC 0xff as broadcast in x2APIC mode ...
      everything worked because IOxAPIC can't set that in physical mode and
      logical mode considered it as a message for first 8 VCPUs.
      This patch interprets IOxAPIC 0xff as x2APIC broadcast.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-2-git-send-email-rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      03d2249e
    • A
      kvm/ppc/mpic: drop unused IRQ_testbit · 19456060
      Arseny Solokha 提交于
      Drop unused static procedure which doesn't have callers within its
      translation unit. It had been already removed independently in QEMU[1]
      from the OpenPIC implementation borrowed from the kernel.
      
      [1] https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg01812.htmlSigned-off-by: NArseny Solokha <asolokha@kb.kras.ru>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <1424768706-23150-3-git-send-email-asolokha@kb.kras.ru>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      19456060
    • E
      KVM: nVMX: remove unnecessary double caching of MAXPHYADDR · 92d71bc6
      Eugene Korenevsky 提交于
      After speed-up of cpuid_maxphyaddr() it can be called frequently:
      instead of heavyweight enumeration of CPUID entries it returns a cached
      pre-computed value. It is also inlined now. So caching its result became
      unnecessary and can be removed.
      Signed-off-by: NEugene Korenevsky <ekorenevsky@gmail.com>
      Message-Id: <20150329205644.GA1258@gnote>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      92d71bc6
    • E
      KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry · 9090422f
      Eugene Korenevsky 提交于
      On each VM-entry CPU should check the following VMCS fields for zero bits
      beyond physical address width:
      -  APIC-access address
      -  virtual-APIC address
      -  posted-interrupt descriptor address
      This patch adds these checks required by Intel SDM.
      Signed-off-by: NEugene Korenevsky <ekorenevsky@gmail.com>
      Message-Id: <20150329205627.GA1244@gnote>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9090422f
    • E
      KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu · 5a4f55cd
      Eugene Korenevsky 提交于
      cpuid_maxphyaddr(), which performs lot of memory accesses is called
      extensively across KVM, especially in nVMX code.
      
      This patch adds a cached value of maxphyaddr to vcpu.arch to reduce the
      pressure onto CPU cache and simplify the code of cpuid_maxphyaddr()
      callers. The cached value is initialized in kvm_arch_vcpu_init() and
      reloaded every time CPUID is updated by usermode. It is obvious that
      these reloads occur infrequently.
      Signed-off-by: NEugene Korenevsky <ekorenevsky@gmail.com>
      Message-Id: <20150329205612.GA1223@gnote>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5a4f55cd
    • R
      KVM: vmx: pass error code with internal error #2 · 80f0e95d
      Radim Krčmář 提交于
      Exposing the on-stack error code with internal error is cheap and
      potentially useful.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1428001865-32280-1-git-send-email-rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      80f0e95d
    • R
      x86: vdso: fix pvclock races with task migration · 80f7fdb1
      Radim Krčmář 提交于
      If we were migrated right after __getcpu, but before reading the
      migration_count, we wouldn't notice that we read TSC of a different
      VCPU, nor that KVM's bug made pvti invalid, as only migration_count
      on source VCPU is increased.
      
      Change vdso instead of updating migration_count on destination.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Fixes: 0a4e6be9 ("x86: kvm: Revert "remove sched notifier for cross-cpu migrations"")
      Message-Id: <1428000263-11892-1-git-send-email-rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      80f7fdb1
    • P
      KVM: remove kvm_read_hva and kvm_read_hva_atomic · 3180a7fc
      Paolo Bonzini 提交于
      The corresponding write functions just use __copy_to_user.  Do the
      same on the read side.
      
      This reverts what's left of commit 86ab8cff (KVM: introduce
      gfn_to_hva_read/kvm_read_hva/kvm_read_hva_atomic, 2012-08-21)
      
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <1427976500-28533-1-git-send-email-pbonzini@redhat.com>
      3180a7fc
    • P
      KVM: x86: optimize delivery of TSC deadline timer interrupt · 9c8fd1ba
      Paolo Bonzini 提交于
      The newly-added tracepoint shows the following results on
      the tscdeadline_latency test:
      
              qemu-kvm-8387  [002]  6425.558974: kvm_vcpu_wakeup:      poll time 10407 ns
              qemu-kvm-8387  [002]  6425.558984: kvm_vcpu_wakeup:      poll time 0 ns
              qemu-kvm-8387  [002]  6425.561242: kvm_vcpu_wakeup:      poll time 10477 ns
              qemu-kvm-8387  [002]  6425.561251: kvm_vcpu_wakeup:      poll time 0 ns
      
      and so on.  This is because we need to go through kvm_vcpu_block again
      after the timer IRQ is injected.  Avoid it by polling once before
      entering kvm_vcpu_block.
      
      On my machine (Xeon E5 Sandy Bridge) this removes about 500 cycles (7%)
      from the latency of the TSC deadline timer.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9c8fd1ba
    • P
      KVM: x86: extract blocking logic from __vcpu_run · 362c698f
      Paolo Bonzini 提交于
      Rename the old __vcpu_run to vcpu_run, and extract part of it to a new
      function vcpu_block.
      
      The next patch will add a new condition in vcpu_block, avoid extra
      indentation.
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      362c698f
    • W
      kvm: x86: fix x86 eflags fixed bit · 35fd68a3
      Wanpeng Li 提交于
      Guest can't be booted w/ ept=0, there is a message dumped as below:
      
      If you're running a guest on an Intel machine without unrestricted mode
      support, the failure can be most likely due to the guest entering an invalid
      state for Intel VT. For example, the guest maybe running in big real mode
      which is not supported on less recent Intel processors.
      
      EAX=00000011 EBX=f000d2f6 ECX=00006cac EDX=000f8956
      ESI=bffbdf62 EDI=00000000 EBP=00006c68 ESP=00006c68
      EIP=0000d187 EFL=00000004 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
      ES =e000 000e0000 ffffffff 00809300 DPL=0 DS16 [-WA]
      CS =f000 000f0000 ffffffff 00809b00 DPL=0 CS16 [-RA]
      SS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
      DS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
      FS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
      GS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
      LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
      TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
      GDT=     000f6a80 00000037
      IDT=     000f6abe 00000000
      CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
      DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
      DR6=00000000ffff0ff0 DR7=0000000000000400
      EFER=0000000000000000
      Code=01 1e b8 6a 2e 0f 01 16 74 6a 0f 20 c0 66 83 c8 01 0f 22 c0 <66> ea 8f d1 0f 00 08 00 b8 10 00 00 00 8e d8 8e c0 8e d0 8e e0 8e e8 89 c8 ff e2 89 c1 b8X
      
      X86 eflags bit 1 is fixed set, which means that 1 << 1 is set instead of 1,
      this patch fix it.
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Message-Id: <1428473294-6633-1-git-send-email-wanpeng.li@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      35fd68a3
    • P
      Merge tag 'kvm-s390-next-20150331' of... · 7f22b45d
      Paolo Bonzini 提交于
      Merge tag 'kvm-s390-next-20150331' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      Features and fixes for 4.1 (kvm/next)
      
      1. Assorted changes
      1.1 allow more feature bits for the guest
      1.2 Store breaking event address on program interrupts
      
      2. Interrupt handling rework
      2.1 Fix copy_to_user while holding a spinlock (cc stable)
      2.2 Rework floating interrupts to follow the priorities
      2.3 Allow to inject all local interrupts via new ioctl
      2.4 allow to get/set the full local irq state, e.g. for migration
          and introspection
      7f22b45d
    • P
      Merge tag 'kvm-arm-for-4.1' of... · bf0fb67c
      Paolo Bonzini 提交于
      Merge tag 'kvm-arm-for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into 'kvm-next'
      
      KVM/ARM changes for v4.1:
      
      - fixes for live migration
      - irqfd support
      - kvm-io-bus & vgic rework to enable ioeventfd
      - page ageing for stage-2 translation
      - various cleanups
      bf0fb67c
    • P
      Merge tag 'kvm-arm-fixes-4.0-rc5' of... · 8999602d
      Paolo Bonzini 提交于
      Merge tag 'kvm-arm-fixes-4.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into 'kvm-next'
      
      Fixes for KVM/ARM for 4.0-rc5.
      
      Fixes page refcounting issues in our Stage-2 page table management code,
      fixes a missing unlock in a gicv3 error path, and fixes a race that can
      cause lost interrupts if signals are pending just prior to entering the
      guest.
      8999602d
  2. 01 4月, 2015 7 次提交
  3. 31 3月, 2015 6 次提交
    • C
      KVM: s390: enable more features that need no hypervisor changes · a3ed8dae
      Christian Borntraeger 提交于
      After some review about what these facilities do, the following
      facilities will work under KVM and can, therefore, be reported
      to the guest if the cpu model and the host cpu provide this bit.
      
      There are plans underway to make the whole bit thing more readable,
      but its not yet finished. So here are some last bit changes and
      we enhance the KVM mask with:
      
      9 The sense-running-status facility is installed in the
        z/Architecture architectural mode.
        ---> handled by SIE or KVM
      
      10 The conditional-SSKE facility is installed in the
         z/Architecture architectural mode.
        ---> handled by SIE. KVM will retry SIE
      
      13 The IPTE-range facility is installed in the
         z/Architecture architectural mode.
        ---> handled by SIE. KVM will retry SIE
      
      36 The enhanced-monitor facility is installed in the
         z/Architecture architectural mode.
        ---> handled by SIE
      
      47 The CMPSC-enhancement facility is installed in the
         z/Architecture architectural mode.
        ---> handled by SIE
      
      48 The decimal-floating-point zoned-conversion facility
         is installed in the z/Architecture architectural mode.
        ---> handled by SIE
      
      49 The execution-hint, load-and-trap, miscellaneous-
         instruction-extensions and processor-assist
        ---> handled by SIE
      
      51 The local-TLB-clearing facility is installed in the
         z/Architecture architectural mode.
        ---> handled by SIE
      
      52 The interlocked-access facility 2 is installed.
        ---> handled by SIE
      
      53 The load/store-on-condition facility 2 and load-and-
         zero-rightmost-byte facility are installed in the
         z/Architecture architectural mode.
        ---> handled by SIE
      
      57 The message-security-assist-extension-5 facility is
        installed in the z/Architecture architectural mode.
        ---> handled by SIE
      
      66 The reset-reference-bits-multiple facility is installed
        in the z/Architecture architectural mode.
        ---> handled by SIE. KVM will retry SIE
      
      80 The decimal-floating-point packed-conversion
         facility is installed in the z/Architecture architectural
         mode.
        ---> handled by SIE
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Tested-by: NMichael Mueller <mimu@linux.vnet.ibm.com>
      Acked-by: NCornelia Huck <cornelia.huck@de.ibm.com>
      a3ed8dae
    • D
      KVM: s390: store the breaking-event address on pgm interrupts · 2ba45968
      David Hildenbrand 提交于
      If the PER-3 facility is installed, the breaking-event address is to be
      stored in the low core.
      
      There is no facility bit for PER-3 in stfl(e) and Linux always uses the
      value at address 272 no matter if PER-3 is available or not.
      We can't hide its existence from the guest. All program interrupts
      injected via the SIE automatically store this information if the PER-3
      facility is available in the hypervisor. Also the itdb contains the
      address automatically.
      
      As there is no switch to turn this mechanism off, let's simply make it
      consistent and also store the breaking event address in case of manual
      program interrupt injection.
      Reviewed-by: NJens Freimann <jfrei@linux.vnet.ibm.com>
      Signed-off-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: NCornelia Huck <cornelia.huck@de.ibm.com>
      2ba45968
    • N
      KVM: arm/arm64: enable KVM_CAP_IOEVENTFD · d44758c0
      Nikolay Nikolaev 提交于
      As the infrastructure for eventfd has now been merged, report the
      ioeventfd capability as being supported.
      Signed-off-by: NNikolay Nikolaev <n.nikolaev@virtualopensystems.com>
      [maz: grouped the case entry with the others, fixed commit log]
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      d44758c0
    • A
      KVM: arm/arm64: rework MMIO abort handling to use KVM MMIO bus · 950324ab
      Andre Przywara 提交于
      Currently we have struct kvm_exit_mmio for encapsulating MMIO abort
      data to be passed on from syndrome decoding all the way down to the
      VGIC register handlers. Now as we switch the MMIO handling to be
      routed through the KVM MMIO bus, it does not make sense anymore to
      use that structure already from the beginning. So we keep the data in
      local variables until we put them into the kvm_io_bus framework.
      Then we fill kvm_exit_mmio in the VGIC only, making it a VGIC private
      structure. On that way we replace the data buffer in that structure
      with a pointer pointing to a single location in a local variable, so
      we get rid of some copying on the way.
      With all of the virtual GIC emulation code now being registered with
      the kvm_io_bus, we can remove all of the old MMIO handling code and
      its dispatching functionality.
      
      I didn't bother to rename kvm_exit_mmio (to vgic_mmio or something),
      because that touches a lot of code lines without any good reason.
      
      This is based on an original patch by Nikolay.
      Signed-off-by: NAndre Przywara <andre.przywara@arm.com>
      Cc: Nikolay Nikolaev <n.nikolaev@virtualopensystems.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      950324ab
    • A
      KVM: arm/arm64: prepare GICv3 emulation to use kvm_io_bus MMIO handling · fb8f61ab
      Andre Przywara 提交于
      Using the framework provided by the recent vgic.c changes, we
      register a kvm_io_bus device on mapping the virtual GICv3 resources.
      The distributor mapping is pretty straight forward, but the
      redistributors need some more love, since they need to be tagged with
      the respective redistributor (read: VCPU) they are connected with.
      We use the kvm_io_bus framework to register one devices per VCPU.
      Signed-off-by: NAndre Przywara <andre.przywara@arm.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      fb8f61ab
    • A
      KVM: arm/arm64: merge GICv3 RD_base and SGI_base register frames · 0ba10d53
      Andre Przywara 提交于
      Currently we handle the redistributor registers in two separate MMIO
      regions, one for the overall behaviour and SPIs and one for the
      SGIs/PPIs. That latter forces the creation of _two_ KVM I/O bus
      devices for each redistributor.
      Since the spec mandates those two pages to be contigious, we could as
      well merge them and save the churn with the second KVM I/O bus device.
      Signed-off-by: NAndre Przywara <andre.przywara@arm.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      0ba10d53
  4. 30 3月, 2015 6 次提交