1. 20 5月, 2015 4 次提交
    • X
      KVM: MMU: fix SMAP virtualization · edc90b7d
      Xiao Guangrong 提交于
      KVM may turn a user page to a kernel page when kernel writes a readonly
      user page if CR0.WP = 1. This shadow page entry will be reused after
      SMAP is enabled so that kernel is allowed to access this user page
      
      Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
      once CR4.SMAP is updated
      Signed-off-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      edc90b7d
    • N
      KVM: x86: Fix zero iterations REP-string · 428e3d08
      Nadav Amit 提交于
      When a REP-string is executed in 64-bit mode with an address-size prefix,
      ECX/EDI/ESI are used as counter and pointers. When ECX is initially zero, Intel
      CPUs clear the high 32-bits of RCX, and recent Intel CPUs update the high bits
      of the pointers in MOVS/STOS. This behavior is specific to Intel according to
      few experiments.
      
      As one may guess, this is an undocumented behavior. Yet, it is observable in
      the guest, since at least VMX traps REP-INS/OUTS even when ECX=0. Note that
      VMware appears to get it right.  The behavior can be observed using the
      following code:
      
       #include <stdio.h>
      
       #define LOW_MASK	(0xffffffff00000000ull)
       #define ALL_MASK	(0xffffffffffffffffull)
       #define TEST(opcode)							\
      	do {								\
      	asm volatile(".byte 0xf2 \n\t .byte 0x67 \n\t .byte " opcode "\n\t" \
      			: "=S"(s), "=c"(c), "=D"(d) 			\
      			: "S"(ALL_MASK), "c"(LOW_MASK), "D"(ALL_MASK));	\
      	printf("opcode %s rcx=%llx rsi=%llx rdi=%llx\n",		\
      		opcode, c, s, d);					\
      	} while(0)
      
      void main()
      {
      	unsigned long long s, d, c;
      	iopl(3);
      	TEST("0x6c");
      	TEST("0x6d");
      	TEST("0x6e");
      	TEST("0x6f");
      	TEST("0xa4");
      	TEST("0xa5");
      	TEST("0xa6");
      	TEST("0xa7");
      	TEST("0xaa");
      	TEST("0xab");
      	TEST("0xae");
      	TEST("0xaf");
      }
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      428e3d08
    • N
      KVM: x86: Fix update RCX/RDI/RSI on REP-string · ee122a71
      Nadav Amit 提交于
      When REP-string instruction is preceded with an address-size prefix,
      ECX/EDI/ESI are used as the operation counter and pointers.  When they are
      updated, the high 32-bits of RCX/RDI/RSI are cleared, similarly to the way they
      are updated on every 32-bit register operation.  Fix it.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ee122a71
    • N
      KVM: x86: Fix DR7 mask on task-switch while debugging · 3db176d5
      Nadav Amit 提交于
      If the host sets hardware breakpoints to debug the guest, and a task-switch
      occurs in the guest, the architectural DR7 will not be updated. The effective
      DR7 would be updated instead.
      
      This fix puts the DR7 update during task-switch emulation, so it now uses the
      standard DR setting mechanism instead of the one that was previously used. As a
      bonus, the update of DR7 will now be effective for AMD as well.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3db176d5
  2. 08 5月, 2015 13 次提交
  3. 07 5月, 2015 14 次提交
    • P
      KVM: x86: dump VMCS on invalid entry · 4eb64dce
      Paolo Bonzini 提交于
      Code and format roughly based on Xen's vmcs_dump_vcpu.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4eb64dce
    • M
      x86: kvmclock: drop rdtsc_barrier() · a3eb97bd
      Marcelo Tosatti 提交于
      Drop unnecessary rdtsc_barrier(), as has been determined empirically,
      see 057e6a8c for details.
      
      Noticed by Andy Lutomirski.
      
      Improves clock_gettime() by approximately 15% on
      Intel i7-3520M @ 2.90GHz.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a3eb97bd
    • J
      KVM: x86: drop unneeded null test · d90e3a35
      Julia Lawall 提交于
      If the null test is needed, the call to cancel_delayed_work_sync would have
      already crashed.  Normally, the destroy function should only be called
      if the init function has succeeded, in which case ioapic is not null.
      
      Problem found using Coccinelle.
      Suggested-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d90e3a35
    • R
      KVM: x86: fix initial PAT value · 74545705
      Radim Krčmář 提交于
      PAT should be 0007_0406_0007_0406h on RESET and not modified on INIT.
      VMX used a wrong value (host's PAT) and while SVM used the right one,
      it never got to arch.pat.
      
      This is not an issue with QEMU as it will force the correct value.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      74545705
    • R
      kvm,x86: load guest FPU context more eagerly · 653f52c3
      Rik van Riel 提交于
      Currently KVM will clear the FPU bits in CR0.TS in the VMCS, and trap to
      re-load them every time the guest accesses the FPU after a switch back into
      the guest from the host.
      
      This patch copies the x86 task switch semantics for FPU loading, with the
      FPU loaded eagerly after first use if the system uses eager fpu mode,
      or if the guest uses the FPU frequently.
      
      In the latter case, after loading the FPU for 255 times, the fpu_counter
      will roll over, and we will revert to loading the FPU on demand, until
      it has been established that the guest is still actively using the FPU.
      
      This mirrors the x86 task switch policy, which seems to work.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      653f52c3
    • J
      kvm: x86: Deliver MSI IRQ to only lowest prio cpu if msi_redir_hint is true · d1ebdbf9
      James Sullivan 提交于
      An MSI interrupt should only be delivered to the lowest priority CPU
      when it has RH=1, regardless of the delivery mode. Modified
      kvm_is_dm_lowest_prio() to check for either irq->delivery_mode == APIC_DM_LOWPRI
      or irq->msi_redir_hint.
      
      Moved kvm_is_dm_lowest_prio() into lapic.h and renamed to
      kvm_lowest_prio_delivery().
      
      Changed a check in kvm_irq_delivery_to_apic_fast() from
      irq->delivery_mode == APIC_DM_LOWPRI to kvm_is_dm_lowest_prio().
      Signed-off-by: NJames Sullivan <sullivan.james.f@gmail.com>
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d1ebdbf9
    • J
      kvm: x86: Extended struct kvm_lapic_irq with msi_redir_hint for MSI delivery · 93bbf0b8
      James Sullivan 提交于
      Extended struct kvm_lapic_irq with bool msi_redir_hint, which will
      be used to determine if the delivery of the MSI should target only
      the lowest priority CPU in the logical group specified for delivery.
      (In physical dest mode, the RH bit is not relevant). Initialized the value
      of msi_redir_hint to true when RH=1 in kvm_set_msi_irq(), and initialized
      to false in all other cases.
      
      Added value of msi_redir_hint to a debug message dump of an IRQ in
      apic_send_ipi().
      Signed-off-by: NJames Sullivan <sullivan.james.f@gmail.com>
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      93bbf0b8
    • P
      KVM: x86: tweak types of fields in kvm_lapic_irq · b7cb2231
      Paolo Bonzini 提交于
      Change to u16 if they only contain data in the low 16 bits.
      
      Change the level field to bool, since we assign 1 sometimes, but
      just mask icr_low with APIC_INT_ASSERT in apic_send_ipi.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b7cb2231
    • N
      KVM: x86: INIT and reset sequences are different · d28bc9dd
      Nadav Amit 提交于
      x86 architecture defines differences between the reset and INIT sequences.
      INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
      MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP.
      
      References (from Intel SDM):
      
      "If the MP protocol has completed and a BSP is chosen, subsequent INITs (either
      to a specific processor or system wide) do not cause the MP protocol to be
      repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions]
      
      [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]
      
      "If the processor is reset by asserting the INIT# pin, the x87 FPU state is not
      changed." [9.2: X87 FPU INITIALIZATION]
      
      "The state of the local APIC following an INIT reset is the same as it is after
      a power-up or hardware reset, except that the APIC ID and arbitration ID
      registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
      ("Wait-for-SIPI" State)]
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d28bc9dd
    • N
      KVM: x86: Support for disabling quirks · 90de4a18
      Nadav Amit 提交于
      Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previous
      created in order to overcome QEMU issues. Those issue were mostly result of
      invalid VM BIOS.  Currently there are two quirks that can be disabled:
      
      1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot
      2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot
      
      These two issues are already resolved in recent releases of QEMU, and would
      therefore be disabled by QEMU.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1428879221-29996-1-git-send-email-namit@cs.technion.ac.il>
      [Report capability from KVM_CHECK_EXTENSION too. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      90de4a18
    • P
      KVM: booke: use __kvm_guest_exit · e233d54d
      Paolo Bonzini 提交于
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e233d54d
    • C
      KVM: arm/mips/x86/power use __kvm_guest_{enter|exit} · ccf73aaf
      Christian Borntraeger 提交于
      Use __kvm_guest_{enter|exit} instead of kvm_guest_{enter|exit}
      where interrupts are disabled.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ccf73aaf
    • C
      KVM: provide irq_unsafe kvm_guest_{enter|exit} · 0097d12e
      Christian Borntraeger 提交于
      Several kvm architectures disable interrupts before kvm_guest_enter.
      kvm_guest_enter then uses local_irq_save/restore to disable interrupts
      again or for the first time. Lets provide underscore versions of
      kvm_guest_{enter|exit} that assume being called locked.
      kvm_guest_enter now disables interrupts for the full function and
      thus we can remove the check for preemptible.
      
      This patch then adopts s390/kvm to use local_irq_disable/enable calls
      which are slighty cheaper that local_irq_save/restore and call these
      new functions.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0097d12e
    • L
      kvmclock: set scheduler clock stable · ff7bbb9c
      Luiz Capitulino 提交于
      If you try to enable NOHZ_FULL on a guest today, you'll get
      the following error when the guest tries to deactivate the
      scheduler tick:
      
       WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 can_stop_full_tick+0xb9/0x290()
       NO_HZ FULL will not work with unstable sched clock
       CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb7 #204
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       Workqueue: events flush_to_ldisc
        ffffffff8162a0c7 ffff88011f583e88 ffffffff814e6ba0 0000000000000002
        ffff88011f583ed8 ffff88011f583ec8 ffffffff8104d095 ffff88011f583eb8
        0000000000000000 0000000000000003 0000000000000001 0000000000000001
       Call Trace:
        <IRQ>  [<ffffffff814e6ba0>] dump_stack+0x4f/0x7b
        [<ffffffff8104d095>] warn_slowpath_common+0x85/0xc0
        [<ffffffff8104d146>] warn_slowpath_fmt+0x46/0x50
        [<ffffffff810bd2a9>] can_stop_full_tick+0xb9/0x290
        [<ffffffff810bd9ed>] tick_nohz_irq_exit+0x8d/0xb0
        [<ffffffff810511c5>] irq_exit+0xc5/0x130
        [<ffffffff814f180a>] smp_apic_timer_interrupt+0x4a/0x60
        [<ffffffff814eff5e>] apic_timer_interrupt+0x6e/0x80
        <EOI>  [<ffffffff814ee5d1>] ? _raw_spin_unlock_irqrestore+0x31/0x60
        [<ffffffff8108bbc8>] __wake_up+0x48/0x60
        [<ffffffff8134836c>] n_tty_receive_buf_common+0x49c/0xba0
        [<ffffffff8134a6bf>] ? tty_ldisc_ref+0x1f/0x70
        [<ffffffff81348a84>] n_tty_receive_buf2+0x14/0x20
        [<ffffffff8134b390>] flush_to_ldisc+0xe0/0x120
        [<ffffffff81064d05>] process_one_work+0x1d5/0x540
        [<ffffffff81064c81>] ? process_one_work+0x151/0x540
        [<ffffffff81065191>] worker_thread+0x121/0x470
        [<ffffffff81065070>] ? process_one_work+0x540/0x540
        [<ffffffff8106b4df>] kthread+0xef/0x110
        [<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
        [<ffffffff814ef4f2>] ret_from_fork+0x42/0x70
        [<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
       ---[ end trace 06e3507544a38866 ]---
      
      However, it turns out that kvmclock does provide a stable
      sched_clock callback. So, let the scheduler know this which
      in turn makes NOHZ_FULL work in the guest.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ff7bbb9c
  4. 01 5月, 2015 4 次提交
    • S
      powerpc/powernv: Restore non-volatile CRs after nap · 0aab3747
      Sam Bobroff 提交于
      Patches 7cba160a "powernv/cpuidle: Redesign idle states management"
      and 77b54e9f "powernv/powerpc: Add winkle support for offline cpus"
      use non-volatile condition registers (cr2, cr3 and cr4) early in the system
      reset interrupt handler (system_reset_pSeries()) before it has been determined
      if state loss has occurred. If state loss has not occurred, control returns via
      the power7_wakeup_noloss() path which does not restore those condition
      registers, leaving them corrupted.
      
      Fix this by restoring the condition registers in the power7_wakeup_noloss()
      case.
      
      This is apparent when running a KVM guest on hardware that does not
      support winkle or sleep and the guest makes use of secondary threads. In
      practice this means Power7 machines, though some early unreleased Power8
      machines may also be susceptible.
      
      The secondary CPUs are taken off line before the guest is started and
      they call pnv_smp_cpu_kill_self(). This checks support for sleep
      states (in this case there is no support) and power7_nap() is called.
      
      When the CPU is woken, power7_nap() returns and because the CPU is
      still off line, the main while loop executes again. The sleep states
      support test is executed again, but because the tested values cannot
      have changed, the compiler has optimized the test away and instead we
      rely on the result of the first test, which has been left in cr3
      and/or cr4. With the result overwritten, the wrong branch is taken and
      power7_winkle() is called on a CPU that does not support it, leading
      to it stalling.
      
      Fixes: 7cba160a ("powernv/cpuidle: Redesign idle states management")
      Fixes: 77b54e9f ("powernv/powerpc: Add winkle support for offline cpus")
      [mpe: Massage change log a bit more]
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0aab3747
    • G
      powerpc/eeh: Delay probing EEH device during hotplug · d91dafc0
      Gavin Shan 提交于
      Commit 1c509148b ("powerpc/eeh: Do probe on pci_dn") probes EEH
      devices in early stage, which is reasonable to pSeries platform.
      However, it's wrong for PowerNV platform because the PE# isn't
      determined until the resources (IO and MMIO) are assigned to
      PE in hotplug case. So we have to delay probing EEH devices
      for PowerNV platform until the PE# is assigned.
      
      Fixes: ff57b454 ("powerpc/eeh: Do probe on pci_dn")
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d91dafc0
    • G
      powerpc/eeh: Fix race condition in pcibios_set_pcie_reset_state() · 1ae79b78
      Gavin Shan 提交于
      When asserting reset in pcibios_set_pcie_reset_state(), the PE
      is enforced to (hardware) frozen state in order to drop unexpected
      PCI transactions (except PCI config read/write) automatically by
      hardware during reset, which would cause recursive EEH error.
      However, the (software) frozen state EEH_PE_ISOLATED is missed.
      When users get 0xFF from PCI config or MMIO read, EEH_PE_ISOLATED
      is set in PE state retrival backend. Unfortunately, nobody (the
      reset handler or the EEH recovery functinality in host) will clear
      EEH_PE_ISOLATED when the PE has been passed through to guest.
      
      The patch sets and clears EEH_PE_ISOLATED properly during reset
      in function pcibios_set_pcie_reset_state() to fix the issue.
      
      Fixes: 28158cd1 ("Enhance pcibios_set_pcie_reset_state()")
      Reported-by: NCarol L. Soto <clsoto@us.ibm.com>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Tested-by: NCarol L. Soto <clsoto@us.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1ae79b78
    • N
      powerpc/pseries: Correct cpu affinity for dlpar added cpus · f32393c9
      Nathan Fontenot 提交于
      The incorrect ordering of operations during cpu dlpar add results in invalid
      affinity for the cpu being added. The ibm,associativity property in the
      device tree is populated with all zeroes for the added cpu which results in
      invalid affinity mappings and all cpus appear to belong to node 0.
      
      This occurs because rtas configure-connector is called prior to making the
      rtas set-indicator calls. Phyp does not assign affinity information
      for a cpu until the rtas set-indicator calls are made to set the isolation
      and allocation state.
      
      Correct the order of operations to make the rtas set-indicator
      calls (done in dlpar_acquire_drc) before calling rtas configure-connector.
      
      Fixes: 1a8061c4 ("powerpc/pseries: Add kernel based CPU DLPAR handling")
      Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f32393c9
  5. 30 4月, 2015 4 次提交
  6. 29 4月, 2015 1 次提交