1. 23 Mar, 2020 1 commit
    • KVM: LAPIC: Mark hrtimer for period or oneshot mode to expire in hard interrupt context · edec6e01
      Committed by He Zhe
      apic->lapic_timer.timer was initialized with HRTIMER_MODE_ABS_HARD but
      later started with HRTIMER_MODE_ABS, which may trigger the following
      warning on a PREEMPT_RT kernel:
      
      WARNING: CPU: 1 PID: 2957 at kernel/time/hrtimer.c:1129 hrtimer_start_range_ns+0x348/0x3f0
      CPU: 1 PID: 2957 Comm: qemu-system-x86 Not tainted 5.4.23-rt11 #1
      Hardware name: Supermicro SYS-E300-9A-8C/A2SDi-8C-HLN4F, BIOS 1.1a 09/18/2018
      RIP: 0010:hrtimer_start_range_ns+0x348/0x3f0
      Code: 4d b8 0f 94 c1 0f b6 c9 e8 35 f1 ff ff 4c 8b 45
            b0 e9 3b fd ff ff e8 d7 3f fa ff 48 98 4c 03 34
            c5 a0 26 bf 93 e9 a1 fd ff ff <0f> 0b e9 fd fc ff
            ff 65 8b 05 fa b7 90 6d 89 c0 48 0f a3 05 60 91
      RSP: 0018:ffffbc60026ffaf8 EFLAGS: 00010202
      RAX: 0000000000000001 RBX: ffff9d81657d4110 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000006cc7987bcf RDI: ffff9d81657d4110
      RBP: ffffbc60026ffb58 R08: 0000000000000001 R09: 0000000000000010
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000006cc7987bcf
      R13: 0000000000000000 R14: 0000006cc7987bcf R15: ffffbc60026d6a00
      FS: 00007f401daed700(0000) GS:ffff9d81ffa40000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000ffffffff CR3: 0000000fa7574000 CR4: 00000000003426e0
      Call Trace:
      ? kvm_release_pfn_clean+0x22/0x60 [kvm]
      start_sw_timer+0x85/0x230 [kvm]
      ? vmx_vmexit+0x1b/0x30 [kvm_intel]
      kvm_lapic_switch_to_sw_timer+0x72/0x80 [kvm]
      vmx_pre_block+0x1cb/0x260 [kvm_intel]
      ? vmx_vmexit+0xf/0x30 [kvm_intel]
      ? vmx_vmexit+0x1b/0x30 [kvm_intel]
      ? vmx_vmexit+0xf/0x30 [kvm_intel]
      ? vmx_vmexit+0x1b/0x30 [kvm_intel]
      ? vmx_vmexit+0xf/0x30 [kvm_intel]
      ? vmx_vmexit+0x1b/0x30 [kvm_intel]
      ? vmx_vmexit+0xf/0x30 [kvm_intel]
      ? vmx_vmexit+0xf/0x30 [kvm_intel]
      ? vmx_vmexit+0x1b/0x30 [kvm_intel]
      ? vmx_vmexit+0xf/0x30 [kvm_intel]
      ? vmx_vmexit+0x1b/0x30 [kvm_intel]
      ? vmx_vmexit+0xf/0x30 [kvm_intel]
      ? vmx_vmexit+0x1b/0x30 [kvm_intel]
      ? vmx_vmexit+0xf/0x30 [kvm_intel]
      ? vmx_vmexit+0x1b/0x30 [kvm_intel]
      ? vmx_vmexit+0xf/0x30 [kvm_intel]
      ? vmx_sync_pir_to_irr+0x9e/0x100 [kvm_intel]
      ? kvm_apic_has_interrupt+0x46/0x80 [kvm]
      kvm_arch_vcpu_ioctl_run+0x85b/0x1fa0 [kvm]
      ? _raw_spin_unlock_irqrestore+0x18/0x50
      ? _copy_to_user+0x2c/0x30
      kvm_vcpu_ioctl+0x235/0x660 [kvm]
      ? rt_spin_unlock+0x2c/0x50
      do_vfs_ioctl+0x3e4/0x650
      ? __fget+0x7a/0xa0
      ksys_ioctl+0x67/0x90
      __x64_sys_ioctl+0x1a/0x20
      do_syscall_64+0x4d/0x120
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7f4027cc54a7
      Code: 00 00 90 48 8b 05 e9 59 0c 00 64 c7 00 26 00 00
            00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00
            00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff
            73 01 c3 48 8b 0d b9 59 0c 00 f7 d8 64 89 01 48
      RSP: 002b:00007f401dae9858 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00005558bd029690 RCX: 00007f4027cc54a7
      RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000d
      RBP: 00007f4028b72000 R08: 00005558bc829ad0 R09: 00000000ffffffff
      R10: 00005558bcf90ca0 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 00005558bce1c840
      --[ end trace 0000000000000002 ]--
      Signed-off-by: He Zhe <zhe.he@windriver.com>
      Message-Id: <1584687967-332859-1-git-send-email-zhe.he@windriver.com>
      Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
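The mismatch described in this commit can be illustrated with a small toy model. This is only a sketch under assumptions: the `toy_*` names and the mode bit values are hypothetical and are not the kernel's hrtimer API. The model records whether a timer was initialized for hard interrupt context and reports when it is later started with a mode that disagrees, which is the situation behind the warning above.

```c
#include <stdbool.h>

/* Toy stand-ins for hrtimer mode bits; the values are illustrative only. */
enum toy_mode {
    TOY_MODE_ABS      = 0x00,
    TOY_MODE_HARD     = 0x08,
    TOY_MODE_ABS_HARD = TOY_MODE_ABS | TOY_MODE_HARD
};

struct toy_timer {
    bool is_hard; /* recorded once, at init time */
};

/* Remember whether the timer was set up to expire in hard interrupt context. */
void toy_timer_init(struct toy_timer *t, enum toy_mode mode)
{
    t->is_hard = (mode & TOY_MODE_HARD) != 0;
}

/* Return true when the start mode disagrees with the init mode. */
bool toy_timer_start_mismatch(const struct toy_timer *t, enum toy_mode mode)
{
    bool start_hard = (mode & TOY_MODE_HARD) != 0;

    return start_hard != t->is_hard;
}
```

In this model, initializing with `TOY_MODE_ABS_HARD` and starting with `TOY_MODE_ABS` is flagged, while using the hard variant in both places is not.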
  2. 22 Feb, 2020 2 commits
  3. 13 Feb, 2020 1 commit
  4. 05 Feb, 2020 1 commit
  5. 28 Jan, 2020 3 commits
  6. 21 Jan, 2020 2 commits
  7. 09 Jan, 2020 3 commits
  8. 15 Nov, 2019 3 commits
  9. 23 Oct, 2019 1 commit
  10. 26 Sep, 2019 1 commit
  11. 24 Sep, 2019 1 commit
  12. 12 Sep, 2019 2 commits
    • KVM: x86: Fix INIT signal handling in various CPU states · 4b9852f4
      Committed by Liran Alon
      Commit cd7764fe ("KVM: x86: latch INITs while in system management mode")
      changed the code to latch INIT while the vCPU is in SMM and to process the
      latched INIT when leaving SMM. Its commit message noted in passing that
      similar treatment is also needed while the vCPU is in VMX non-root mode.
      
      However, INIT signals should actually be latched in various vCPU states:
      (*) For both Intel and AMD, INIT signals should be latched while vCPU
      is in SMM.
      (*) For Intel, INIT should also be latched while vCPU is in VMX
      operation and later processed when vCPU leaves VMX operation by
      executing VMXOFF.
      (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
      or in guest-mode with intercept defined on INIT signal.
      
      To fix this:
      1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
      can define the various CPU states in which INIT signals should be
      blocked and modify kvm_apic_accept_events() to use it.
      2) Modify vmx_check_nested_events() to check for a pending INIT signal
      while the vCPU is in guest-mode. If so, emulate a vmexit on
      EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour,
      but this is currently left as a TODO comment to implement in the future
      because nSVM doesn't yet implement svm_check_nested_events().
      
      Note: Currently the KVM nVMX implementation doesn't support the VMX wait-for-SIPI
      activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
      the guest (see nested_vmx_setup_ctls_msrs()).
      If and when support for this activity state is implemented,
      kvm_check_nested_events() would need to avoid emulating vmexit on
      INIT signal in case activity-state is wait-for-SIPI. In addition,
      kvm_apic_accept_events() would need to be modified to avoid discarding
      SIPI in case VMX activity-state is wait-for-SIPI but instead delay
      SIPI processing to vmx_check_nested_events() that would clear
      pending APIC events and emulate vmexit on SIPI.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
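The per-vendor blocking rules listed in this commit message can be sketched as a toy model. Everything here is an assumption for illustration: the `toy_*` types and functions are hypothetical and are not the kvm_x86_ops interface or KVM's actual code. The sketch only mirrors the stated conditions: SMM for both vendors, VMX operation for Intel, and GIF=0 or guest-mode with an INIT intercept for AMD.

```c
#include <stdbool.h>

/* Hypothetical vCPU state; the field names are illustrative only. */
struct toy_vcpu {
    bool in_smm;            /* both vendors */
    bool in_vmx_operation;  /* Intel: between VMXON and VMXOFF */
    bool gif;               /* AMD: global interrupt flag */
    bool guest_mode;        /* AMD: running a nested guest */
    bool init_intercepted;  /* AMD: intercept defined on INIT */
    bool init_pending;      /* a latched INIT signal */
};

typedef bool (*toy_init_blocked_fn)(const struct toy_vcpu *v);

/* Intel rule from the text: blocked in SMM or in VMX operation. */
bool toy_intel_init_blocked(const struct toy_vcpu *v)
{
    return v->in_smm || v->in_vmx_operation;
}

/* AMD rule from the text: blocked in SMM, with GIF=0, or in
 * guest-mode with an intercept defined on INIT. */
bool toy_amd_init_blocked(const struct toy_vcpu *v)
{
    return v->in_smm || !v->gif || (v->guest_mode && v->init_intercepted);
}

/* Process a pending INIT unless the vendor hook says it is blocked.
 * Returns true if INIT was consumed, false if nothing happened or it
 * stays latched for later. */
bool toy_accept_init(struct toy_vcpu *v, toy_init_blocked_fn blocked)
{
    if (!v->init_pending || blocked(v))
        return false;
    v->init_pending = false;
    return true;
}
```

Passing the vendor rule as a function pointer keeps the generic accept path free of vendor-specific state checks, which is the design point item 1) makes.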
    • KVM: LAPIC: Micro optimize IPI latency · 2b0911d1
      Committed by Wanpeng Li
      This patch optimizes the virtual IPI emulation sequence:
      
      write ICR2                     write ICR2
      write ICR                      read ICR2
      read ICR            ==>        send virtual IPI
      read ICR2                      write ICR
      send virtual IPI
      
      It reduces the kvm-unit-tests/vmexit.flat IPI test latency (measured from
      the sender sending the IPI to the sender receiving the ACK) from 3319
      cycles to 3203 cycles on a Skylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. 14 Aug, 2019 1 commit
    • kvm: x86: skip populating logical dest map if apic is not sw enabled · b14c876b
      Committed by Radim Krcmar
      recalculate_apic_map() does not sanitize the LDR, so it's possible that
      multiple bits are set. In that case, a previously valid entry
      can potentially be overwritten by an invalid one.
      
      This condition is hit when booting a 32 bit, >8 CPU, RHEL6 guest and then
      triggering a crash to boot a kdump kernel. This is the sequence of
      events:
      1. Linux boots in bigsmp mode and enables PhysFlat, however, it still
      writes to the LDR which probably will never be used.
      2. However, when booting into kdump, the stale LDR values remain as
      they are not cleared by the guest and there isn't an APIC reset.
      3. kdump boots with 1 cpu, and uses Logical Destination Mode but the
      logical map has been overwritten and points to an inactive vcpu.
      Signed-off-by: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
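The overwrite scenario and the fix can be sketched with a toy map builder. This is a simplified model under assumptions: the `toy_*` names, the 8-slot flat map, and the lowest-set-bit slot choice are illustrative and are not the actual recalculate_apic_map() code. The point it shows is that skipping APICs that are not software enabled keeps a stale LDR from replacing a valid entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAP_SLOTS 8

/* Hypothetical per-vCPU APIC state. */
struct toy_apic {
    bool    sw_enabled; /* software enable bit */
    uint8_t ldr;        /* logical destination bits (flat mode, toy) */
    int     vcpu_id;
};

/* Fill map[slot] with the vcpu_id that owns each logical slot, or -1. */
void toy_build_logical_map(const struct toy_apic *apics, size_t n,
                           int map[TOY_MAP_SLOTS])
{
    for (size_t i = 0; i < TOY_MAP_SLOTS; i++)
        map[i] = -1;

    for (size_t i = 0; i < n; i++) {
        uint8_t ldr = apics[i].ldr;
        size_t slot = 0;

        /* The fix: an APIC that is not software enabled may hold a
         * stale LDR, so it must not populate the logical map. */
        if (!apics[i].sw_enabled || ldr == 0)
            continue;

        /* Toy rule: the lowest set bit selects the slot, so an LDR
         * with several bits set can collide with a valid entry. */
        while (!(ldr & 1)) {
            ldr >>= 1;
            slot++;
        }
        map[slot] = apics[i].vcpu_id;
    }
}
```

Without the `sw_enabled` check, a disabled APIC with a stale LDR of 0x03 would claim slot 0 and displace an enabled APIC whose LDR is 0x01.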
  14. 05 Aug, 2019 1 commit
  15. 02 Aug, 2019 1 commit
  16. 20 Jul, 2019 1 commit
    • KVM: LAPIC: Inject timer interrupt via posted interrupt · 0c5f81da
      Committed by Wanpeng Li
      Dedicated instances are currently disturbed by unnecessary jitter due
      to the emulated lapic timers firing on the same pCPUs where the
      vCPUs reside.  Unlike ARM, Intel has no hardware virtual timer for the
      guest, so both programming the timer in the guest and the firing of the
      emulated timer incur vmexits.  This patch tries to avoid the vmexit when
      the emulated timer fires, at least in the dedicated instance scenario
      when nohz_full is enabled.
      
      In that case, the emulated timers can be offloaded to the nearest busy
      housekeeping cpus, since APICv has been available in server processors
      for several years. The guest timer interrupt can then be injected via posted interrupts,
      which are delivered by the housekeeping cpu once the emulated timer fires.
      
      The host should be tuned so that vCPUs are placed on isolated physical
      processors, with several surplus pCPUs for busy housekeeping.
      If disabling mwait/hlt/pause vmexits keeps the vCPUs in non-root mode,
      a ~3% redis performance benefit can be observed on a Skylake server, and the
      number of external interrupt vmexits drops substantially.  Without the patch:
      
                 VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time  Avg time
      EXTERNAL_INTERRUPT  42916    49.43%    39.30%  0.47us    106.09us  0.71us ( +-   1.09% )
      
      With the patch:
      
                 VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time  Avg time
      EXTERNAL_INTERRUPT  6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  17. 18 Jul, 2019 1 commit
    • KVM: LAPIC: Make lapic timer unpinned · 4d151bf3
      Committed by Wanpeng Li
      Commit 61abdbe0 ("kvm: x86: make lapic hrtimer pinned") pinned the
      lapic timer so that the guest does not have to wait until the next kvm
      exit to see KVM_REQ_PENDING_TIMER set. Another solution is to give a kick
      after setting the KVM_REQ_PENDING_TIMER bit. Make the lapic timer unpinned;
      this will be used in follow-up patches.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  18. 16 Jul, 2019 1 commit
  19. 06 Jul, 2019 1 commit
  20. 05 Jul, 2019 2 commits
  21. 03 Jul, 2019 1 commit
    • KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC · bb34e690
      Committed by Wanpeng Li
      Thomas reported that:
      
       | Background:
       |
       |    In preparation of supporting IPI shorthands I changed the CPU offline
       |    code to software disable the local APIC instead of just masking it.
       |    That's done by clearing the APIC_SPIV_APIC_ENABLED bit in the APIC_SPIV
       |    register.
       |
       | Failure:
       |
       |    When the CPU comes back online the startup code triggers occasionally
       |    the warning in apic_pending_intr_clear(). That complains that the IRRs
       |    are not empty.
       |
       |    The offending vector is the local APIC timer vector whose IRR bit is set
       |    and stays set.
       |
       | It took me quite some time to reproduce the issue locally, but now I can
       | see what happens.
       |
       | It requires apicv_enabled=0, i.e. full apic emulation. With apicv_enabled=1
       | (and hardware support) it behaves correctly.
       |
       | Here is the series of events:
       |
       |     Guest CPU
       |
       |     goes down
       |
       |       native_cpu_disable()
       |
       | 			apic_soft_disable();
       |
       |     play_dead()
       |
       |     ....
       |
       |     startup()
       |
       |       if (apic_enabled())
       |         apic_pending_intr_clear()	<- Not taken
       |
       |      enable APIC
       |
       |         apic_pending_intr_clear()	<- Triggers warning because IRR is stale
       |
       | When this happens then the deadline timer or the regular APIC timer -
       | happens with both, has fired shortly before the APIC is disabled, but the
       | interrupt was not serviced because the guest CPU was in an interrupt
       | disabled region at that point.
       |
       | The state of the timer vector ISR/IRR bits:
       |
       |                               ISR     IRR
       | before apic_soft_disable()    0       1
       | after apic_soft_disable()     0       1
       |
       | On startup                    0       1
       |
       | Now one would assume that the IRR is cleared after the INIT reset, but this
       | happens only on CPU0.
       |
       | Why?
       |
       | Because our CPU0 hotplug is just for testing to make sure nothing breaks
       | and goes through an NMI wakeup vehicle because INIT would send it through
       | the bootstrap code which is not really working if that CPU was not
       | physically unplugged.
       |
       | Now looking at a real world APIC the situation in that case is:
       |
       |                               ISR     IRR
       | before apic_soft_disable()    0       1
       | after apic_soft_disable()     0       1
       |
       | On startup                    0       0
       |
       | Why?
       |
       | Once the dying CPU reenables interrupts the pending interrupt gets
       | delivered as a spurious interrupt and then the state is clear.
       |
       | While that CPU0 hotplug test case is surely an esoteric issue, the APIC
       | emulation is still wrong. Even if the play_dead() code would not enable
       | interrupts then the pending IRR bit would turn into an ISR .. interrupt
       | when the APIC is reenabled on startup.
      
      From SDM 10.4.7.2 Local APIC State After It Has Been Software Disabled
      * Pending interrupts in the IRR and ISR registers are held and require
        masking or handling by the CPU.
      
      In Thomas's testing, a hardware cpu does not respect a soft-disabled LAPIC
      when the IRR has already been set or an APICv posted interrupt is in flight,
      so we can skip the soft-disabled APIC check when clearing the IRR and setting
      the ISR, and continue to respect a soft-disabled APIC when attempting to set the IRR.
      Reported-by: Rong Chen <rong.a.chen@intel.com>
      Reported-by: Feng Tang <feng.tang@intel.com>
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rong Chen <rong.a.chen@intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
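The asymmetry this commit settles on can be sketched with a toy APIC model. This is a sketch under assumptions: the `toy_*` names and the 32-vector registers are hypothetical and much simpler than KVM's emulation. Setting a new IRR bit respects the software-disable state, while moving an already pending IRR bit into the ISR does not check it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy local APIC with 32 vectors; real APICs have 256. */
struct toy_lapic {
    bool     sw_enabled;
    uint32_t irr; /* interrupt request register */
    uint32_t isr; /* in-service register */
};

/* Accepting a new interrupt still respects software disable. */
bool toy_set_irr(struct toy_lapic *a, unsigned vec)
{
    if (vec >= 32 || !a->sw_enabled)
        return false;
    a->irr |= UINT32_C(1) << vec;
    return true;
}

/* Delivering an already pending interrupt skips the software-disable
 * check: the highest pending vector moves from IRR to ISR.
 * Returns the vector, or -1 if nothing is pending. */
int toy_deliver_highest(struct toy_lapic *a)
{
    for (int vec = 31; vec >= 0; vec--) {
        uint32_t bit = UINT32_C(1) << vec;

        if (a->irr & bit) {
            a->irr &= ~bit;
            a->isr |= bit;
            return vec;
        }
    }
    return -1;
}
```

In this model an interrupt latched before the APIC is software disabled is still delivered afterwards, so no stale IRR bit survives until the APIC is re-enabled.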
  22. 20 Jun, 2019 1 commit
    • KVM: x86: Fix apic dangling pointer in vcpu · a251fb90
      Committed by Saar Amar
      The function kvm_create_lapic() attempts to allocate the apic structure
      and sets a pointer to it in the virtual processor structure. However, if
      get_zeroed_page() fails, the function frees the apic chunk but forgets
      to set the pointer in the vcpu to NULL. It's not a security issue since
      the pointer is not used if kvm_create_lapic() returns an error,
      but it's more accurate that way.
      Signed-off-by: Saar Amar <saaramar@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
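The error path described here follows a common pattern that can be sketched in userspace C. This is an illustration under assumptions: the `toy_*` names, the `fail_regs_alloc` switch, and the use of calloc() in place of kernel allocators are all hypothetical. The sketch clears the owner's pointer on the failure path so that nothing is left dangling.

```c
#include <stdlib.h>

struct toy_apic {
    void *regs; /* stands in for the register page */
};

struct toy_vcpu {
    struct toy_apic *apic;
};

/* Allocate the apic and its register page. On failure, free what was
 * allocated and reset vcpu->apic so no dangling pointer remains.
 * fail_regs_alloc is a test knob that simulates the second allocation
 * failing. Returns 0 on success, -1 on failure. */
int toy_create_lapic(struct toy_vcpu *vcpu, int fail_regs_alloc)
{
    struct toy_apic *apic = calloc(1, sizeof(*apic));

    if (!apic)
        return -1;
    vcpu->apic = apic;

    apic->regs = fail_regs_alloc ? NULL : calloc(1, 4096);
    if (!apic->regs) {
        free(apic);
        vcpu->apic = NULL; /* the step the original code forgot */
        return -1;
    }
    return 0;
}

/* Release everything and leave the vcpu without an apic. */
void toy_free_lapic(struct toy_vcpu *vcpu)
{
    if (!vcpu->apic)
        return;
    free(vcpu->apic->regs);
    free(vcpu->apic);
    vcpu->apic = NULL;
}
```

Resetting the pointer costs one store and makes a later accidental use fail as a NULL dereference rather than a use-after-free.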
  23. 19 Jun, 2019 1 commit
  24. 18 Jun, 2019 2 commits
  25. 05 Jun, 2019 3 commits
    • KVM: LAPIC: Optimize timer latency further · b6c4bc65
      Committed by Wanpeng Li
      Advancing the lapic timer tries to hide the hypervisor overhead between the
      host emulated timer firing and the guest becoming aware that the timer has fired. However,
      it only hides the time between apic_timer_fn/handle_preemption_timer ->
      wait_lapic_expire, instead of the real position of vmentry which is
      mentioned in the original commit d0659d94 ("KVM: x86: add option to
      advance tscdeadline hrtimer expiration"). There are 700+ cpu cycles between
      the end of wait_lapic_expire and the world switch on my haswell desktop.
      
      This patch tries to narrow the last gap (wait_lapic_expire -> world switch):
      it takes the real overhead time between apic_timer_fn/handle_preemption_timer
      and the world switch into consideration when adaptively tuning timer
      advancement. The patch can reduce latency by 40% (~1600+ cycles to ~1000+ cycles
      on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
      busy waits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Delay trace_kvm_wait_lapic_expire tracepoint to after vmexit · ec0671d5
      Committed by Wanpeng Li
      The wait_lapic_expire() call was moved above guest_enter_irqoff() because of
      its tracepoint, which violated the RCU extended quiescent state invoked
      by guest_enter_irqoff()[1][2]. This patch simply moves the tracepoint
      below guest_exit_irqoff() in vcpu_enter_guest(). Snapshot the delta before
      VM-Enter, but trace it after VM-Exit. This allows wait_lapic_expire() to be
      moved to just before vmentry in a later patch.
      
      [1] Commit 8b89fe1f ("kvm: x86: move tracepoints outside extended quiescent state")
      [2] https://patchwork.kernel.org/patch/7821111/
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Track whether wait_lapic_expire was called, and do not invoke the tracepoint
       if not. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Extract adaptive tune timer advancement logic · 84ea3aca
      Committed by Wanpeng Li
      Extract adaptive tune timer advancement logic to a single function.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Rename new function. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  26. 01 May, 2019 2 commits