1. 15 November 2019 (2 commits)
    • KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC · b35e5548
      Authored by Like Xu
      Currently, a host perf_event is created to emulate vPMC functionality.
      Whether a disabled perf_event will be reused is hard to predict. If
      disabled perf_events are not reused for a considerable period of time,
      these obsolete events add host context-switch overhead that could have
      been avoided.
      
      If the guest doesn't WRMSR any of the vPMC's MSRs during an entire vCPU
      scheduling time slice, and the vPMC's individual enable bit isn't set,
      we can predict that the guest has finished using this vPMC. We then
      request KVM_REQ_PMU in kvm_arch_sched_in() and release those perf_events
      in the first call of kvm_pmu_handle_event() after the vCPU is scheduled in.
      
      This lazy mechanism delays the event release to the beginning of the
      next scheduled time slice if the vPMC's MSRs weren't changed during the
      current slice. If the guest comes back to use the vPMC in the next time
      slice, a new perf_event is re-created via perf_event_create_kernel_counter()
      as usual.
      Suggested-by: Wei Wang <wei.w.wang@intel.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
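      A minimal user-space model of the mechanism described above, not the patch
      itself; all struct and function names here are hypothetical stand-ins for
      the real kvm_pmu/kvm_pmc machinery and the KVM_REQ_PMU request:

        #include <stdbool.h>
        #include <stdio.h>

        #define NR_VPMC 4

        struct vpmc {
                bool has_perf_event;  /* host perf_event backs this vPMC */
                bool enable_bit;      /* the vPMC's individual enable bit */
                bool touched;         /* any of its MSRs written this slice */
        };

        static bool pmu_event_requested;  /* stands in for KVM_REQ_PMU */

        /* Modeled on kvm_arch_sched_in(): a new time slice begins. */
        static void sched_in(void)
        {
                pmu_event_requested = true;
        }

        /* Modeled on the first kvm_pmu_handle_event() after sched-in. */
        static void pmu_handle_event(struct vpmc *pmc, int n)
        {
                if (!pmu_event_requested)
                        return;
                pmu_event_requested = false;

                for (int i = 0; i < n; i++) {
                        /* Untouched last slice and not enabled: the guest
                         * is presumably done with this vPMC. */
                        if (pmc[i].has_perf_event &&
                            !pmc[i].touched && !pmc[i].enable_bit) {
                                pmc[i].has_perf_event = false;
                                printf("vPMC %d released lazily\n", i);
                        }
                        pmc[i].touched = false;  /* track the next slice */
                }
        }

        int main(void)
        {
                struct vpmc pmcs[NR_VPMC] = { { .has_perf_event = true } };

                sched_in();                     /* slice saw no WRMSR */
                pmu_handle_event(pmcs, NR_VPMC);
                return 0;
        }

      If the guest touches the vPMC again in a later slice, the model simply
      sets has_perf_event back to true, mirroring the re-creation via
      perf_event_create_kernel_counter().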
    • KVM: x86/vPMU: Reuse perf_event to avoid unnecessary pmc_reprogram_counter · a6da0d77
      Authored by Like Xu
      The perf_event_create_kernel_counter() call in pmc_reprogram_counter() is
      a heavyweight, high-frequency operation, especially when the host disables
      the watchdog (up to 21000000 ns), which leads to unacceptable latency in
      the guest NMI handler and limits the use of vPMUs in the guest.
      
      When a vPMC is fully enabled, the legacy reprogram_*_counter() stops and
      releases its existing perf_event (if any) every time, even though in most
      cases an almost identical perf_event is then created and configured again.
      
      For each vPMC, if the requested config ('u64 eventsel' for gp and 'u8 ctrl'
      for fixed) is the same as its current config AND a new sample period based
      on pmc->counter is accepted by the host perf interface, the current event
      can be reused safely and behaves just like a newly created one. Otherwise,
      release the undesirable perf_event and reprogram a new one as usual.
      
      It's lightweight to call pmc_pause_counter() (disable, read and reset the
      event) and pmc_resume_counter() (recalibrate the period and re-enable the
      event) as the guest expects, instead of releasing and re-creating the event
      every time. Rather than relying on the filterable event->attr or hw.config,
      a new 'u64 current_config' field is added to save the last originally
      requested config for each vPMC.
      
      Based on this implementation, the number of calls to pmc_reprogram_counter()
      is reduced by ~82.5% for a gp sampling event and ~99.9% for a fixed event.
      When the multiplexing perf sampling mode is in use, the average latency of
      the guest NMI handler drops from 104923 ns to 48393 ns (~2.16x speedup).
      If the host disables the watchdog, the minimum latency of the guest NMI
      handler improves by ~3413x (from 20407603 ns to 5979 ns) and by ~786x on
      average.
      Suggested-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
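      Below is a hedged user-space sketch of that fast path; pmc_pause_counter(),
      pmc_resume_counter() and the 'current_config' field are from the patch,
      while everything else (struct layout, the perf-period check being elided)
      is simplified for illustration:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct vpmc {
                uint64_t counter;
                uint64_t current_config;  /* last programmed config */
                bool has_event;
        };

        /* Disable the event, read the final count, reset period state. */
        static void pmc_pause_counter(struct vpmc *pmc)
        {
                printf("pause at counter=%llu\n",
                       (unsigned long long)pmc->counter);
        }

        /* Reuse only if the config matches; the real code also asks perf
         * whether the new sample period is acceptable. */
        static bool pmc_resume_counter(struct vpmc *pmc, uint64_t config)
        {
                if (!pmc->has_event || pmc->current_config != config)
                        return false;
                printf("resume: existing event reused\n");
                return true;
        }

        static void reprogram(struct vpmc *pmc, uint64_t config)
        {
                pmc_pause_counter(pmc);
                if (pmc_resume_counter(pmc, config))
                        return;  /* fast path: no perf_event churn */
                /* slow path: release, then re-create the perf_event
                 * (perf_event_create_kernel_counter() in the kernel) */
                pmc->has_event = true;
                pmc->current_config = config;
                printf("created new event for config %#llx\n",
                       (unsigned long long)config);
        }

        int main(void)
        {
                struct vpmc pmc = { .counter = 100 };

                reprogram(&pmc, 0xc0);  /* first write: creates the event */
                reprogram(&pmc, 0xc0);  /* same config: pause/resume path */
                return 0;
        }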
  2. 22 October 2019 (5 commits)
  3. 27 September 2019 (1 commit)
    • KVM: x86: assign two bits to track SPTE kinds · 6eeb4ef0
      Authored by Paolo Bonzini
      Currently, we are overloading SPTE_SPECIAL_MASK to mean both
      "A/D bits unavailable" and MMIO, where the difference between the
      two is determined by mmio_mask and mmio_value.
      
      However, the next patch will need two bits to distinguish
      availability of A/D bits from write protection.  So, while at it,
      give MMIO its own bit pattern, and move the two bits from
      bit 62 to bits 52..53, since Intel is allocating EPT page-table
      bits from the top.
      Reviewed-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
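      A small stand-alone illustration of the new layout; the exact encodings
      live in the patch, so the values below are one plausible reading of "two
      bits at 52..53, with MMIO given its own pattern":

        #include <stdint.h>
        #include <stdio.h>

        /* Two software-available bits at 52..53 (illustrative values). */
        static const uint64_t SPTE_SPECIAL_MASK = 3ULL << 52;
        static const uint64_t SPTE_AD_DISABLED  = 1ULL << 52;
        static const uint64_t SPTE_MMIO         = 3ULL << 52;

        int main(void)
        {
                uint64_t spte = SPTE_AD_DISABLED;

                printf("A/D disabled: %d\n",
                       (spte & SPTE_SPECIAL_MASK) == SPTE_AD_DISABLED);
                printf("MMIO:         %d\n",
                       (spte & SPTE_SPECIAL_MASK) == SPTE_MMIO);
                return 0;
        }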
  4. 25 September 2019 (3 commits)
  5. 24 September 2019 (8 commits)
  6. 14 September 2019 (1 commit)
    • KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot · 002c5f73
      Authored by Sean Christopherson
      James Harvey reported a livelock that was introduced by commit
      d012a06a ("Revert "KVM: x86/mmu: Zap only the relevant pages when
      removing a memslot"").
      
      The livelock occurs because kvm_mmu_zap_all() as it exists today will
      voluntarily reschedule and drop KVM's mmu_lock, which allows other vCPUs
      to add shadow pages.  With enough vCPUs, kvm_mmu_zap_all() can get stuck
      in an infinite loop as it can never zap all pages before observing lock
      contention or the need to reschedule.  The equivalent of kvm_mmu_zap_all()
      that was in use at the time of the reverted commit (4e103134, "KVM:
      x86/mmu: Zap only the relevant pages when removing a memslot") employed
      a fast invalidate mechanism and was not susceptible to the above livelock.
      
      There are three ways to fix the livelock:
      
      - Reverting the revert (commit d012a06a) is not a viable option, as the
        revert is needed to fix a regression that occurs when the guest has
        one or more assigned devices.  It's unlikely we'll root-cause the device
        assignment regression soon enough to fix it in a timely manner.
      
      - Remove the conditional reschedule from kvm_mmu_zap_all().  Although
        removing the reschedule would be a smaller code change, it's less safe
        in the sense that the resulting kvm_mmu_zap_all() hasn't been used in
        the wild for flushing memslots since the fast invalidate mechanism was
        introduced by commit 6ca18b69 ("KVM: x86: use the fast way to
        invalidate all pages"), back in 2013.
      
      - Reintroduce the fast invalidate mechanism and use it when zapping shadow
        pages in response to a memslot being deleted/moved, which is what this
        patch does.
      
      For all intents and purposes, this is a revert of commit ea145aac
      ("Revert "KVM: MMU: fast invalidate all pages"") and a partial revert of
      commit 7390de1e ("Revert "KVM: x86: use the fast way to invalidate
      all pages""), i.e. restores the behavior of commit 5304b8d3 ("KVM:
      MMU: fast invalidate all pages") and commit 6ca18b69 ("KVM: x86:
      use the fast way to invalidate all pages") respectively.
      
      Fixes: d012a06a ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"")
      Reported-by: James Harvey <jamespharvey20@gmail.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
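      The core idea of the fast invalidate mechanism can be modeled in a few
      lines: bump a generation number, after which every pre-existing shadow
      page is "obsolete" and can be zapped with the lock dropped in between,
      because concurrently added pages carry the new generation and never
      rejoin the work set. This is a sketch of the concept, not the kernel code:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct shadow_page {
                uint64_t gen;
                bool zapped;
        };

        static uint64_t mmu_valid_gen;

        static bool is_obsolete(const struct shadow_page *sp)
        {
                return sp->gen != mmu_valid_gen;
        }

        static void fast_invalidate_all(struct shadow_page *pages, int n)
        {
                mmu_valid_gen++;  /* only pre-existing pages now match */

                for (int i = 0; i < n; i++) {
                        /* Safe to reschedule inside this loop: pages
                         * added meanwhile carry the new generation. */
                        if (is_obsolete(&pages[i]))
                                pages[i].zapped = true;
                }
        }

        int main(void)
        {
                struct shadow_page pages[3] = { { 0 } };

                fast_invalidate_all(pages, 3);
                printf("old page zapped: %d\n", pages[0].zapped);

                struct shadow_page fresh = { .gen = mmu_valid_gen };
                printf("fresh page obsolete: %d\n", is_obsolete(&fresh));
                return 0;
        }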
  7. 12 September 2019 (1 commit)
    • KVM: x86: Fix INIT signal handling in various CPU states · 4b9852f4
      Authored by Liran Alon
      Commit cd7764fe ("KVM: x86: latch INITs while in system management mode")
      changed the code to latch INIT while a vCPU is in SMM and to process the
      latched INIT when leaving SMM. It left a subtle remark in the commit
      message that similar treatment should also be applied while the vCPU is
      in VMX non-root mode.
      
      However, INIT signals should actually be latched in various vCPU states:
      (*) For both Intel and AMD, INIT signals should be latched while vCPU
      is in SMM.
      (*) For Intel, INIT should also be latched while vCPU is in VMX
      operation and later processed when vCPU leaves VMX operation by
      executing VMXOFF.
      (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
      or in guest-mode with intercept defined on INIT signal.
      
      To fix this:
      1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
      can define the various CPU states in which INIT signals should be
      blocked and modify kvm_apic_accept_events() to use it.
      2) Modify vmx_check_nested_events() to check for a pending INIT signal
      while the vCPU is in guest-mode. If so, emulate a vmexit on
      EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour,
      but this is currently left as a TODO comment to implement in the future
      because nSVM doesn't yet implement svm_check_nested_events().
      
      Note: Currently, KVM's nVMX implementation doesn't support the VMX
      wait-for-SIPI activity state as specified in MSR_IA32_VMX_MISC bits 6:8
      exposed to the guest (see nested_vmx_setup_ctls_msrs()).
      If and when support for this activity state is implemented,
      kvm_check_nested_events() would need to avoid emulating a vmexit on an
      INIT signal when the activity state is wait-for-SIPI. In addition,
      kvm_apic_accept_events() would need to be modified to avoid discarding
      a SIPI when the VMX activity state is wait-for-SIPI, and instead delay
      SIPI processing to vmx_check_nested_events(), which would clear
      pending APIC events and emulate a vmexit on SIPI.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
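      The shape of fix (1) can be sketched as a vendor callback consulted by
      common APIC code; this is a simplified model, with names shortened from
      kvm_x86_ops->apic_init_signal_blocked() and the real state tracking:

        #include <stdbool.h>
        #include <stdio.h>

        struct vcpu {
                bool in_smm;            /* both vendors latch here */
                bool in_vmx_operation;  /* Intel: latch until VMXOFF */
                bool gif_clear;         /* AMD: latch while GIF=0 */
        };

        /* Intel's hook, per the rules above. */
        static bool vmx_init_blocked(const struct vcpu *v)
        {
                return v->in_vmx_operation;
        }

        /* AMD's hook. */
        static bool svm_init_blocked(const struct vcpu *v)
        {
                return v->gif_clear;
        }

        /* Common code, modeled on kvm_apic_accept_events(). */
        static bool init_blocked(const struct vcpu *v,
                                 bool (*vendor)(const struct vcpu *))
        {
                return v->in_smm || vendor(v);
        }

        int main(void)
        {
                struct vcpu v = { .in_vmx_operation = true };

                printf("INIT latched (vmx): %d\n",
                       init_blocked(&v, vmx_init_blocked));
                printf("INIT latched (svm): %d\n",
                       init_blocked(&v, svm_init_blocked));
                return 0;
        }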
  8. 11 September 2019 (2 commits)
  9. 10 September 2019 (1 commit)
  10. 22 August 2019 (2 commits)
  11. 05 August 2019 (2 commits)
    • KVM: remove kvm_arch_has_vcpu_debugfs() · 741cbbae
      Authored by Paolo Bonzini
      There is no need for this function, as all arches have to implement
      kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
      lets us actually simplify the code.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
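      The pattern reads roughly like the stand-alone model below (the macro
      name is illustrative; the kernel uses its own guard symbol in the arch
      headers):

        #include <stdio.h>

        /* The arch header advertises the hook at compile time ... */
        #define ARCH_HAS_VCPU_DEBUGFS

        #ifdef ARCH_HAS_VCPU_DEBUGFS
        static void arch_create_vcpu_debugfs(void)
        {
                puts("per-vCPU debugfs entries created");
        }
        #endif

        int main(void)
        {
        #ifdef ARCH_HAS_VCPU_DEBUGFS
                /* ... so common code compiles the call in only when the
                 * symbol exists; no runtime has-hook query is needed. */
                arch_create_vcpu_debugfs();
        #endif
                return 0;
        }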
    • KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Authored by Wanpeng Li
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five-year-old bug was exposed. Running the ebizzy benchmark in three
      80-vCPU VMs on one 80-pCPU Skylake server produced a lot of rcu_sched
      stall warning splats in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true before finish_swait() is called in
      kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
      by the kvm_vcpu_on_spin() loop, which greatly increases the probability
      that the condition kvm_arch_vcpu_runnable(vcpu) is checked and found
      true. When APICv is enabled, the yield-candidate vCPU's VMCS RVI field
      then leaks (via vmx_sync_pir_to_irr()) into the spinning-on-a-taken-lock
      vCPU's current VMCS.
      
      This patch fixes it by conservatively checking only a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
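      The conservative check can be modeled as a side-effect-free predicate
      used only by the spin loop; field and function names below are
      illustrative, and the point is that this subset never touches another
      vCPU's loaded VMCS:

        #include <stdbool.h>
        #include <stdio.h>

        struct vcpu {
                bool has_pending_timer;
                bool apicv_active;
                bool pir_nonempty;  /* posted-interrupt bits set */
        };

        /* The full runnability check on VMX with APICv would sync PIR
         * into the *currently loaded* VMCS (vmx_sync_pir_to_irr()),
         * which is wrong when called from another vCPU's context. */

        /* Conservative subset for kvm_vcpu_on_spin(): side-effect free. */
        static bool vcpu_dy_runnable(const struct vcpu *v)
        {
                if (v->has_pending_timer)
                        return true;
                if (v->apicv_active && v->pir_nonempty)
                        return true;  /* pending, without touching VMCS */
                return false;
        }

        int main(void)
        {
                struct vcpu v = { .apicv_active = true,
                                  .pir_nonempty = true };

                printf("boost candidate: %d\n", vcpu_dy_runnable(&v));
                return 0;
        }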
  12. 22 July 2019 (2 commits)
    • KVM: X86: Dynamically allocate user_fpu · d9a710e5
      Authored by Wanpeng Li
      After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
      for user), struct kvm_vcpu is 19456 bytes on my server, and
      PAGE_ALLOC_COSTLY_ORDER (3) is the order at which allocations are deemed
      costly to service. In serverless scenarios, one host can service hundreds
      or thousands of firecracker/kata-container instances; however, new
      instances fail to launch once memory becomes too fragmented to allocate
      the kvm_vcpu struct on the host. This was observed in some cloud
      providers' production environments.
      
      This patch dynamically allocates user_fpu; kvm_vcpu is now 15168 bytes on
      my Skylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
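      The structural change is simply moving the large save area behind a
      pointer; here is a stand-alone model (the kernel allocates from a kmem
      cache, not the libc heap, and the sizes are illustrative):

        #include <stdio.h>
        #include <stdlib.h>

        struct fpu_state { unsigned char regs[4096]; };

        struct vcpu {
                /* ... other fields ... */
                struct fpu_state *user_fpu;  /* was an embedded struct */
        };

        int main(void)
        {
                struct vcpu *v = calloc(1, sizeof(*v));

                if (!v)
                        return 1;
                /* The big allocation is now separate and page-sized,
                 * so a fragmented host can still satisfy it. */
                v->user_fpu = calloc(1, sizeof(*v->user_fpu));
                if (!v->user_fpu) {
                        free(v);
                        return 1;
                }
                printf("vcpu shell is %zu bytes\n", sizeof(struct vcpu));
                free(v->user_fpu);
                free(v);
                return 0;
        }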
    • Revert "kvm: x86: Use task structs fpu field for user" · ec269475
      Authored by Paolo Bonzini
      This reverts commit 240c35a3
      ("kvm: x86: Use task structs fpu field for user", 2018-11-06).
      The commit is broken and causes QEMU's FPU state to be destroyed
      when KVM_RUN is preempted.
      
      Fixes: 240c35a3 ("kvm: x86: Use task structs fpu field for user")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. 19 July 2019 (1 commit)
    • x86/kvm: Don't call kvm_spurious_fault() from .fixup · 3901336e
      Authored by Josh Poimboeuf
      After making a change to improve objtool's sibling call detection, it
      started showing the following warning:
      
        arch/x86/kvm/vmx/nested.o: warning: objtool: .fixup+0x15: sibling call from callable instruction with modified stack frame
      
      The problem is the ____kvm_handle_fault_on_reboot() macro.  It does a
      fake call by pushing a fake RIP and doing a jump.  That tricks the
      unwinder into printing the function which triggered the exception,
      rather than the .fixup code.
      
      Instead of the hack to make it look like the original function made the
      call, just change the macro so that the original function actually does
      make the call.  This allows removal of the hack, and also makes objtool
      happy.
      
      I triggered a vmx instruction exception and verified that the stack
      trace is still sane:
      
        kernel BUG at arch/x86/kvm/x86.c:358!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 28 PID: 4096 Comm: qemu-kvm Not tainted 5.2.0+ #16
        Hardware name: Lenovo THINKSYSTEM SD530 -[7X2106Z000]-/-[7X2106Z000]-, BIOS -[TEE113Z-1.00]- 07/17/2017
        RIP: 0010:kvm_spurious_fault+0x5/0x10
        Code: 00 00 00 00 00 8b 44 24 10 89 d2 45 89 c9 48 89 44 24 10 8b 44 24 08 48 89 44 24 08 e9 d4 40 22 00 0f 1f 40 00 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
        RSP: 0018:ffffbf91c683bd00 EFLAGS: 00010246
        RAX: 000061f040000000 RBX: ffff9e159c77bba0 RCX: ffff9e15a5c87000
        RDX: 0000000665c87000 RSI: ffff9e15a5c87000 RDI: ffff9e159c77bba0
        RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9e15a5c87000
        R10: 0000000000000000 R11: fffff8f2d99721c0 R12: ffff9e159c77bba0
        R13: ffffbf91c671d960 R14: ffff9e159c778000 R15: 0000000000000000
        FS:  00007fa341cbe700(0000) GS:ffff9e15b7400000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fdd38356804 CR3: 00000006759de003 CR4: 00000000007606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        PKRU: 55555554
        Call Trace:
         loaded_vmcs_init+0x4f/0xe0
         alloc_loaded_vmcs+0x38/0xd0
         vmx_create_vcpu+0xf7/0x600
         kvm_vm_ioctl+0x5e9/0x980
         ? __switch_to_asm+0x40/0x70
         ? __switch_to_asm+0x34/0x70
         ? __switch_to_asm+0x40/0x70
         ? __switch_to_asm+0x34/0x70
         ? free_one_page+0x13f/0x4e0
         do_vfs_ioctl+0xa4/0x630
         ksys_ioctl+0x60/0x90
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x55/0x1c0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fa349b1ee5b
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/64a9b64d127e87b6920a97afde8e96ea76f6524e.1563413318.git.jpoimboe@redhat.com
  14. 11 July 2019 (1 commit)
    • KVM: x86: PMU Event Filter · 66bb8a06
      Authored by Eric Hankland
      Some events can provide a guest with information about other guests or the
      host (e.g. L3 cache stats); providing the capability to restrict access
      to a "safe" set of events would limit the potential for the PMU to be used
      in any side channel attacks. This change introduces a new VM ioctl that
      sets an event filter. If the guest attempts to program a counter for
      any blacklisted or non-whitelisted event, the kernel counter won't be
      created, so any RDPMC/RDMSR will show 0 instances of that event.
      Signed-off-by: Eric Hankland <ehankland@google.com>
      [Lots of changes. All remaining bugs are probably mine. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
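      From user space the filter is installed with a single VM ioctl
      (KVM_SET_PMU_EVENT_FILTER in the merged uAPI); the sketch below uses an
      illustrative mirror of the struct, since the authoritative layout is in
      <linux/kvm.h>, and the event value and action encoding are assumptions:

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Illustrative mirror of the new uAPI struct. */
        struct pmu_event_filter {
                uint32_t action;   /* e.g. 0 = allow only listed events */
                uint32_t nevents;
                uint64_t events[];
        };

        int main(void)
        {
                struct pmu_event_filter *f =
                        malloc(sizeof(*f) + sizeof(uint64_t));

                if (!f)
                        return 1;
                f->action = 0;          /* whitelist semantics */
                f->nevents = 1;
                f->events[0] = 0x003c;  /* eventsel|umask, illustrative */

                /* With a real VM fd one would then call:
                 *   ioctl(vm_fd, KVM_SET_PMU_EVENT_FILTER, f);
                 * Counters for filtered-out events are never created,
                 * so RDPMC/RDMSR read back 0 for them. */
                printf("filter holds %u event(s)\n", f->nevents);
                free(f);
                return 0;
        }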
  15. 19 June 2019 (1 commit)
  16. 18 June 2019 (3 commits)
  17. 05 June 2019 (3 commits)
  18. 01 May 2019 (1 commit)
    • x86/kvm: Implement HWCR support · 191c8137
      Authored by Borislav Petkov
      The hardware configuration register (HWCR) has some useful bits which can
      be used by guests. Implement McStatusWrEn, which guests can use when
      injecting MCEs with the in-kernel mce-inject module.
      
      For that, we need to set bit 18 - McStatusWrEn - first, before writing
      the MCi_STATUS registers (otherwise we #GP).
      
      Add the required machinery to do so.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: KVM <kvm@vger.kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
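      The gating logic is small enough to model directly; this sketch keeps
      only the McStatusWrEn check (MSR numbers, the surrounding WRMSR plumbing,
      and the per-bank storage are omitted):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define HWCR_MCSTATUS_WREN (1ULL << 18)  /* bit 18: McStatusWrEn */

        struct vcpu {
                uint64_t msr_hwcr;
        };

        /* Guest write to an MCi_STATUS register: a non-zero payload is
         * only legal once McStatusWrEn has been set in HWCR. */
        static bool set_mci_status(struct vcpu *v, uint64_t data)
        {
                if (data != 0 && !(v->msr_hwcr & HWCR_MCSTATUS_WREN))
                        return false;  /* the real code injects #GP */
                /* ... store into the emulated MCi_STATUS bank ... */
                return true;
        }

        int main(void)
        {
                struct vcpu v = { 0 };

                printf("before WrEn: %d\n",
                       set_mci_status(&v, 0xb000000000000000ULL));
                v.msr_hwcr |= HWCR_MCSTATUS_WREN;  /* guest WRMSR HWCR */
                printf("after WrEn:  %d\n",
                       set_mci_status(&v, 0xb000000000000000ULL));
                return 0;
        }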