1. 12 Nov, 2019 3 commits
    • KVM: VMX: Do not change PID.NDST when loading a blocked vCPU · 132194ff
      Committed by Joao Martins
      When a vCPU enters the block phase, pi_pre_block() inserts the vCPU into a
      per-pCPU linked list of all vCPUs that are blocked on this pCPU. Afterwards,
      it changes PID.NV to POSTED_INTR_WAKEUP_VECTOR, whose handler
      (wakeup_handler()) is responsible for kicking (unblocking) any vCPU on that
      linked list that now has pending posted interrupts.
      
      While the vCPU is blocked (in kvm_vcpu_block()), it may be preempted, which
      causes vmx_vcpu_pi_put() to set PID.SN. If the vCPU is later scheduled to
      run on a different pCPU, vmx_vcpu_pi_load() clears PID.SN but also
      *overwrites PID.NDST with this different pCPU*, instead of keeping the
      original pCPU on which the vCPU entered the block phase.
      
      This is a problem because when a posted interrupt is delivered,
      wakeup_handler() runs on the pCPU that PID.NDST now points to and fails to
      find the blocked vCPU on its per-pCPU linked list: the vCPU was placed on a
      *different* per-pCPU linked list, namely that of the original pCPU on which
      it entered the block phase.
      
      The regression was introduced by commit c112b5f5 ("KVM: x86:
      Recompute PID.ON when clearing PID.SN"). Therefore, partially revert
      it and reintroduce the condition in vmx_vcpu_pi_load() responsible for
      avoiding changing PID.NDST when loading a blocked vCPU.
      
      Fixes: c112b5f5 ("KVM: x86: Recompute PID.ON when clearing PID.SN")
      Tested-by: Nathan Ni <nathan.ni@oracle.com>
      Co-developed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      132194ff
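The condition this commit reintroduces can be sketched roughly as follows. This is a simplified model, not the kernel code: `pi_desc_model`, `pi_load_model()` and the two vector constants are hypothetical stand-ins, and the real descriptor is updated atomically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's vector constants; the values are illustrative. */
#define WAKEUP_VECTOR_MODEL 0xf1u
#define NOTIFY_VECTOR_MODEL 0xf2u

/* Hypothetical, simplified posted-interrupt descriptor (not the real layout). */
struct pi_desc_model {
    bool     sn;   /* PID.SN: suppress notifications */
    uint8_t  nv;   /* PID.NV: notification vector */
    uint32_t ndst; /* PID.NDST: pCPU that receives the notification */
};

/* Model of loading a vCPU onto pCPU `cpu`. */
static void pi_load_model(struct pi_desc_model *pid, uint32_t cpu)
{
    assert(pid != NULL);

    /*
     * A blocked vCPU still sits on the wakeup list of the pCPU named by
     * NDST, so only clear SN and leave NDST untouched.
     */
    if (pid->nv == WAKEUP_VECTOR_MODEL) {
        pid->sn = false;
        return;
    }

    /* Otherwise retarget notifications to the new pCPU and clear SN. */
    pid->ndst = cpu;
    pid->sn = false;
}
```

In this model, a descriptor whose vector is the wakeup vector keeps its original destination across a migration, which is the property the wakeup handler depends on.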
    • KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts · 9482ae45
      Committed by Joao Martins
      Commit 17e433b5 ("KVM: Fix leak vCPU's VMCS value into other pCPU")
      introduced vmx_dy_apicv_has_pending_interrupt() in order to determine
      whether a vCPU has a pending posted interrupt. This routine is used by
      kvm_vcpu_on_spin() when searching for a new runnable vCPU to schedule
      on the pCPU instead of a vCPU that is busy-looping.
      
      vmx_dy_apicv_has_pending_interrupt() determines whether a
      vCPU has a pending posted interrupt solely based on PID.ON. However,
      when a vCPU is preempted, vmx_vcpu_pi_put() sets PID.SN, which causes
      raised posted interrupts to only set a bit in PID.PIR without setting
      PID.ON (and without sending a notification vector), as described in VT-d
      manual section 5.2.3 "Interrupt-Posting Hardware Operation".
      
      Therefore, checking PID.ON is insufficient to determine whether a vCPU has
      pending posted interrupts; we should also check whether any bit is set in
      PID.PIR when PID.SN=1.
      
      Fixes: 17e433b5 ("KVM: Fix leak vCPU's VMCS value into other pCPU")
      Reviewed-by: Jagannathan Raman <jag.raman@oracle.com>
      Co-developed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9482ae45
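The check described in this commit message amounts to "ON is set, or SN is set and PIR is non-empty". A minimal sketch under that reading, where `pid_model` and both helper names are hypothetical and the real descriptor packs these fields into one structure read atomically:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PIR_WORDS_MODEL 8 /* 256 vectors / 32 bits per word */

/* Hypothetical, simplified posted-interrupt descriptor. */
struct pid_model {
    uint32_t pir[PIR_WORDS_MODEL]; /* PID.PIR: one bit per pending vector */
    bool     on;                   /* PID.ON: outstanding notification */
    bool     sn;                   /* PID.SN: suppress notifications */
};

/* Returns true when no vector bit is set in PIR. */
static bool pir_is_empty_model(const struct pid_model *pid)
{
    for (size_t i = 0; i < PIR_WORDS_MODEL; i++)
        if (pid->pir[i] != 0)
            return false;
    return true;
}

/*
 * With SN set, hardware records the vector in PIR but leaves ON clear,
 * so ON alone under-reports pending interrupts.
 */
static bool has_pending_posted_interrupt_model(const struct pid_model *pid)
{
    return pid->on || (pid->sn && !pir_is_empty_model(pid));
}
```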
    • KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON · d9ff2744
      Committed by Liran Alon
      The Outstanding Notification (ON) bit is part of the Posted Interrupt
      Descriptor (PID) as opposed to the Posted Interrupts Register (PIR).
      The latter is a bitmap for pending vectors.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d9ff2744
  2. 31 Oct, 2019 1 commit
    • KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active · 9167ab79
      Committed by Paolo Bonzini
      VMX already does so if the host has SMEP, in order to support the combination of
      CR0.WP=1 and CR4.SMEP=1.  However, it is perfectly safe to always do so, and in
      fact VMX already ends up running with EFER.NXE=1 on old processors that lack the
      "load EFER" controls, because doing so may help avoid a slow MSR write.  Removing
      all the conditionals simplifies the code.
      
      SVM does not have similar code, but it should since recent AMD processors do
      support SMEP.  So this patch also makes the code for the two vendors more similar
      while fixing NPT=0, CR0.WP=1 and CR4.SMEP=1 on AMD processors.
      
      Cc: stable@vger.kernel.org
      Cc: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9167ab79
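The rule stated in the title can be sketched as a one-line computation of the EFER value the hardware runs with. `hw_efer_model()` and its `shadow_paging` parameter are hypothetical; only the position of the NXE bit (bit 11 of EFER) is taken from the architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EFER_NX_MODEL (1ULL << 11) /* EFER.NXE is bit 11 */

/*
 * Returns the EFER value to run the guest with. When shadow paging is
 * active, NXE is forced on regardless of what the guest requested;
 * with hardware-assisted paging the guest value is used unchanged.
 */
static uint64_t hw_efer_model(uint64_t guest_efer, bool shadow_paging)
{
    if (shadow_paging)
        return guest_efer | EFER_NX_MODEL;
    return guest_efer;
}
```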
  3. 22 Oct, 2019 1 commit
    • KVM: VMX: Remove specialized handling of unexpected exit-reasons · 1a8211c7
      Committed by Liran Alon
      Commit bf653b78 ("KVM: vmx: Introduce handle_unexpected_vmexit
      and handle WAITPKG vmexit") introduced specialized handling of
      specific exit-reasons that should not be raised by CPU because
      KVM configures VMCS such that they should never be raised.
      
      However, since commit 7396d337 ("KVM: x86: Return to userspace
      with internal error on unexpected exit reason"), VMX & SVM
      exit handlers were modified to generically handle all unexpected
      exit-reasons by returning to userspace with internal error.
      
      Therefore, there is no need for specialized handling of specific
      unexpected exit-reasons. (This specialized handling also introduced an
      inconsistency: for these exit-reasons, KVM silently skipped the guest
      instruction instead of returning to userspace with an internal error.)
      
      Fixes: bf653b78 ("KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit")
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1a8211c7
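The generic handling referred to above can be pictured as a handler table with a single fallback path. This is only a sketch: the exit-reason names, `run_model`, and the return codes (1 to resume the guest, 0 to return to userspace) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RESUME_GUEST_MODEL      1
#define EXIT_TO_USERSPACE_MODEL 0

/* Hypothetical exit reasons; only HLT has a handler in this model. */
enum exit_reason_model {
    EXIT_HLT_MODEL,
    EXIT_UMWAIT_MODEL,
    NR_EXIT_REASONS_MODEL
};

/* Hypothetical record of what is reported to userspace. */
struct run_model {
    int      internal_error;
    unsigned reason;
};

typedef int (*exit_handler_model_t)(struct run_model *run);

static int handle_hlt_model(struct run_model *run)
{
    (void)run;
    return RESUME_GUEST_MODEL;
}

static const exit_handler_model_t handlers_model[NR_EXIT_REASONS_MODEL] = {
    [EXIT_HLT_MODEL] = handle_hlt_model,
    /* No entry for EXIT_UMWAIT_MODEL: it takes the generic path below. */
};

/* Every reason without a handler is reported the same way. */
static int handle_exit_model(struct run_model *run, unsigned reason)
{
    if (reason < NR_EXIT_REASONS_MODEL && handlers_model[reason] != NULL)
        return handlers_model[reason](run);

    run->internal_error = 1;
    run->reason = reason;
    return EXIT_TO_USERSPACE_MODEL;
}
```

With one fallback, an unexpected reason can no longer be silently skipped by a reason-specific handler.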
  4. 28 Sep, 2019 1 commit
    • KVM: VMX: Set VMENTER_L1D_FLUSH_NOT_REQUIRED if !X86_BUG_L1TF · 19a36d32
      Committed by Waiman Long
      The l1tf_vmx_mitigation is only set to VMENTER_L1D_FLUSH_NOT_REQUIRED
      when the ARCH_CAPABILITIES MSR indicates that L1D flush is not required.
      However, if the CPU is not affected by L1TF, l1tf_vmx_mitigation will
      still be set to VMENTER_L1D_FLUSH_AUTO. This is certainly not the best
      option for a !X86_BUG_L1TF CPU.
      
      So force l1tf_vmx_mitigation to VMENTER_L1D_FLUSH_NOT_REQUIRED to make it
      more explicit in case users are checking the vmentry_l1d_flush parameter.
      Signed-off-by: Waiman Long <longman@redhat.com>
      [Patch rewritten according to Borislav Petkov's suggestion. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      19a36d32
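The selection logic described above can be sketched as follows. The enum and function are hypothetical simplifications; the kernel has more mitigation states than the three modeled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the mitigation states. */
enum l1d_flush_model {
    L1D_FLUSH_AUTO_MODEL,
    L1D_FLUSH_COND_MODEL,
    L1D_FLUSH_NOT_REQUIRED_MODEL
};

static enum l1d_flush_model
setup_l1d_flush_model(bool cpu_has_l1tf_bug, bool arch_cap_skips_flush,
                      enum l1d_flush_model requested)
{
    /* An unaffected CPU never needs the flush, whatever was requested. */
    if (!cpu_has_l1tf_bug)
        return L1D_FLUSH_NOT_REQUIRED_MODEL;

    /* An affected CPU may still advertise that the flush is unnecessary. */
    if (arch_cap_skips_flush)
        return L1D_FLUSH_NOT_REQUIRED_MODEL;

    return requested;
}
```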
  5. 25 Sep, 2019 3 commits
  6. 24 Sep, 2019 12 commits
  7. 12 Sep, 2019 2 commits
    • KVM: x86: Fix INIT signal handling in various CPU states · 4b9852f4
      Committed by Liran Alon
      Commit cd7764fe ("KVM: x86: latch INITs while in system management mode")
      changed code to latch INIT while vCPU is in SMM and process latched INIT
      when leaving SMM. It left a subtle remark in commit message that similar
      treatment should also be done while vCPU is in VMX non-root-mode.
      
      However, INIT signals should actually be latched in various vCPU states:
      (*) For both Intel and AMD, INIT signals should be latched while vCPU
      is in SMM.
      (*) For Intel, INIT should also be latched while vCPU is in VMX
      operation and later processed when vCPU leaves VMX operation by
      executing VMXOFF.
      (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
      or in guest-mode with intercept defined on INIT signal.
      
      To fix this:
      1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
      can define the various CPU states in which INIT signals should be
      blocked and modify kvm_apic_accept_events() to use it.
      2) Modify vmx_check_nested_events() to check for a pending INIT signal
      while the vCPU is in guest-mode. If so, emulate vmexit on
      EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
      but is currently left as a TODO comment to implement in the future
      because nSVM doesn't yet implement svm_check_nested_events().
      
      Note: Currently, the KVM nVMX implementation doesn't support the VMX
      wait-for-SIPI activity state as specified in MSR_IA32_VMX_MISC bits 6:8
      exposed to guest (See nested_vmx_setup_ctls_msrs()).
      If and when support for this activity state is implemented,
      kvm_check_nested_events() would need to avoid emulating vmexit on
      INIT signal in case activity-state is wait-for-SIPI. In addition,
      kvm_apic_accept_events() would need to be modified to avoid discarding
      SIPI in case VMX activity-state is wait-for-SIPI but instead delay
      SIPI processing to vmx_check_nested_events() that would clear
      pending APIC events and emulate vmexit on SIPI.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4b9852f4
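The per-vendor hook from step 1 can be modeled as a predicate that the common event-acceptance path consults. Everything here is a hypothetical simplification of the three bulleted cases in the message: `vcpu_model`, the two predicates, and `accept_init_model()` are not kernel names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the vCPU state relevant to INIT handling. */
struct vcpu_model {
    bool in_smm;
    bool in_vmx_operation; /* Intel: between VMXON and VMXOFF */
    bool gif_set;          /* AMD: global interrupt flag */
    bool in_guest_mode;    /* AMD: running a nested guest */
    bool init_intercepted; /* AMD: intercept defined on INIT */
    bool init_pending;     /* an INIT signal has been received */
};

typedef bool (*init_blocked_model_t)(const struct vcpu_model *v);

static bool intel_init_blocked_model(const struct vcpu_model *v)
{
    return v->in_smm || v->in_vmx_operation;
}

static bool amd_init_blocked_model(const struct vcpu_model *v)
{
    return v->in_smm || !v->gif_set ||
           (v->in_guest_mode && v->init_intercepted);
}

/* Returns true if INIT was processed; false if none is pending or it stays latched. */
static bool accept_init_model(struct vcpu_model *v, init_blocked_model_t blocked)
{
    if (!v->init_pending)
        return false;
    if (blocked(v))
        return false; /* keep init_pending set: the signal is latched */
    v->init_pending = false;
    return true;
}
```

A latched INIT in this model is simply a pending flag that survives until the blocking state ends, for example after VMXOFF on Intel.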
    • KVM: VMX: Stop the preemption timer during vCPU reset · 95c06540
      Committed by Wanpeng Li
      The hrtimer used to emulate the lapic timer is stopped during
      vcpu reset; the preemption timer should be stopped as well.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      95c06540
  8. 11 Sep, 2019 5 commits
  9. 10 Sep, 2019 1 commit
  10. 28 Aug, 2019 1 commit
  11. 22 Aug, 2019 3 commits
    • KVM: Assert that struct kvm_vcpu is always as offset zero · 12b58f4e
      Committed by Sean Christopherson
      KVM implementations that wrap struct kvm_vcpu with a vendor specific
      struct, e.g. struct vcpu_vmx, must place the vcpu member at offset 0,
      otherwise the usercopy region intended to encompass struct kvm_vcpu_arch
      will instead overlap random chunks of the vendor specific struct.
      E.g. padding a large number of bytes before struct kvm_vcpu triggers
      a usercopy warn when running with CONFIG_HARDENED_USERCOPY=y.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      12b58f4e
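A compile-time assertion of this kind can be sketched in plain C11. The struct names and the helper below are hypothetical; the point is only that an offset computed relative to the inner struct is valid relative to the wrapper exactly when the inner struct sits at offset 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the generic vCPU struct. */
struct vcpu_model {
    int           vcpu_id;
    unsigned char arch[64]; /* region that a usercopy window should cover */
};

/* Hypothetical vendor wrapper: the generic struct must come first. */
struct vendor_vcpu_model {
    struct vcpu_model vcpu;
    unsigned long     vendor_state;
};

/* Fails the build if someone adds a member before `vcpu`. */
_Static_assert(offsetof(struct vendor_vcpu_model, vcpu) == 0,
               "vcpu must be the first member of the vendor struct");

/* Offset of the arch region measured from the start of the wrapper. */
static size_t arch_offset_in_vendor_model(void)
{
    return offsetof(struct vendor_vcpu_model, vcpu) +
           offsetof(struct vcpu_model, arch);
}
```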
    • KVM: x86/mmu: Add explicit access mask for MMIO SPTEs · 4af77151
      Committed by Sean Christopherson
      When shadow paging is enabled, KVM tracks the allowed access type for
      MMIO SPTEs so that it can do a permission check on a MMIO GVA cache hit
      without having to walk the guest's page tables.  The tracking is done
      by retaining the WRITE and USER bits of the access when inserting the
      MMIO SPTE (read access is implicitly allowed), which allows the MMIO
      page fault handler to retrieve and cache the WRITE/USER bits from the
      SPTE.
      
      Unfortunately for EPT, the mask used to retain the WRITE/USER bits is
      hardcoded using the x86 paging versions of the bits.  This funkiness
      happens to work because KVM uses a completely different mask/value for
      MMIO SPTEs when EPT is enabled, and the EPT mask/value just happens to
      overlap exactly with the x86 WRITE/USER bits[*].
      
      Explicitly define the access mask for MMIO SPTEs to accurately reflect
      that EPT does not want to incorporate any access bits into the SPTE, and
      so that KVM isn't subtly relying on EPT's WX bits always being set in
      MMIO SPTEs, e.g. attempting to use other bits for experimentation breaks
      horribly.
      
      Note, vcpu_match_mmio_gva() explicitly prevents matching GVA==0, and all
      TDP flows explicitly set mmio_gva to 0, i.e. zeroing vcpu->arch.access for
      EPT has no (known) functional impact.
      
      [*] Using WX to generate EPT misconfigurations (equivalent to reserved
          bit page fault) ensures KVM can employ its MMIO page fault tricks
          even on platforms without reserved address bits.
      
      Fixes: ce88decf ("KVM: MMU: mmio page fault support")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4af77151
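The idea of an explicit access mask can be sketched as a per-configuration pair of "MMIO value" and "access mask". All names are hypothetical, the shadow-paging MMIO value is an arbitrary placeholder bit, and the real SPTE also encodes the GFN and a generation number, which this model omits.

```c
#include <assert.h>
#include <stdint.h>

#define PT_WRITABLE_MODEL (1u << 1) /* x86 paging R/W bit */
#define PT_USER_MODEL     (1u << 2) /* x86 paging U/S bit */
#define EPT_WX_MODEL      0x6ull    /* EPT W (bit 1) | X (bit 2) */

/* Placeholder marker bit for shadow paging; the value is an assumption. */
#define SHADOW_MMIO_VALUE_MODEL (1ull << 51)

struct mmio_cfg_model {
    uint64_t mmio_value;  /* bits that mark an SPTE as MMIO */
    uint64_t access_mask; /* access bits retained in the SPTE */
};

/* Shadow paging keeps WRITE/USER so a cache hit can be permission-checked. */
static const struct mmio_cfg_model shadow_cfg_model = {
    SHADOW_MMIO_VALUE_MODEL, PT_WRITABLE_MODEL | PT_USER_MODEL
};

/* EPT retains no access bits; WX belongs to the MMIO value itself. */
static const struct mmio_cfg_model ept_cfg_model = { EPT_WX_MODEL, 0 };

static uint64_t make_mmio_spte_model(const struct mmio_cfg_model *cfg,
                                     unsigned access)
{
    return cfg->mmio_value | (access & cfg->access_mask);
}

static unsigned mmio_spte_access_model(const struct mmio_cfg_model *cfg,
                                       uint64_t spte)
{
    return (unsigned)(spte & cfg->access_mask);
}
```

Because the EPT mask is zero in this sketch, the access read back from an EPT MMIO SPTE is always 0, independent of the WX bits in the value.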
    • x86: kvm: svm: propagate errors from skip_emulated_instruction() · f8ea7c60
      Committed by Vitaly Kuznetsov
      On AMD, kvm_x86_ops->skip_emulated_instruction(vcpu) can, in theory,
      fail: in !nrips case we call kvm_emulate_instruction(EMULTYPE_SKIP).
      Currently, we only do printk(KERN_DEBUG) when this happens and this
      is not ideal. Propagate the error up the stack.
      
      On VMX, skip_emulated_instruction() doesn't fail, we have two call
      sites calling it explicitly: handle_exception_nmi() and
      handle_task_switch(), we can just ignore the result.
      
      On SVM, we also have two explicit call sites:
      svm_queue_exception() and it seems we don't need to do anything there as
      we check if RIP was advanced or not. In task_switch_interception(),
      however, we are better off not proceeding to kvm_task_switch() in case
      skip_emulated_instruction() failed.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f8ea7c60
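The propagation pattern described for task_switch_interception() can be sketched as below. The struct, function names, and the "0 on success, negative on failure" convention are assumptions for illustration and do not mirror the kernel's actual return values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state involved. */
struct svm_model {
    bool          nrips;           /* hardware provides the next RIP */
    bool          emulation_works; /* whether software skip would succeed */
    unsigned long rip;
    bool          task_switched;
};

/* Returns 0 on success, -1 if the instruction could not be skipped. */
static int skip_emulated_instruction_model(struct svm_model *s, unsigned len)
{
    /* Without next-RIP support the skip relies on emulation, which can fail. */
    if (!s->nrips && !s->emulation_works)
        return -1;
    s->rip += len;
    return 0;
}

/* Propagates a failed skip instead of proceeding to the task switch. */
static int task_switch_interception_model(struct svm_model *s, unsigned len)
{
    int err = skip_emulated_instruction_model(s, len);

    if (err)
        return err;
    s->task_switched = true;
    return 0;
}
```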
  12. 05 Aug, 2019 1 commit
    • KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Committed by Wanpeng Li
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five-year-old bug is exposed. When running the ebizzy benchmark in three
      80-vCPU VMs on one 80-pCPU Skylake server, many rcu_sched stall warnings
      are splatted in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true until finish_swait() is called in
      kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
      by the kvm_vcpu_on_spin() loop. This greatly increases the probability
      that the condition kvm_arch_vcpu_runnable(vcpu) is checked and evaluates
      to true. When APICv is enabled, the yield-candidate vCPU's VMCS RVI field
      then leaks (via vmx_sync_pir_to_irr()) into the current VMCS of the vCPU
      that is spinning on a taken lock.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      17e433b5
  13. 22 Jul, 2019 1 commit
    • KVM: X86: Dynamically allocate user_fpu · d9a710e5
      Committed by Wanpeng Li
      After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
      for user), struct kvm_vcpu is 19456 bytes on my server. PAGE_ALLOC_COSTLY_ORDER (3)
      is the order at which allocations are deemed costly to service. In serverless
      scenarios, one host can service hundreds or thousands of firecracker/kata-container
      instances; however, new instances fail to launch once memory is too
      fragmented to allocate the kvm_vcpu struct on the host. This was observed in some
      cloud provider product environments.
      
      This patch dynamically allocates user_fpu; kvm_vcpu is now 15168 bytes on my
      Skylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d9a710e5
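The two sizes quoted in the message map to different page-allocation orders, which a small calculation makes concrete. This sketch assumes 4 KiB pages and a physically contiguous allocation rounded up to a power-of-two number of pages; `alloc_order_model()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_MODEL    4096u /* assumption: 4 KiB pages */
#define COSTLY_ORDER_MODEL 3u    /* PAGE_ALLOC_COSTLY_ORDER, as quoted above */

/* Smallest order such that (1 << order) pages can hold `size` bytes. */
static unsigned alloc_order_model(size_t size)
{
    size_t pages = (size + PAGE_SIZE_MODEL - 1) / PAGE_SIZE_MODEL;
    unsigned order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

Under these assumptions, 19456 bytes needs 5 pages and therefore an order-3 block of 8 pages, while 15168 bytes fits in 4 pages, an order-2 block, which is easier to find on a fragmented host.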
  14. 20 Jul, 2019 2 commits
    • KVM: VMX: dump VMCS on failed entry · 3b20e03a
      Committed by Paolo Bonzini
      This is useful for debugging, and is ratelimited nowadays.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3b20e03a
    • KVM: LAPIC: Inject timer interrupt via posted interrupt · 0c5f81da
      Committed by Wanpeng Li
      Dedicated instances are currently disturbed by unnecessary jitter due
      to the emulated lapic timers firing on the same pCPUs where the
      vCPUs reside.  Unlike on ARM, there is no hardware virtual timer for
      guests on Intel, so both programming the timer in the guest and the
      firing of the emulated timer incur vmexits.  This patch tries to avoid
      the vmexit when the emulated timer fires, at least in the dedicated
      instance scenario when nohz_full is enabled.
      
      In that case, the emulated timers can be offloaded to the nearest busy
      housekeeping cpus, since APICv has been available in server processors
      for several years. The guest timer interrupt can then be injected via posted
      interrupts, which are delivered by the housekeeping cpu once the emulated
      timer fires.
      
      The host should be tuned so that vCPUs are placed on isolated physical
      processors, with several surplus pCPUs for busy housekeeping.
      If mwait/hlt/pause vmexits are disabled so that the vCPUs stay in non-root
      mode, a ~3% redis performance benefit can be observed on a Skylake server,
      and the number of external interrupt vmexits drops substantially.  Without
      the patch:
      
                  VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
      EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )
      
      With the patch:
      
                  VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
      EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0c5f81da
  15. 15 Jul, 2019 1 commit
  16. 02 Jul, 2019 1 commit
    • KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST · 95c5c7c7
      Committed by Paolo Bonzini
      This allows userspace to know which MSRs are supported by the hypervisor.
      Unfortunately userspace must resort to tricks for everything except
      MSR_IA32_VMX_VMFUNC (which was just added in the previous patch).
      One possibility is to use the feature control MSR, which is tied to nested
      VMX as well and is present on all KVM versions that support feature MSRs.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      95c5c7c7
  17. 20 Jun, 2019 1 commit