1. 22 Oct, 2019 3 commits
  2. 28 Sep, 2019 1 commit
    • W
      KVM: VMX: Set VMENTER_L1D_FLUSH_NOT_REQUIRED if !X86_BUG_L1TF · 19a36d32
      Waiman Long committed
      The l1tf_vmx_mitigation is only set to VMENTER_L1D_FLUSH_NOT_REQUIRED
      when the ARCH_CAPABILITIES MSR indicates that L1D flush is not required.
      However, if the CPU is not affected by L1TF, l1tf_vmx_mitigation will
      still be set to VMENTER_L1D_FLUSH_AUTO. This is certainly not the best
      option for a !X86_BUG_L1TF CPU.
      
      So force l1tf_vmx_mitigation to VMENTER_L1D_FLUSH_NOT_REQUIRED to make it
      more explicit in case users are checking the vmentry_l1d_flush parameter.
      Signed-off-by: Waiman Long <longman@redhat.com>
      [Patch rewritten according to Borislav Petkov's suggestion. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      19a36d32
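The decision described in this commit can be sketched as a small standalone function. This is only an illustration, not the kernel code: the enum is a reduced stand-in for the kernel's mitigation states, and `pick_l1d_flush_state()` and its two boolean parameters are hypothetical names that model the X86_BUG_L1TF check and the ARCH_CAPABILITIES hint.

```c
#include <stdbool.h>

/* Reduced, hypothetical stand-in for the kernel's mitigation states. */
enum vmx_l1d_flush_state {
    VMENTER_L1D_FLUSH_AUTO,
    VMENTER_L1D_FLUSH_NOT_REQUIRED,
};

/*
 * cpu_has_l1tf_bug models "the CPU is affected by L1TF" (X86_BUG_L1TF).
 * msr_says_no_flush models the ARCH_CAPABILITIES MSR indicating that an
 * L1D flush is not required. Both parameters are assumptions of this sketch.
 */
static enum vmx_l1d_flush_state
pick_l1d_flush_state(bool cpu_has_l1tf_bug, bool msr_says_no_flush)
{
    /* An unaffected CPU never needs the flush: report that explicitly. */
    if (!cpu_has_l1tf_bug)
        return VMENTER_L1D_FLUSH_NOT_REQUIRED;

    /* An affected CPU may still advertise that the flush is unnecessary. */
    if (msr_says_no_flush)
        return VMENTER_L1D_FLUSH_NOT_REQUIRED;

    return VMENTER_L1D_FLUSH_AUTO;
}
```

With the first check in place, a user reading the vmentry_l1d_flush parameter on an unaffected CPU would see "not required" rather than "auto".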
  3. 25 Sep, 2019 3 commits
  4. 24 Sep, 2019 12 commits
  5. 12 Sep, 2019 2 commits
    • L
      KVM: x86: Fix INIT signal handling in various CPU states · 4b9852f4
      Liran Alon committed
      Commit cd7764fe ("KVM: x86: latch INITs while in system management mode")
      changed code to latch INIT while vCPU is in SMM and process latched INIT
      when leaving SMM. It left a subtle remark in commit message that similar
      treatment should also be done while vCPU is in VMX non-root-mode.
      
      However, INIT signals should actually be latched in various vCPU states:
      (*) For both Intel and AMD, INIT signals should be latched while vCPU
      is in SMM.
      (*) For Intel, INIT should also be latched while vCPU is in VMX
      operation and later processed when vCPU leaves VMX operation by
      executing VMXOFF.
      (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
      or in guest-mode with intercept defined on INIT signal.
      
      To fix this:
      1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
      can define the various CPU states in which INIT signals should be
      blocked and modify kvm_apic_accept_events() to use it.
      2) Modify vmx_check_nested_events() to check for pending INIT signal
      while the vCPU is in guest-mode. If so, emulate vmexit on
      EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
      but is currently left as a TODO comment to implement in the future
      because nSVM doesn't yet implement svm_check_nested_events().
      
      Note: Currently the KVM nVMX implementation doesn't support the VMX
      wait-for-SIPI activity state as specified in MSR_IA32_VMX_MISC bits 6:8
      exposed to guest (See nested_vmx_setup_ctls_msrs()).
      If and when support for this activity state is implemented,
      kvm_check_nested_events() would need to avoid emulating vmexit on
      INIT signal in case activity-state is wait-for-SIPI. In addition,
      kvm_apic_accept_events() would need to be modified to avoid discarding
      SIPI in case VMX activity-state is wait-for-SIPI but instead delay
      SIPI processing to vmx_check_nested_events() that would clear
      pending APIC events and emulate vmexit on SIPI.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4b9852f4
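The per-vendor hook from step 1 can be pictured with the following sketch. It follows the three bullet points of the commit message rather than the actual kernel sources: `struct vcpu_state`, its fields, and all function names here are hypothetical, and the real hook hangs off kvm_x86_ops.

```c
#include <stdbool.h>

/* Hypothetical, flattened view of the vCPU state that matters for INIT. */
struct vcpu_state {
    bool in_smm;            /* vCPU is in system management mode */
    bool in_vmx_operation;  /* Intel: VMXON executed, VMXOFF not yet */
    bool gif_set;           /* AMD: global interrupt flag */
    bool guest_mode;        /* AMD: running a nested guest */
    bool init_intercepted;  /* AMD: INIT intercept configured */
};

/* Shape of a vendor hook in the spirit of apic_init_signal_blocked(). */
typedef bool (*init_blocked_fn)(const struct vcpu_state *);

/* Intel: latch INIT in SMM and for the whole of VMX operation. */
static bool intel_init_blocked(const struct vcpu_state *v)
{
    return v->in_smm || v->in_vmx_operation;
}

/* AMD: latch INIT in SMM, with GIF=0, or in guest mode with INIT intercepted. */
static bool amd_init_blocked(const struct vcpu_state *v)
{
    return v->in_smm || !v->gif_set ||
           (v->guest_mode && v->init_intercepted);
}

/* Generic caller: a pending INIT stays latched while the vendor says so. */
static bool should_latch_init(init_blocked_fn hook,
                              const struct vcpu_state *v, bool init_pending)
{
    return init_pending && hook(v);
}
```

Keeping the vendor-specific conditions behind one function pointer means the generic event-processing code needs no knowledge of VMX or SVM state.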
    • W
      KVM: VMX: Stop the preemption timer during vCPU reset · 95c06540
      Wanpeng Li committed
      The hrtimer which is used to emulate the lapic timer is stopped during
      vcpu reset; the preemption timer should be stopped as well.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      95c06540
  6. 11 Sep, 2019 5 commits
  7. 10 Sep, 2019 1 commit
  8. 28 Aug, 2019 1 commit
  9. 22 Aug, 2019 3 commits
    • S
      KVM: Assert that struct kvm_vcpu is always as offset zero · 12b58f4e
      Sean Christopherson committed
      KVM implementations that wrap struct kvm_vcpu with a vendor specific
      struct, e.g. struct vcpu_vmx, must place the vcpu member at offset 0,
      otherwise the usercopy region intended to encompass struct kvm_vcpu_arch
      will instead overlap random chunks of the vendor specific struct.
      E.g. padding a large number of bytes before struct kvm_vcpu triggers
      a usercopy warn when running with CONFIG_HARDENED_USERCOPY=y.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      12b58f4e
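A compile-time check of this kind can be written in plain C11 as follows. The structs are reduced, hypothetical stand-ins (the real struct kvm_vcpu and struct vcpu_vmx are far larger), and `to_vendor()` is an illustrative helper, not a kernel function.

```c
#include <stddef.h>

/* Reduced stand-in for the generic vCPU structure. */
struct kvm_vcpu {
    int cpu;
    int vcpu_id;
};

/* Hypothetical vendor wrapper; the generic struct must come first. */
struct vcpu_vendor {
    struct kvm_vcpu vcpu;
    unsigned long vendor_state[4];
};

/* Fails the build if someone adds a member in front of 'vcpu'. */
_Static_assert(offsetof(struct vcpu_vendor, vcpu) == 0,
               "struct kvm_vcpu must be at offset 0 of the vendor struct");

/*
 * With the member at offset 0, a pointer to the generic struct is also a
 * valid pointer to the enclosing vendor struct.
 */
static struct vcpu_vendor *to_vendor(struct kvm_vcpu *vcpu)
{
    return (struct vcpu_vendor *)vcpu;
}
```

Moving `vendor_state` above `vcpu` in this sketch makes the `_Static_assert` fire at compile time, which is the point of asserting the layout instead of only documenting it.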
    • S
      KVM: x86/mmu: Add explicit access mask for MMIO SPTEs · 4af77151
      Sean Christopherson committed
      When shadow paging is enabled, KVM tracks the allowed access type for
      MMIO SPTEs so that it can do a permission check on a MMIO GVA cache hit
      without having to walk the guest's page tables.  The tracking is done
      by retaining the WRITE and USER bits of the access when inserting the
      MMIO SPTE (read access is implicitly allowed), which allows the MMIO
      page fault handler to retrieve and cache the WRITE/USER bits from the
      SPTE.
      
      Unfortunately for EPT, the mask used to retain the WRITE/USER bits is
      hardcoded using the x86 paging versions of the bits.  This funkiness
      happens to work because KVM uses a completely different mask/value for
      MMIO SPTEs when EPT is enabled, and the EPT mask/value just happens to
      overlap exactly with the x86 WRITE/USER bits[*].
      
      Explicitly define the access mask for MMIO SPTEs to accurately reflect
      that EPT does not want to incorporate any access bits into the SPTE, and
      so that KVM isn't subtly relying on EPT's WX bits always being set in
      MMIO SPTEs, e.g. attempting to use other bits for experimentation breaks
      horribly.
      
      Note, vcpu_match_mmio_gva() explicitly prevents matching GVA==0, and all
      TDP flows explicitly set mmio_gva to 0, i.e. zeroing vcpu->arch.access for
      EPT has no (known) functional impact.
      
      [*] Using WX to generate EPT misconfigurations (equivalent to reserved
          bit page fault) ensures KVM can employ its MMIO page fault tricks
          even on platforms without reserved address bits.
      
      Fixes: ce88decf ("KVM: MMU: mmio page fault support")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4af77151
    • V
      x86: kvm: svm: propagate errors from skip_emulated_instruction() · f8ea7c60
      Vitaly Kuznetsov committed
      On AMD, kvm_x86_ops->skip_emulated_instruction(vcpu) can, in theory,
      fail: in !nrips case we call kvm_emulate_instruction(EMULTYPE_SKIP).
      Currently, we only do printk(KERN_DEBUG) when this happens and this
      is not ideal. Propagate the error up the stack.
      
      On VMX, skip_emulated_instruction() doesn't fail. We have two call
      sites calling it explicitly, handle_exception_nmi() and
      handle_task_switch(), and can just ignore the result there.
      
      On SVM, we also have two explicit call sites. In svm_queue_exception()
      it seems we don't need to do anything, as we check whether RIP was
      advanced or not. In task_switch_interception(), however, we are better
      off not proceeding to kvm_task_switch() in case
      skip_emulated_instruction() failed.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f8ea7c60
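The control-flow change for the task-switch path can be illustrated with a toy model. Everything below is hypothetical: `struct toy_vcpu`, its fields, and the `toy_` functions are invented for the sketch, and the "1 on success, 0 on failure" return convention is an assumption rather than something stated in the commit message.

```c
#include <stdbool.h>

/* Toy vCPU: just enough state to show the error path. */
struct toy_vcpu {
    bool has_nrips;        /* hardware reports the next RIP */
    bool emulation_ok;     /* would the software skip succeed? */
    unsigned long rip;
    unsigned int insn_len;
    bool task_switched;
};

/* Returns 1 if RIP was advanced, 0 if the instruction could not be skipped. */
static int toy_skip_emulated_instruction(struct toy_vcpu *v)
{
    /* Without nrips the skip relies on emulation, which can fail. */
    if (!v->has_nrips && !v->emulation_ok)
        return 0;

    v->rip += v->insn_len;
    return 1;
}

/* Propagate the failure instead of logging it and carrying on. */
static int toy_task_switch_interception(struct toy_vcpu *v)
{
    if (!toy_skip_emulated_instruction(v))
        return 0;           /* do not proceed to the task switch */

    v->task_switched = true;
    return 1;
}
```

The caller sees the failure and the task switch never runs against a RIP that was not advanced.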
  10. 05 Aug, 2019 1 commit
    • W
      KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Wanpeng Li committed
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five-year-old bug is exposed. Running the ebizzy benchmark in three 80-vCPU VMs
      on one 80-pCPU Skylake server produces a lot of rcu_sched stall warnings
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true until finish_swait() is called in
      kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
      by the kvm_vcpu_on_spin() loop. This greatly increases the probability
      that the condition kvm_arch_vcpu_runnable(vcpu) is checked and can be
      true. When APICv is enabled, the yield-candidate vCPU's VMCS RVI field
      then leaks (via vmx_sync_pir_to_irr()) into the current VMCS of the
      vCPU that is spinning on a taken lock.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      17e433b5
  11. 22 Jul, 2019 1 commit
    • W
      KVM: X86: Dynamically allocate user_fpu · d9a710e5
      Wanpeng Li committed
      After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
      for user), struct kvm_vcpu is 19456 bytes on my server. PAGE_ALLOC_COSTLY_ORDER (3)
      is the order at which allocations are deemed costly to service. In a serverless
      scenario, one host can service hundreds or thousands of firecracker/kata-container
      instances. However, a new instance will fail to launch once memory is too
      fragmented to allocate the kvm_vcpu struct on the host; this was observed in some
      cloud provider production environments.
      
      This patch dynamically allocates user_fpu, kvm_vcpu is 15168 bytes now on my
      Skylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d9a710e5
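The two sizes quoted in this commit fall on different sides of a power-of-two page boundary, which a short calculation makes visible. The sketch assumes 4 KiB pages and a plain page-order rounding; `sketch_get_order()` is a hypothetical helper written for this example, and the order a slab cache actually uses for its backing pages may differ.

```c
#include <stddef.h>

/* Assumption of this sketch: 4 KiB pages, as on typical x86-64 hosts. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Smallest order such that (SKETCH_PAGE_SIZE << order) bytes can hold
 * 'size' bytes, i.e. the number of contiguous pages rounded up to a
 * power of two.
 */
static unsigned int sketch_get_order(size_t size)
{
    unsigned int order = 0;

    while (((size_t)SKETCH_PAGE_SIZE << order) < size)
        order++;
    return order;
}
```

Under these assumptions, 19456 bytes needs an order-3 block (eight contiguous pages), while 15168 bytes fits in an order-2 block (four contiguous pages), which is easier to find on a fragmented host.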
  12. 20 Jul, 2019 2 commits
    • P
      KVM: VMX: dump VMCS on failed entry · 3b20e03a
      Paolo Bonzini committed
      This is useful for debugging, and is ratelimited nowadays.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3b20e03a
    • W
      KVM: LAPIC: Inject timer interrupt via posted interrupt · 0c5f81da
      Wanpeng Li committed
      Dedicated instances are currently disturbed by unnecessary jitter due
      to the emulated lapic timers firing on the same pCPUs where the
      vCPUs reside.  Unlike ARM, Intel has no hardware virtual timer for
      guests, so both programming the timer in the guest and the firing of
      the emulated timer incur vmexits.  This patch tries to avoid the vmexit
      when the emulated timer fires, at least in the dedicated instance
      scenario when nohz_full is enabled.
      
      In that case, the emulated timers can be offloaded to the nearest busy
      housekeeping cpus, since APICv has been available in server processors
      for several years. The guest timer interrupt can then be injected via
      posted interrupts, which are delivered by the housekeeping cpu once the
      emulated timer fires.
      
      The host should be tuned so that vCPUs are placed on isolated physical
      processors, with several surplus pCPUs for busy housekeeping.
      If mwait/hlt/pause vmexits are disabled so that the vCPUs stay in
      non-root mode, a ~3% redis performance benefit can be observed on a
      Skylake server, and the number of external interrupt vmexits drops
      substantially.  Without the patch:
      
                 VM-EXIT  Samples  Samples%   Time%  Min Time   Max Time  Avg time
      EXTERNAL_INTERRUPT    42916    49.43%  39.30%    0.47us   106.09us  0.71us ( +-   1.09% )
      
      With the patch:
      
                 VM-EXIT  Samples  Samples%   Time%  Min Time   Max Time  Avg time
      EXTERNAL_INTERRUPT     6871     9.29%   2.96%    0.44us    57.88us  0.72us ( +-   4.02% )
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0c5f81da
  13. 15 Jul, 2019 1 commit
  14. 02 Jul, 2019 1 commit
    • P
      KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST · 95c5c7c7
      Paolo Bonzini committed
      This allows userspace to know which MSRs are supported by the hypervisor.
      Unfortunately userspace must resort to tricks for everything except
      MSR_IA32_VMX_VMFUNC (which was just added in the previous patch).
      One possibility is to use the feature control MSR, which is tied to nested
      VMX as well and is present on all KVM versions that support feature MSRs.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      95c5c7c7
  15. 20 Jun, 2019 1 commit
  16. 19 Jun, 2019 1 commit
  17. 18 Jun, 2019 1 commit
    • S
      KVM: VMX: Leave preemption timer running when it's disabled · 804939ea
      Sean Christopherson committed
      VMWRITEs to the major VMCS controls, pin controls included, are
      deceptively expensive.  CPUs with VMCS caching (Westmere and later) also
      optimize away consistency checks on VM-Entry, i.e. skip consistency
      checks if the relevant fields have not changed since the last successful
      VM-Entry (of the cached VMCS).  Because uops are a precious commodity,
      uCode's dirty VMCS field tracking isn't as precise as software would
      prefer.  Notably, writing any of the major VMCS fields effectively marks
      the entire VMCS dirty, i.e. causes the next VM-Entry to perform all
      consistency checks, which consumes several hundred cycles.
      
      As it pertains to KVM, toggling PIN_BASED_VMX_PREEMPTION_TIMER more than
      doubles the latency of the next VM-Entry (and again when/if the flag is
      toggled back).  In a non-nested scenario, running a "standard" guest
      with the preemption timer enabled, toggling the timer flag is uncommon
      but not rare, e.g. roughly 1 in 10 entries.  Disabling the preemption
      timer can change these numbers due to its use for "immediate exits",
      even when explicitly disabled by userspace.
      
      Nested virtualization in particular is painful, as the timer flag is set
      for the majority of VM-Enters, but prepare_vmcs02() initializes vmcs02's
      pin controls to *clear* the flag since the timer's final state isn't
      known until vmx_vcpu_run().  I.e. the majority of nested VM-Enters end
      up unnecessarily writing pin controls *twice*.
      
      Rather than toggle the timer flag in pin controls, set the timer value
      itself to the largest allowed value to put it into a "soft disabled"
      state, and ignore any spurious preemption timer exits.
      
      Sadly, the timer is a 32-bit value and so theoretically it can fire
      before the heat death of the universe, i.e. spurious exits are possible.
      But because KVM does *not* save the timer value on VM-Exit and because
      the timer runs at a slower rate than the TSC, the maximum timer value
      is still sufficiently large for KVM's purposes.  E.g. on a modern CPU
      with a timer that runs at 1/32 the frequency of a 2.4 GHz constant-rate
      TSC, the timer will fire after ~55 seconds of *uninterrupted* guest
      execution.  In other words, spurious VM-Exits are effectively only
      possible if the host is completely tickless on the logical CPU, the
      guest is not using the preemption timer, and the guest is not generating
      VM-Exits for any other reason.
      
      To be safe from bad/weird hardware, disable the preemption timer if its
      maximum delay is less than ten seconds.  Ten seconds is mostly arbitrary
      and was selected in no small part because it's a nice round number.
      For simplicity and paranoia, fall back to __kvm_request_immediate_exit()
      if the preemption timer is disabled by KVM or userspace.  Previously
      KVM continued to use the preemption timer to force immediate exits even
      when the timer was disabled by userspace.  Now that KVM leaves the timer
      running instead of truly disabling it, allow userspace to kill it
      entirely in the unlikely event the timer (or KVM) malfunctions.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      804939ea
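The timing argument in this commit comes down to one division, sketched below. The function names and the `rate_shift` parameter (log2 of the TSC-to-timer divider, so 5 for a 1/32 rate) are hypothetical; the ten-second threshold is the one stated in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Whole seconds until a 32-bit down-counter loaded with its largest value
 * expires, when it ticks at (tsc_hz >> rate_shift).
 */
static uint64_t max_timer_delay_seconds(uint64_t tsc_hz, unsigned int rate_shift)
{
    uint64_t timer_hz = tsc_hz >> rate_shift;

    if (timer_hz == 0)
        return UINT64_MAX;  /* a timer that never ticks never expires */
    return (uint64_t)UINT32_MAX / timer_hz;
}

/* "Soft disable" is only considered safe if the maximum delay is >= 10 s. */
static bool soft_disable_is_safe(uint64_t tsc_hz, unsigned int rate_shift)
{
    return max_timer_delay_seconds(tsc_hz, rate_shift) >= 10;
}
```

For a 2.4 GHz TSC and a 1/32 rate the timer ticks at 75 MHz, and the integer division gives 57 seconds, in the same range as the roughly 55 seconds quoted above. A hypothetical timer running at the full TSC rate would expire in under two seconds and fail the ten-second check.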