1. 26 Apr 2021, 2 commits
    • KVM: X86: Fix failure to boost kernel lock holder candidate in SEV-ES guests · b86bb11e
      Authored by Wanpeng Li
      Commit f1c6366e ("KVM: SVM: Add required changes to support intercepts under
      SEV-ES") prevents the hypervisor from accessing guest register state when
      the guest is running under SEV-ES. The initial value of
      vcpu->arch.guest_state_protected is false, and after that commit it is no
      longer updated in the preemption notifiers, which means the kernel spinlock
      holder is always skipped when choosing a candidate to boost. Fix this by
      always treating a preempted vCPU as being in guest kernel mode; a false
      positive is better than skipping the boost entirely.
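      A minimal sketch of the shape of the fix; the assumption that the check
      lands in kvm_arch_vcpu_in_kernel(), the helper the boost path consults,
      is mine:

         bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
         {
                 /*
                  * Guest register state is unreadable under SEV-ES, so
                  * conservatively report kernel mode; a false positive only
                  * costs an unnecessary boost.
                  */
                 if (vcpu->arch.guest_state_protected)
                         return true;

                 return vcpu->arch.preempted_in_kernel;
         }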
      
      Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1619080459-30032-1-git-send-email-wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Properly handle APF vs disabled LAPIC situation · 2f15d027
      Authored by Vitaly Kuznetsov
      An async PF 'page ready' event may arrive while the LAPIC is (temporarily)
      disabled. In particular, Sebastien reports that when a Linux kernel is
      booted directly by Cloud Hypervisor, the LAPIC is 'software disabled' when
      the APF mechanism is initialized. On initialization, KVM tries to inject a
      'wakeup all' event and puts the corresponding token into the slot. It
      fails, however, to inject an interrupt (kvm_apic_set_irq() ->
      __apic_accept_irq() -> !apic_enabled()), so the guest is never notified
      and the whole APF mechanism gets stuck. The same issue is likely to occur
      if the guest temporarily disables the LAPIC and a previously unavailable
      page becomes available.
      
      Do two things to resolve the issue:
      - Avoid dequeuing 'page ready' events from the APF queue when the LAPIC
        is disabled (a sketch follows the list).
      - Trigger an attempt to deliver pending 'page ready' events when the
        LAPIC becomes enabled (SPIV or MSR_IA32_APICBASE).
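      A sketch of the first item; the helper names kvm_pv_async_pf_enabled(),
      kvm_apic_enabled() and apf_pageready_slot_free() are assumptions for
      illustration, not verified against the tree:

         bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
         {
                 if (!kvm_pv_async_pf_enabled(vcpu))
                         return true;

                 /*
                  * Keep 'page ready' events queued while the LAPIC cannot
                  * deliver the notification interrupt.
                  */
                 return kvm_apic_enabled(vcpu) && apf_pageready_slot_free(vcpu);
         }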
      Reported-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210422092948.568327-1-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 22 Apr 2021, 2 commits
    • KVM: Boost vCPU candidate in user mode which is delivering interrupt · 52acd22f
      Authored by Wanpeng Li
      Both a lock-holder vCPU and a halted IPI receiver are candidates for a
      boost. However, the PLE handler was originally designed to deal with the
      lock holder preemption problem: Intel PLE occurs when the spinlock waiter
      is in kernel mode. That assumption doesn't hold for IPI receivers, which
      can be in either kernel or user mode, so a vCPU candidate in user mode is
      not boosted even when it should respond to an IPI. Some benchmarks such
      as pbzip2 and swaptions do the TLB shootdown in kernel mode while running
      in user mode most of the time. This can lead to a long run of continuous
      PLE events, because the IPI sender causes PLE events repeatedly until the
      receiver is scheduled, while the receiver is never a candidate for a
      boost.
      
      This patch boosts a vCPU candidate in user mode to which an interrupt is
      being delivered. We observe the speed of pbzip2 improving by 10% in a
      96-vCPU VM in an over-subscription scenario (the host machine is a
      2-socket, 48-core, 96-HT Intel CLX box). There is no performance
      regression for other benchmarks such as Unixbench spawn (which mostly
      contends a read/write lock in kernel mode) or ebizzy (which mostly
      contends a read/write semaphore and does TLB shootdowns in kernel mode).
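      One plausible shape of the change, sketched against the candidate loop
      in kvm_vcpu_on_spin(); the hook name kvm_arch_dy_has_pending_interrupt()
      is an assumption:

         /*
          * Skip a preempted user-mode vCPU only if it also has no pending
          * interrupt to handle; an IPI receiver in user mode is now a valid
          * boost candidate.
          */
         if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
             !kvm_arch_dy_has_pending_interrupt(vcpu) &&
             !kvm_arch_vcpu_in_kernel(vcpu))
                 continue;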
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1618542490-14756-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Support KVM VMs sharing SEV context · 54526d1f
      Authored by Nathan Tempelman
      Add a capability for userspace to mirror SEV encryption context from
      one VM to another. On our side, this is intended to support a
      Migration Helper vCPU, but it can also be used generically to support
      other in-guest workloads scheduled by the host. The intention is for
      the primary guest and the mirror to have nearly identical memslots.
      
      The primary benefits of this are that:
      1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
      can't accidentally clobber each other.
      2) The VMs can have different memory views, which is necessary for post-copy
      migration (the migration vCPUs on the target need to read and write to
      pages, when the primary guest would VMEXIT).
      
      This does not change the threat model for AMD SEV. Any memory involved
      is still owned by the primary guest and its initial state is still
      attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
      to circumvent SEV, it could achieve the same effect by simply attaching
      a vCPU to the primary VM.
      
      This patch deliberately leaves userspace in charge of the memslots for the
      mirror, as it already has the power to mess with them in the primary guest.
      
      This patch does not support SEV-ES (much less SNP), as it does not
      handle handing off attested VMSAs to the mirror.
      
      For additional context, we need a Migration Helper because SEV PSP
      migration is far too slow for our live migration on its own. Using
      an in-guest migrator lets us speed this up significantly.
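      From userspace, the flow would look roughly like this. A hedged sketch:
      KVM_CAP_VM_COPY_ENC_CONTEXT_FROM is the capability name used by this
      series, and the two VM file descriptors are illustrative:

         #include <sys/ioctl.h>
         #include <linux/kvm.h>

         /*
          * Share the SEV encryption context of an existing primary VM with a
          * freshly created mirror VM.  Memslots for the mirror stay entirely
          * under userspace control.
          */
         static int mirror_sev_context(int mirror_vm_fd, int primary_vm_fd)
         {
                 struct kvm_enable_cap cap = {
                         .cap = KVM_CAP_VM_COPY_ENC_CONTEXT_FROM,
                         .args = { (__u64)primary_vm_fd },
                 };

                 return ioctl(mirror_vm_fd, KVM_ENABLE_CAP, &cap);
         }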
      Signed-off-by: Nathan Tempelman <natet@google.com>
      Message-Id: <20210408223214.2582277-1-natet@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 20 Apr 2021, 4 commits
  4. 17 Apr 2021, 2 commits
  5. 01 Apr 2021, 3 commits
    • KVM: x86: Prevent 'hv_clock->system_time' from going negative in kvm_guest_time_update() · 77fcbe82
      Authored by Vitaly Kuznetsov
      When guest time is reset with KVM_SET_CLOCK(0), it is possible for
      'hv_clock->system_time' to become a small negative number. This happens
      because the KVM_SET_CLOCK handler sets 'kvm->arch.kvmclock_offset' based
      on get_kvmclock_ns(kvm), but when KVM_REQ_CLOCK_UPDATE is handled,
      kvm_guest_time_update() does (in the masterclock-in-use case):
      
      hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;
      
      'master_kernel_ns' represents the last time the masterclock was updated,
      which can precede the KVM_SET_CLOCK() call. Normally this is not a
      problem, as the difference is very small, e.g. I'm observing
      hv_clock.system_time = -70 ns. The issue comes from the fact that
      'hv_clock.system_time' is stored as unsigned, so 'system_time / 100' in
      compute_tsc_page_parameters() becomes a very large number.
      
      Use 'master_kernel_ns' instead of get_kvmclock_ns() when the masterclock
      is in use, and get_kvmclock_base_ns() when it is not, to prevent
      'system_time' from going negative.
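      A sketch of the resulting KVM_SET_CLOCK logic (simplified, locking
      elided; 'ka' is the kvm->arch shorthand used in the formula above):

         u64 now_raw_ns;

         /*
          * Use the same time base that kvm_guest_time_update() will use,
          * so the computed offset can never push system_time negative.
          */
         if (kvm->arch.use_master_clock)
                 now_raw_ns = ka->master_kernel_ns;
         else
                 now_raw_ns = get_kvmclock_base_ns();
         ka->kvmclock_offset = user_ns.clock - now_raw_ns;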
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210331124130.337992-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: disable interrupts while pvclock_gtod_sync_lock is taken · a83829f5
      Authored by Paolo Bonzini
      pvclock_gtod_sync_lock can be taken with interrupts disabled if the
      preempt notifier calls get_kvmclock_ns to update the Xen
      runstate information:
      
         spin_lock include/linux/spinlock.h:354 [inline]
         get_kvmclock_ns+0x25/0x390 arch/x86/kvm/x86.c:2587
         kvm_xen_update_runstate+0x3d/0x2c0 arch/x86/kvm/xen.c:69
         kvm_xen_update_runstate_guest+0x74/0x320 arch/x86/kvm/xen.c:100
         kvm_xen_runstate_set_preempted arch/x86/kvm/xen.h:96 [inline]
         kvm_arch_vcpu_put+0x2d8/0x5a0 arch/x86/kvm/x86.c:4062
      
      So change the users of the spinlock to spin_lock_irqsave and
      spin_unlock_irqrestore.
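      The resulting pattern at each user of the lock, sketched:

         unsigned long flags;

         /*
          * Safe both from process context and from the preempt notifier,
          * which runs with interrupts already disabled.
          */
         spin_lock_irqsave(&kvm->arch.pvclock_gtod_sync_lock, flags);
         /* ... read or update the kvmclock/masterclock state ... */
         spin_unlock_irqrestore(&kvm->arch.pvclock_gtod_sync_lock, flags);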
      
      Reported-by: syzbot+b282b65c2c68492df769@syzkaller.appspotmail.com
      Fixes: 30b5c851 ("KVM: x86/xen: Add support for vCPU runstate information")
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: reduce pvclock_gtod_sync_lock critical sections · c2c647f9
      Authored by Paolo Bonzini
      There is no need to include changes to vcpu->requests into
      the pvclock_gtod_sync_lock critical section.  The changes to
      the shared data structures (in pvclock_update_vm_gtod_copy)
      already occur under the lock.
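      In outline (a heavily simplified sketch of the reshaped code):

         spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
         pvclock_update_vm_gtod_copy(kvm);       /* shared data: needs the lock */
         spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);

         /* Setting vcpu->requests bits is safe outside the lock. */
         kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);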
      
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 31 Mar 2021, 1 commit
  7. 19 Mar 2021, 2 commits
    • KVM: X86: Fix missing local pCPU when executing wbinvd on all dirty pCPUs · c2162e13
      Authored by Wanpeng Li
      In order to deal with noncoherent DMA, we should execute wbinvd on
      all dirty pCPUs when a guest wbinvd exits, to maintain data consistency.
      smp_call_function_many() does not execute the provided function on the
      local core, so replace it with on_each_cpu_mask().
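      The change itself is essentially a one-liner, sketched in context
      (wbinvd_ipi() being the pre-existing callback that executes wbinvd):

         /*
          * Unlike smp_call_function_many(), on_each_cpu_mask() also invokes
          * the callback on the local CPU when it is part of the mask.
          */
         on_each_cpu_mask(vcpu->arch.wbinvd_dirty_mask, wbinvd_ipi, NULL, 1);
         cpumask_clear(vcpu->arch.wbinvd_dirty_mask);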
      Reported-by: Nadav Amit <namit@vmware.com>
      Cc: Nadav Amit <namit@vmware.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1615517151-7465-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish · b318e8de
      Authored by Sean Christopherson
      Fix a plethora of issues with MSR filtering by installing the resulting
      filter as an atomic bundle instead of updating the live filter one range
      at a time.  The KVM_X86_SET_MSR_FILTER ioctl() isn't truly atomic, as
      the hardware MSR bitmaps won't be updated until the next VM-Enter, but
      the relevant software struct is atomically updated, which is what KVM
      really needs.
      
      Similar to the approach used for modifying memslots, make arch.msr_filter
      an SRCU-protected pointer, do all the work configuring the new filter
      outside of kvm->lock, and then acquire kvm->lock only when the new filter
      has been vetted and created.  That way vCPU readers either see the old
      filter or the new filter in their entirety, not some half-baked state.
      
      Yuan Yao pointed out[*] a use-after-free in kvm_msr_allowed() due to a
      TOCTOU bug, but that's just the tip of the iceberg...
      
        - Nothing is __rcu annotated, making it nigh impossible to audit the
          code for correctness.
        - kvm_add_msr_filter() has an unpaired smp_wmb().  Violation of kernel
          coding style aside, the lack of an smp_rmb() anywhere casts all code
          into doubt.
        - kvm_clear_msr_filter() has a double-free TOCTOU bug, as it grabs
          count before taking the lock.
        - kvm_clear_msr_filter() also has a memory leak due to the same TOCTOU bug.
      
      The entire approach of updating the live filter is also flawed.  While
      installing a new filter is inherently racy if vCPUs are running, fixing
      the above issues also makes it trivial to ensure certain behavior is
      deterministic, e.g. KVM can provide deterministic behavior for MSRs with
      identical settings in the old and new filters.  An atomic update of the
      filter also prevents KVM from getting into a half-baked state, e.g. if
      installing a filter fails, the existing approach would leave the filter
      in a half-baked state, having already committed whatever bits of the
      filter were already processed.
      
      [*] https://lkml.kernel.org/r/20210312083157.25403-1-yaoyuan0329os@gmail.com
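      The atomic-publish pattern described above, as a hedged sketch; the
      kvm_free_msr_filter() helper name is illustrative:

         static void install_msr_filter(struct kvm *kvm,
                                        struct kvm_x86_msr_filter *new)
         {
                 struct kvm_x86_msr_filter *old;

                 /* 'new' was fully built and vetted outside kvm->lock. */
                 mutex_lock(&kvm->lock);
                 old = rcu_replace_pointer(kvm->arch.msr_filter, new,
                                           mutex_is_locked(&kvm->lock));
                 mutex_unlock(&kvm->lock);

                 /*
                  * Readers walk the filter under kvm->srcu: wait them out
                  * before freeing the old copy.
                  */
                 synchronize_srcu(&kvm->srcu);
                 kvm_free_msr_filter(old);
         }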
      
      Fixes: 1a155254 ("KVM: x86: Introduce MSR filtering")
      Cc: stable@vger.kernel.org
      Cc: Alexander Graf <graf@amazon.com>
      Reported-by: Yuan Yao <yaoyuan0329os@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210316184436.2544875-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 17 Mar 2021, 1 commit
    • KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs · e880c6ea
      Authored by Vitaly Kuznetsov
      When a KVM_REQ_MASTERCLOCK_UPDATE request is issued (e.g. after migration)
      we need to make sure no vCPU sees stale values in PV clock structures, and
      thus all vCPUs are kicked with KVM_REQ_CLOCK_UPDATE. The Hyper-V TSC page
      clocksource, however, is global, and kvm_guest_time_update() only updates
      it on vCPU0. This is not entirely correct: nothing blocks another vCPU
      from entering the guest before the update finishes on vCPU0, and it can
      read stale values from the page.
      
      Invalidate TSC page in kvm_gen_update_masterclock() to switch all vCPUs
      to using MSR based clocksource (HV_X64_MSR_TIME_REF_COUNT).
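      A sketch of the invalidation, simplified and with field/helper names
      assumed from this series; a zero sequence number is what marks the TSC
      reference page invalid for the guest:

         void kvm_hv_invalidate_tsc_page(struct kvm *kvm)
         {
                 struct kvm_hv *hv = to_kvm_hv(kvm);
                 u64 gfn = hv->hv_tsc_page >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;

                 /*
                  * Sequence 0 invalidates the page; guests fall back to
                  * HV_X64_MSR_TIME_REF_COUNT until it is re-populated.
                  */
                 hv->tsc_ref.tsc_sequence = 0;
                 kvm_write_guest(kvm, gfn_to_gpa(gfn), &hv->tsc_ref,
                                 sizeof(hv->tsc_ref.tsc_sequence));
         }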
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210316143736.964151-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 15 Mar 2021, 8 commits
  10. 06 Mar 2021, 1 commit
  11. 03 Mar 2021, 2 commits
  12. 26 Feb 2021, 1 commit
  13. 19 Feb 2021, 5 commits
  14. 17 Feb 2021, 1 commit
  15. 09 Feb 2021, 5 commits
    • KVM: x86: hyper-v: Allocate Hyper-V context lazily · fc08b628
      Authored by Vitaly Kuznetsov
      Hyper-V context is only needed for guests which use Hyper-V emulation in
      KVM (e.g. Windows/Hyper-V guests), so we don't actually need to allocate
      it in kvm_arch_vcpu_create(); we can postpone the allocation until
      Hyper-V specific MSRs are accessed or SynIC is enabled.
      
      Once allocated, let's keep the context alive for the lifetime of the vCPU
      as an attempt to free it would require additional synchronization with
      other vCPUs and normally it is not supposed to happen.
      
      Note, Hyper-V style hypercall enablement is done by writing to
      HV_X64_MSR_GUEST_OS_ID so we don't need to worry about allocating Hyper-V
      context from kvm_hv_hypercall().
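      A sketch of the lazy path; the helper name kvm_hv_vcpu_init() is an
      assumption:

         static int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
         {
                 struct kvm_vcpu_hv *hv_vcpu;

                 /* Already allocated: keep it for the vCPU's lifetime. */
                 if (vcpu->arch.hyperv)
                         return 0;

                 hv_vcpu = kzalloc(sizeof(*hv_vcpu), GFP_KERNEL_ACCOUNT);
                 if (!hv_vcpu)
                         return -ENOMEM;

                 vcpu->arch.hyperv = hv_vcpu;
                 return 0;
         }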
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210126134816.1880136-15-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Make Hyper-V emulation enablement conditional · 8f014550
      Authored by Vitaly Kuznetsov
      Hyper-V emulation is enabled in KVM unconditionally. This is bad at least
      from a security standpoint, as it is extra attack surface. Ideally, there
      should be a per-VM capability explicitly enabled by the VMM, but currently
      that is not the case, and we can't mandate one without breaking backwards
      compatibility. We can, however, check guest-visible CPUIDs and only enable
      Hyper-V emulation when the "Hv#1" interface has been exposed in
      HYPERV_CPUID_INTERFACE.
      
      Note, VMMs are free to act in any sequence they like, e.g. they can try
      to set MSRs first and CPUIDs later so we still need to allow the host
      to read/write Hyper-V specific MSRs unconditionally.
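      A sketch of the CPUID gate; 0x31237648 is the little-endian ASCII of
      "Hv#1", and the exact helper/field names are assumptions:

         static void kvm_hv_set_cpuid(struct kvm_vcpu *vcpu)
         {
                 struct kvm_cpuid_entry2 *entry;

                 entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_INTERFACE, 0);
                 /* Enable Hyper-V emulation only if the VMM exposed "Hv#1". */
                 vcpu->arch.hyperv_enabled = entry &&
                                             entry->eax == 0x31237648 /* "Hv#1" */;
         }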
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210126134816.1880136-14-vkuznets@redhat.com>
      [Add selftest vcpu_set_hv_cpuid API to avoid breaking xen_vmcall_test. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Allocate 'struct kvm_vcpu_hv' dynamically · 4592b7ea
      Authored by Vitaly Kuznetsov
      Hyper-V context is only needed for guests which use Hyper-V emulation in
      KVM (e.g. Windows/Hyper-V guests). 'struct kvm_vcpu_hv' is, however, quite
      big: it accounts for more than 1/4 of the total 'struct kvm_vcpu_arch',
      which is itself already quite big. This all looks like a waste.
      
      Allocate 'struct kvm_vcpu_hv' dynamically. This patch does not bring any
      (intentional) functional change as we still allocate the context
      unconditionally but it paves the way to doing that only when needed.
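      The layout change, sketched:

         struct kvm_vcpu_hv;     /* big: more than 1/4 of kvm_vcpu_arch */

         struct kvm_vcpu_arch {
                 /* ... */
                 /* was: struct kvm_vcpu_hv hyperv; (always embedded) */
                 struct kvm_vcpu_hv *hyperv;     /* allocated on demand */
                 /* ... */
         };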
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210126134816.1880136-13-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context · f2bc14b6
      Authored by Vitaly Kuznetsov
      Currently, Hyper-V context is part of 'struct kvm_vcpu_arch' and is always
      available. As a preparation to allocating it dynamically, check that it is
      not NULL at call sites which can normally proceed without it, i.e. where
      the behavior is identical to the situation when Hyper-V emulation is not
      being used by the guest.
      
      When Hyper-V context for a particular vCPU is not allocated, we may still
      need to get 'vp_index' from there. E.g. in a hypothetical situation when
      Hyper-V emulation was enabled on one CPU and wasn't on another, Hyper-V
      style send-IPI hypercall may still be used. Luckily, vp_index is always
      initialized to kvm_vcpu_get_idx() and can only be changed when Hyper-V
      context is present. Introduce kvm_hv_get_vpindex() helper for
      simplification.
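      The helper, roughly (per the description above):

         static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu)
         {
                 struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);

                 /*
                  * vp_index defaults to the vCPU index and can only change
                  * once a Hyper-V context exists.
                  */
                 return hv_vcpu ? hv_vcpu->vp_index : kvm_vcpu_get_idx(vcpu);
         }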
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210126134816.1880136-12-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Always use to_hv_vcpu() accessor to get to 'struct kvm_vcpu_hv' · 9ff5e030
      Authored by Vitaly Kuznetsov
      As a preparation to allocating Hyper-V context dynamically, make it clear
      who the user of said context is.
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210126134816.1880136-11-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>