1. 10 Jul 2020 (1 commit)
  2. 09 Jul 2020 (6 commits)
  3. 23 Jun 2020 (2 commits)
    • KVM: nVMX: Plumb L2 GPA through to PML emulation · 2dbebf7a
      Sean Christopherson authored
      Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all
      intents and purposes is vmx_write_pml_buffer(), instead of having the
      latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS.  If the dirty bit
      update is the result of KVM emulation (rare for L2), then the GPA in the
      VMCS may be stale and/or hold a completely unrelated GPA.
      
      Fixes: c5f983f6 ("nVMX: Implement emulated Page Modification Logging")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2dbebf7a
    • KVM: LAPIC: ensure APIC map is up to date on concurrent update requests · 44d52717
      Paolo Bonzini authored
      The following race can cause lost map update events:
      
               cpu1                            cpu2
      
                                      apic_map_dirty = true
        ------------------------------------------------------------
                                      kvm_recalculate_apic_map:
                                           pass check
                                               mutex_lock(&kvm->arch.apic_map_lock);
                                               if (!kvm->arch.apic_map_dirty)
                                           and in process of updating map
        -------------------------------------------------------------
          other calls to
             apic_map_dirty = true         might be too late for affected cpu
        -------------------------------------------------------------
                                           apic_map_dirty = false
        -------------------------------------------------------------
          kvm_recalculate_apic_map:
          bail out on
            if (!kvm->arch.apic_map_dirty)
      
      To fix it, record the beginning of an update of the APIC map in
      apic_map_dirty.  If another APIC map change switches apic_map_dirty
      back to DIRTY during the update, kvm_recalculate_apic_map should not
      make it CLEAN, and the other caller will go through the slow path
      (a sketch of this three-state flag follows this entry).
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      44d52717
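      A minimal userspace sketch of the three-state flag described above; the
      CLEAN/UPDATE_IN_PROGRESS/DIRTY names follow the commit text, while the
      pthread/stdatomic harness is purely illustrative and not actual KVM code:

          /* Illustrative model of the apic_map_dirty state machine. */
          #include <pthread.h>
          #include <stdatomic.h>

          enum map_state { CLEAN, UPDATE_IN_PROGRESS, DIRTY };

          static _Atomic enum map_state map_dirty = CLEAN;
          static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;

          /* Every path that changes the APIC configuration marks the map dirty. */
          static void mark_map_dirty(void)
          {
                  atomic_store(&map_dirty, DIRTY);
          }

          static void recalculate_map(void)
          {
                  enum map_state prev = DIRTY;

                  /* Fast path: nothing to do if nobody marked the map dirty. */
                  if (atomic_load(&map_dirty) == CLEAN)
                          return;

                  pthread_mutex_lock(&map_lock);
                  /* Record that an update has started: DIRTY -> UPDATE_IN_PROGRESS. */
                  if (!atomic_compare_exchange_strong(&map_dirty, &prev,
                                                      UPDATE_IN_PROGRESS) &&
                      prev == CLEAN) {
                          /* Someone else already rebuilt the map; nothing to do. */
                          pthread_mutex_unlock(&map_lock);
                          return;
                  }

                  /* ... rebuild the map here ... */

                  /*
                   * Only mark the map CLEAN if no concurrent change flipped the
                   * flag back to DIRTY while we were rebuilding; otherwise the
                   * next caller goes through the slow path and rebuilds again.
                   */
                  prev = UPDATE_IN_PROGRESS;
                  atomic_compare_exchange_strong(&map_dirty, &prev, CLEAN);
                  pthread_mutex_unlock(&map_lock);
          }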
  4. 12 Jun 2020 (1 commit)
  5. 09 Jun 2020 (1 commit)
  6. 03 Jun 2020 (1 commit)
    • mm: remove the pgprot argument to __vmalloc · 88dca4ca
      Christoph Hellwig authored
      The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it
      (a before/after call site is sketched after this entry).
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
      Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Wei Liu <wei.liu@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88dca4ca
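      A before/after view of a typical call site, as implied by the commit; the
      allocation itself is just an example:

          #include <linux/vmalloc.h>

          static void *alloc_example(void)
          {
                  /* Before: callers spelled out the page protections. */
                  /* return __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL); */

                  /* After this commit: the pgprot argument is gone, PAGE_KERNEL is implied. */
                  return __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO);
          }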
  7. 01 Jun 2020 (8 commits)
    • x86/kvm/hyper-v: Add support for synthetic debugger interface · f97f5a56
      Jon Doron authored
      Add support for the Hyper-V synthetic debugger (syndbg) interface.
      The syndbg interface uses MSRs to emulate a way to send and receive
      packet data.
      
      The debug transport DLL (kdvm/kdnet) detects whether Hyper-V is enabled
      and, if it supports the synthetic debugger interface, attempts to use it
      instead of trying to initialize a network adapter.
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Jon Doron <arilou@gmail.com>
      Message-Id: <20200529134543.1127440-4-arilou@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f97f5a56
    • KVM: x86/pmu: Support full width counting · 27461da3
      Like Xu authored
      Intel CPUs have a new alternative MSR range (starting from MSR_IA32_PMC0)
      for GP counters that allows writing the full counter width. Enable this
      range from a new capability bit (IA32_PERF_CAPABILITIES.FW_WRITE[bit 13]).
      
      The guest queries CPUID to get the counter width and sign-extends the
      counter values as needed (sketched after this entry). The traditional
      MSRs are always limited to 32 bits, even though the counters are
      internally wider (48 or 57 bits).

      When the new capability is set, use the alternative range, which does
      not have these restrictions. This lowers the overhead of perf stat
      slightly because it needs fewer interrupts to accumulate the counter
      value.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Message-Id: <20200529074347.124619-3-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      27461da3
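      The sign extension mentioned above, sketched for a hypothetical 48-bit
      counter (in practice the guest reads the actual width from CPUID leaf 0xA):

          #include <stdint.h>

          /* Sign-extend a raw 'width'-bit counter value to 64 bits, as a guest
           * does after reading a GP counter through the legacy MSRs. */
          static int64_t sign_extend_counter(uint64_t raw, unsigned int width)
          {
                  unsigned int shift = 64 - width;

                  return (int64_t)(raw << shift) >> shift;
          }

          /* Example: with width == 48, 0x0000ffffffffffff extends to -1. */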
    • KVM: x86: acknowledgment mechanism for async pf page ready notifications · 557a961a
      Vitaly Kuznetsov authored
      If two 'page ready' notifications happen back to back, the second one is
      not delivered, and the only mechanism we currently have is the
      kvm_check_async_pf_completion() check in the vcpu_run() loop. The check
      is only performed on the next vmexit, whenever that happens, and in some
      cases it may take a while. With interrupt based 'page ready' notification
      delivery the situation is even worse: unlike exceptions, interrupts are
      not handled immediately, so we must check whether the slot is empty. This
      is slow and unnecessary. Introduce a dedicated MSR_KVM_ASYNC_PF_ACK MSR
      to communicate that the slot is free and the host should check its
      notification queue (a guest-side sketch follows this entry). Mandate its
      use for interrupt based 'page ready' APF event delivery.

      As kvm_check_async_pf_completion() is going away from vcpu_run() we need
      a way to communicate that the vcpu->async_pf.done queue has transitioned
      from empty to non-empty. Introduce kvm_arch_async_page_present_queued()
      and KVM_REQ_APF_READY to do the job.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200525144125.143875-7-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      557a961a
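      A rough guest-side sketch of the acknowledgment flow described above; the
      MSR index and the write-1-to-ack convention follow the KVM documentation
      of that era, but the handler wiring and the shared-slot access are
      hypothetical simplifications:

          #include <stdint.h>

          #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07   /* illustrative value */

          static inline void wrmsr(uint32_t msr, uint64_t val)
          {
                  asm volatile("wrmsr" : : "c"(msr), "a"((uint32_t)val),
                               "d"((uint32_t)(val >> 32)) : "memory");
          }

          /* Hypothetical 'page ready' interrupt handler in the guest. */
          void async_pf_ready_irq(volatile uint32_t *apf_token)
          {
                  uint32_t token = *apf_token;    /* token from the shared APF slot    */
                  *apf_token = 0;                 /* clear the slot for the next event */

                  /* ... wake whatever was waiting on 'token' ... */
                  (void)token;

                  /* Tell the host the slot is free so it re-checks its queue. */
                  wrmsr(MSR_KVM_ASYNC_PF_ACK, 1);
          }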
    • KVM: x86: interrupt based APF 'page ready' event delivery · 2635b5c4
      Vitaly Kuznetsov authored
      Concerns were expressed around APF delivery via a synthetic #PF
      exception, as in some cases such delivery may collide with a real page
      fault. For 'page ready' notifications we can easily switch to using an
      interrupt instead. Introduce the new MSR_KVM_ASYNC_PF_INT mechanism and
      deprecate the legacy one.

      One notable difference between the two mechanisms is that an interrupt
      may not be handled immediately, so before delivering the next event
      (regardless of its type) we must be sure the guest has read and cleared
      the previous event in the slot.

      While at it, get rid of the 'type 1/type 2' names for APF events in the
      documentation, as they are causing confusion. Use 'page not present'
      and 'page ready' everywhere instead.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200525144125.143875-6-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2635b5c4
    • KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present() · 7c0ade6c
      Vitaly Kuznetsov authored
      An innocent reader of the following x86 KVM code:
      
      bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
      {
              if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
                      return true;
      ...
      
      may get very confused: if the APF mechanism is not enabled, why do we
      report that we 'can inject async page present'? In reality, upon
      injection kvm_arch_async_page_present() will check the same condition
      again and, in case APF is disabled, will just drop the item. This is
      fine, as a guest which deliberately disabled APF doesn't expect to get
      any APF notifications.
      
      Rename kvm_arch_can_inject_async_page_present() to
      kvm_arch_can_dequeue_async_page_present() to make it clear what we are
      checking: if the item can be dequeued (meaning either injected or just
      dropped).
      
      On s390 kvm_arch_can_inject_async_page_present() always returns 'true' so
      the rename doesn't matter much.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200525144125.143875-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c0ade6c
    • KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info · 68fd66f1
      Vitaly Kuznetsov authored
      Currently, the APF mechanism relies on #PF abuse, where the token is
      passed through CR2. If we switch to using interrupts to deliver
      'page ready' notifications we need a different way to pass the data.
      Extend the existing 'struct kvm_vcpu_pv_apf_data' with token information
      for 'page ready' notifications (the resulting layout is sketched after
      this entry).

      While at it, rename 'reason' to 'flags'. This doesn't change the
      semantics, as we only have reasons '1' and '2' and these can be treated
      as bit flags, but KVM_PV_REASON_PAGE_READY is going away with interrupt
      based delivery, which makes the name 'reason' misleading.

      The newly introduced apf_put_user_ready() temporarily puts both flags
      and token information; this will be changed to put only the token when
      we switch to interrupt based notifications.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200525144125.143875-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      68fd66f1
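      The resulting shared structure, approximately as described; field order
      and padding are reconstructed from the commit text, so check
      arch/x86/include/uapi/asm/kvm_para.h for the authoritative layout:

          #include <linux/types.h>

          /* Guest/host shared async page fault data after this change (approximate). */
          struct kvm_vcpu_pv_apf_data {
                  __u32 flags;     /* was 'reason'; 'page not present' bit flags */
                  __u32 token;     /* token for 'page ready' notifications       */
                  __u8  pad[56];
                  __u32 enabled;
          };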
    • KVM: nSVM: remove HF_HIF_MASK · 08245e6d
      Paolo Bonzini authored
      The L1 flags can be found in the save area of svm->nested.hsave; fish
      them from there so that there is one fewer thing to migrate.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      08245e6d
    • KVM: nSVM: remove HF_VINTR_MASK · e9fd761a
      Paolo Bonzini authored
      Now that the int_ctl field is stored in svm->nested.ctl.int_ctl, we can
      use it instead of vcpu->arch.hflags to check whether L2 is running
      in V_INTR_MASKING mode.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e9fd761a
  8. 28 May 2020 (3 commits)
    • KVM: nSVM: inject exceptions via svm_check_nested_events · 7c86663b
      Paolo Bonzini authored
      This allows exceptions injected by the emulator to be properly delivered
      as vmexits.  The code also becomes simpler, because we can just let all
      L0-intercepted exceptions go through the usual path.  In particular, our
      emulation of the VMX #DB exit qualification is very much simplified,
      because the vmexit injection path can use kvm_deliver_exception_payload
      to update DR6.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c86663b
    • KVM: x86: enable event window in inject_pending_event · c9d40913
      Paolo Bonzini authored
      In case an interrupt arrives after nested.check_events but before the
      call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt
      window even if the interrupt is actually going to be a vmexit.  This is
      useless rather than harmful, but it really complicates reasoning about
      SVM's handling of the VINTR intercept.  We'd like to never bother with
      the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in
      that case there is no interrupt window and we can just exit the nested
      guest whenever we want.
      
      This patch moves the opening of the interrupt window inside
      inject_pending_event.  This consolidates the check for pending
      interrupt/NMI/SMI in one place, and makes KVM's usage of immediate
      exits more consistent, extending it beyond just nested virtualization.
      
      There are two functional changes here.  They only affect corner cases,
      but overall they simplify inject_pending_event.

      - re-injection of still-pending events will also use req_immediate_exit
      instead of interrupt-window intercepts.  This should have no impact on
      performance on Intel, since it simply replaces an interrupt-window or
      NMI-window exit with a preemption-timer exit.  On AMD, which has no
      equivalent of the preemption timer, it may incur some overhead, but an
      actual effect on performance should only be visible in pathological
      cases.
      
      - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true
      if an interrupt, NMI or SMI is blocked by nested_run_pending.  This
      makes sense because entering the VM will allow it to make progress
      and deliver the event.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c9d40913
    • KVM: x86: Take an unsigned 32-bit int for has_emulated_msr()'s index · cb97c2d6
      Sean Christopherson authored
      Take a u32 for the index in has_emulated_msr() to match hardware, which
      treats MSR indices as unsigned 32-bit values.  Functionally, taking a
      signed int doesn't cause problems with the current code base, but could
      theoretically cause problems with 32-bit KVM, e.g. if the index were
      checked via a less-than statement, which would evaluate incorrectly for
      MSR indices with bit 31 set (see the example after this entry).
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200218234012.7110-3-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cb97c2d6
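      The signed-comparison pitfall the commit guards against, shown in
      isolation (a hypothetical range check, not actual KVM code):

          #include <stdint.h>
          #include <stdio.h>

          int main(void)
          {
                  uint32_t msr = 0xc0000080;      /* an MSR index with bit 31 set */

                  /* As a signed 32-bit value the index is negative, so a
                   * less-than range check passes when it should not. */
                  int32_t  signed_idx   = (int32_t)msr;
                  uint32_t unsigned_idx = msr;

                  printf("signed:   idx < 0x2000 -> %d\n", signed_idx < 0x2000);   /* 1, wrong */
                  printf("unsigned: idx < 0x2000 -> %d\n", unsigned_idx < 0x2000); /* 0, right */
                  return 0;
          }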
  9. 20 May 2020 (1 commit)
  10. 16 May 2020 (7 commits)
  11. 14 May 2020 (7 commits)
    • KVM: x86/mmu: Capture TDP level when updating CPUID · e93fd3b3
      Sean Christopherson authored
      Snapshot the TDP level now that it's invariant (SVM) or dependent only
      on host capabilities and guest CPUID (VMX).  This avoids having to call
      kvm_x86_ops.get_tdp_level() when initializing a TDP MMU and/or
      calculating the page role, and thus avoids the associated retpoline.
      
      Drop the WARN in vmx_get_tdp_level() as updating CPUID while L2 is
      active is legal, if dodgy.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200502043234.12481-11-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e93fd3b3
    • KVM: VMX: Add proper cache tracking for CR0 · bd31fe49
      Sean Christopherson authored
      Move CR0 caching into the standard register caching mechanism in order
      to take advantage of the availability checks provided by regs_avail.
      This avoids multiple VMREADs in the (uncommon) case where kvm_read_cr0()
      is called multiple times in a single VM-Exit, and more importantly
      eliminates a kvm_x86_ops hook, saves a retpoline on SVM when reading
      CR0, and squashes the confusing naming discrepancy of "cache_reg" vs.
      "decache_cr0_guest_bits".
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200502043234.12481-8-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bd31fe49
    • KVM: VMX: Add proper cache tracking for CR4 · f98c1e77
      Sean Christopherson authored
      Move CR4 caching into the standard register caching mechanism in order
      to take advantage of the availability checks provided by regs_avail.
      This avoids multiple VMREADs and retpolines (when configured) during
      nested VMX transitions as kvm_read_cr4_bits() is invoked multiple times
      on each transition, e.g. when stuffing CR0 and CR3.
      
      As an added bonus, this eliminates a kvm_x86_ops hook, saves a retpoline
      on SVM when reading CR4, and squashes the confusing naming discrepancy
      of "cache_reg" vs. "decache_cr4_guest_bits".  The shared caching pattern
      is sketched after this entry.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200502043234.12481-7-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f98c1e77
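      Both the CR0 and CR4 entries above rely on the same caching pattern; a
      simplified model follows, where the regs_avail name comes from the commit
      text and everything else (the enum, the hw_read_reg() helper) is
      hypothetical:

          #include <stdint.h>

          enum cached_reg { REG_CR0, REG_CR4, NR_CACHED_REGS };

          struct vcpu_cache {
                  uint64_t regs[NR_CACHED_REGS];
                  uint32_t regs_avail;    /* bitmap: which cached values are valid */
          };

          /* Stand-in for the expensive read (VMREAD on VMX, VMCB field on SVM). */
          uint64_t hw_read_reg(enum cached_reg reg);

          static uint64_t read_cached_reg(struct vcpu_cache *c, enum cached_reg reg)
          {
                  if (!(c->regs_avail & (1u << reg))) {
                          c->regs[reg] = hw_read_reg(reg);   /* slow path, at most once */
                          c->regs_avail |= 1u << reg;
                  }
                  return c->regs[reg];                       /* fast path afterwards */
          }

          /* The bitmap is cleared on every VM-exit so stale values are never used. */
          static void on_vmexit(struct vcpu_cache *c)
          {
                  c->regs_avail = 0;
          }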
    • KVM: x86: Save L1 TSC offset in 'struct kvm_vcpu_arch' · 56ba77a4
      Sean Christopherson authored
      Save L1's TSC offset in 'struct kvm_vcpu_arch' and drop the kvm_x86_ops
      hook read_l1_tsc_offset().  This avoids a retpoline (when configured)
      when reading L1's effective TSC, which is done at least once on every
      VM-Exit.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200502043234.12481-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      56ba77a4
    • KVM: x86: Replace late check_nested_events() hack with more precise fix · c300ab9f
      Paolo Bonzini authored
      Add an argument to interrupt_allowed and nmi_allowed to check whether
      interrupt injection is blocked.  Use the hook to handle the case where
      an interrupt arrives between check_nested_events() and the injection
      logic.  Drop the retry of check_nested_events() that hack-a-fixed the
      same condition.
      
      Blocking injection is also a bit of a hack, e.g. KVM should do exiting
      and non-exiting interrupt processing in a single pass, but it's a more
      precise hack.  The old comment is also misleading, e.g. KVM_REQ_EVENT is
      purely an optimization, setting it on every run loop (which KVM doesn't
      do) should not affect functionality, only performance.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200423022550.15113-13-sean.j.christopherson@intel.com>
      [Extend to SVM, add SMI and NMI.  Even though NMI and SMI cannot come
       asynchronously right now, making the fix generic is easy and removes a
       special case. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c300ab9f
    • KVM: x86: Make return for {interrupt,nmi,smi}_allowed() a bool instead of int · 88c604b6
      Sean Christopherson authored
      Return an actual bool for the kvm_x86_ops {interrupt,nmi,smi}_allowed()
      hooks to better reflect the return semantics, and to avoid creating an
      even bigger mess when the related VMX code is refactored in upcoming
      patches.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200423022550.15113-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      88c604b6
    • KVM: nVMX: Open a window for pending nested VMX preemption timer · d2060bd4
      Sean Christopherson authored
      Add a kvm_x86_ops hook to detect a nested pending "hypervisor timer" and
      use it to effectively open a window for servicing the expired timer.
      Like pending SMIs on VMX, opening a window simply means requesting an
      immediate exit.
      
      This fixes a bug where an expired VMX preemption timer (for L2) will be
      delayed and/or lost if a pending exception is injected into L2.  The
      pending exception is rightly prioritized by vmx_check_nested_events()
      and injected into L2, with the preemption timer left pending.  Because
      no window opened, L2 is free to run uninterrupted.
      
      Fixes: f4124500 ("KVM: nVMX: Fully emulate preemption timer")
      Reported-by: Jim Mattson <jmattson@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Peter Shier <pshier@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200423022550.15113-3-sean.j.christopherson@intel.com>
      [Check it in kvm_vcpu_has_events too, to ensure that the preemption
       timer is serviced promptly even if the vCPU is halted and L1 is not
       intercepting HLT. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d2060bd4
  12. 13 May 2020 (1 commit)
    • KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c · 37486135
      Babu Moger authored
      Though rdpkru and wrpkru are contingent upon CR4.PKE, the PKRU resource
      isn't. It can be read with XSAVE and written with XRSTOR. So, if we
      don't set the guest PKRU value here (kvm_load_guest_xsave_state), the
      guest can read the host value.

      In the case of kvm_load_host_xsave_state, a guest with CR4.PKE clear
      could potentially use XRSTOR to change the host PKRU value.

      While at it, move the pkru state save/restore to common code and the
      host_pkru field to kvm_vcpu_arch.  This will let SVM support protection
      keys (the swap logic is sketched after this entry).
      
      Cc: stable@vger.kernel.org
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Message-Id: <158932794619.44260.14508381096663848853.stgit@naples-babu.amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      37486135
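      A hedged sketch of the swap described above; the RDPKRU/WRPKRU wrappers
      mirror the x86 instructions, while the conditions KVM actually checks
      (PKU support, CR4.PKE or XCR0.PKRU) are simplified away:

          #include <stdint.h>

          static inline uint32_t rdpkru(void)
          {
                  uint32_t pkru;

                  /* RDPKRU: reads PKRU into EAX, requires ECX = 0, clobbers EDX. */
                  asm volatile(".byte 0x0f,0x01,0xee" : "=a"(pkru) : "c"(0) : "edx");
                  return pkru;
          }

          static inline void wrpkru(uint32_t pkru)
          {
                  /* WRPKRU: writes EAX to PKRU, requires ECX = EDX = 0. */
                  asm volatile(".byte 0x0f,0x01,0xef" : : "a"(pkru), "c"(0), "d"(0));
          }

          /* Around guest entry: remember the host PKRU, install the guest's. */
          void load_guest_pkru(uint32_t *host_pkru, uint32_t guest_pkru)
          {
                  *host_pkru = rdpkru();
                  if (guest_pkru != *host_pkru)
                          wrpkru(guest_pkru);
          }

          /* Around guest exit: remember the guest PKRU, restore the host's. */
          void load_host_pkru(uint32_t host_pkru, uint32_t *guest_pkru)
          {
                  *guest_pkru = rdpkru();
                  if (*guest_pkru != host_pkru)
                          wrpkru(host_pkru);
          }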
  13. 08 May 2020 (1 commit)
    • KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6 · d67668e9
      Paolo Bonzini authored
      There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the
      different handling of DR6 on intercepted #DB exceptions on Intel and AMD.
      
      On Intel, #DB exceptions transmit the DR6 value via the exit qualification
      field of the VMCS, and the exit qualification only contains the description
      of the precise event that caused a vmexit.
      
      On AMD, instead, the DR6 field of the VMCB is filled in as if the #DB
      exception were to be injected into the guest.  This has two effects when
      guest debugging is in use:
      
      * the guest DR6 is clobbered
      
      * the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather
      than just the last one that happened (the testcase in the next patch covers
      this issue).
      
      This patch fixes both issues by emulating, so to speak, the Intel
      behavior on AMD processors.  The important observation is that (after
      the previous patches) the VMCB value of DR6 is only ever observable from
      the guest if KVM_DEBUGREG_WONT_EXIT is set.  Therefore we can actually
      set vmcb->save.dr6 to any value we want as long as KVM_DEBUGREG_WONT_EXIT
      is clear, which it will be if guest debugging is enabled.
      
      Therefore it is possible to enter the guest with an all-zero DR6,
      reconstruct the #DB payload from the DR6 we get at exit time, and let
      kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6.
      Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT
      is set, but this is harmless.
      
      This may not be the most optimized way to deal with this, but it is
      simple and, being confined within SVM code, it gets rid of the set_dr6
      callback and kvm_update_dr6.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d67668e9