1. 28 Sep, 2020 2 commits
    • S
      KVM: x86: Add intr/vectoring info and error code to kvm_exit tracepoint · 235ba74f
      Sean Christopherson committed
      Extend the kvm_exit tracepoint to align it with kvm_nested_vmexit in
      terms of what information is captured.  On SVM, add interrupt info and
      error code, while on VMX it adds IDT vectoring and error code.  This
      sets the stage for macrofying the kvm_exit tracepoint definition so that
      it can be reused for kvm_nested_vmexit without loss of information.
      
      Opportunistically stuff a zero for VM_EXIT_INTR_INFO if the VM-Enter
      failed, as the field is guaranteed to be invalid.  Note, it'd be
      possible to further filter the interrupt/exception fields based on the
      VM-Exit reason, but the helper is intended only for tracepoints, i.e.
      an extra VMREAD or two is a non-issue, the failed VM-Enter case is just
      low hanging fruit.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923201349.16097-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      235ba74f
    • S
      KVM: x86: Add kvm_x86_ops hook to short circuit emulation · 09e3e2a1
      Sean Christopherson committed
      Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a
      more generic is_emulatable(), and unconditionally call the new function
      in x86_emulate_instruction().
      
      KVM will use the generic hook to support multiple security related
      technologies that prevent emulation in one way or another.  Similar to
      the existing AMD #NPF case where emulation of the current instruction is
      not possible due to lack of information, AMD's SEV-ES and Intel's SGX
      and TDX will introduce scenarios where emulation is impossible due to
      the guest's register state being inaccessible.  And again similar to the
      existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(),
      i.e. outside of the control of vendor-specific code.
      
      While the cause and architecturally visible behavior of the various
      cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean
      resume or complete shutdown, and SEV-ES and TDX "return" an error, the
      impact on the common emulation code is identical: KVM must stop
      emulation immediately and resume the guest.
      
      Query is_emulatable() in handle_ud() as well so that the
      force_emulation_prefix code doesn't incorrectly modify RIP before
      calling emulate_instruction() in the absurdly unlikely scenario that
      KVM encounters forced emulation in conjunction with "do not emulate".
      
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      09e3e2a1
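A minimal sketch of the hook pattern this commit describes, not the kernel's code. The struct names, `vendor_is_emulatable()` and `emulate_instruction_sketch()` are hypothetical; the only part taken from the commit message is that common code queries the vendor hook unconditionally and resumes the guest when emulation is not possible.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU; only what the sketch needs. */
struct vcpu_sketch {
    bool regs_inaccessible;  /* e.g. guest register state hidden from the host */
    int  emulated_insns;     /* instructions the emulator actually handled */
};

/* Hypothetical vendor ops table with a single "may I emulate?" hook. */
struct x86_ops_sketch {
    bool (*is_emulatable)(struct vcpu_sketch *vcpu, const void *insn, size_t len);
};

enum emul_result { EMUL_RESUME_GUEST, EMUL_DONE };

/* Vendor side: refuse emulation when the register state cannot be read. */
static bool vendor_is_emulatable(struct vcpu_sketch *vcpu, const void *insn, size_t len)
{
    (void)insn;
    (void)len;
    return !vcpu->regs_inaccessible;
}

/* Common side: always ask the hook first, and stop immediately on "no". */
static enum emul_result emulate_instruction_sketch(const struct x86_ops_sketch *ops,
                                                   struct vcpu_sketch *vcpu,
                                                   const void *insn, size_t len)
{
    if (!ops->is_emulatable(vcpu, insn, len))
        return EMUL_RESUME_GUEST;

    vcpu->emulated_insns++;  /* placeholder for the real decode/execute work */
    return EMUL_DONE;
}
```

Because the check lives in the common entry point, callers outside vendor code (the commit names kvm_mmu_page_fault()) get the same short circuit without knowing which technology forbids emulation.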
  2. 22 Aug, 2020 1 commit
    • W
      KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() · fdfe7cbd
      Will Deacon committed
      The 'flags' field of 'struct mmu_notifier_range' is used to indicate
      whether invalidate_range_{start,end}() are permitted to block. In the
      case of kvm_mmu_notifier_invalidate_range_start(), this field is not
      forwarded on to the architecture-specific implementation of
      kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
      whether or not to block.
      
      Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
      architectures are aware as to whether or not they are permitted to block.
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-2-will@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fdfe7cbd
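A rough illustration of why the backend needs the flags, under stated assumptions: `RANGE_BLOCKABLE_SKETCH`, `struct unmap_stats` and `unmap_hva_range_sketch()` are made up for this example, and the "yield" is only counted rather than performed. The point is that the decision to block can be made only if the flag reaches the architecture code.

```c
#include <stdbool.h>

/* Assumed flag bit for this sketch: set when the caller permits blocking. */
#define RANGE_BLOCKABLE_SKETCH (1u << 0)

struct unmap_stats {
    unsigned long pages;   /* pages visited */
    unsigned long yields;  /* points where the backend chose to yield */
};

/*
 * Hypothetical arch backend: walks [start, end) one 4 KiB page at a time
 * and yields every 16 pages, but only when the flags say blocking is allowed.
 */
static struct unmap_stats unmap_hva_range_sketch(unsigned long start,
                                                 unsigned long end,
                                                 unsigned int flags)
{
    const unsigned long page = 4096, batch = 16;
    bool may_block = (flags & RANGE_BLOCKABLE_SKETCH) != 0;
    struct unmap_stats st = { 0, 0 };

    for (unsigned long hva = start; hva < end; hva += page) {
        st.pages++;
        if (may_block && st.pages % batch == 0)
            st.yields++;  /* a real backend would reschedule here */
    }
    return st;
}
```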
  3. 31 Jul, 2020 3 commits
  4. 11 Jul, 2020 3 commits
    • M
      KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support · 3edd6839
      Mohammed Gamal committed
      This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which
      allows userspace to query if the underlying architecture would
      support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly
      (e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X)
      
      The complications in this patch are due to unexpected (but documented)
      behaviour we see with NPF vmexit handling on AMD processors.  If
      SVM is modified to add guest physical address checks in the NPF
      and guest #PF paths, we see the following error multiple times in
      the 'access' test in kvm-unit-tests:
      
                  test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001
                  Dump mapping: address: 0x123400000000
                  ------L4: 24c3027
                  ------L3: 24c4027
                  ------L2: 24c5021
                  ------L1: 1002000021
      
      This is because the PTE's accessed bit is set by the CPU hardware before
      the NPF vmexit. This is handled completely by hardware and cannot be fixed
      in software.
      
      Therefore, availability of the new capability depends on a boolean variable
      allow_smaller_maxphyaddr which is set individually by VMX and SVM init
      routines. On VMX it's always set to true, on SVM it's only set to true
      when NPT is not enabled.
      
      CC: Tom Lendacky <thomas.lendacky@amd.com>
      CC: Babu Moger <babu.moger@amd.com>
      Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
      Message-Id: <20200710154811.418214-10-mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3edd6839
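The gating described in the last paragraph of this commit can be sketched as follows. This is an illustration only: the `*_sketch` function names are hypothetical, and only the variable name allow_smaller_maxphyaddr and the VMX/SVM rules come from the commit message.

```c
#include <stdbool.h>

/* Set once by the vendor module's init routine. */
static bool allow_smaller_maxphyaddr;

/* VMX: always able to support GUEST_MAXPHYADDR < HOST_MAXPHYADDR. */
static void vmx_init_sketch(void)
{
    allow_smaller_maxphyaddr = true;
}

/* SVM: only when NPT is off, because the accessed bit is set before the NPF exit. */
static void svm_init_sketch(bool npt_enabled)
{
    allow_smaller_maxphyaddr = !npt_enabled;
}

/* What a capability query would report to userspace: 1 if supported, else 0. */
static int check_smaller_maxphyaddr_sketch(void)
{
    return allow_smaller_maxphyaddr ? 1 : 0;
}
```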
    • P
      KVM: x86: rename update_bp_intercept to update_exception_bitmap · 6986982f
      Paolo Bonzini committed
      We would like to introduce a callback to update the #PF intercept
      when CPUID changes.  Just reuse update_bp_intercept since VMX is
      already using update_exception_bitmap instead of a bespoke function.
      
      While at it, remove an unnecessary assignment in the SVM version,
      which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug)
      and has nothing to do with the exception bitmap.
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6986982f
    • M
      KVM: x86: mmu: Move translate_gpa() to mmu.c · cd313569
      Mohammed Gamal committed
      There is also no point in it being inline, since it is always called
      through function pointers, so drop the inline.
      Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20200710154811.418214-3-mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cd313569
  5. 10 Jul, 2020 4 commits
  6. 09 Jul, 2020 6 commits
  7. 23 Jun, 2020 2 commits
    • S
      KVM: nVMX: Plumb L2 GPA through to PML emulation · 2dbebf7a
      Sean Christopherson committed
      Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all
      intents and purposes is vmx_write_pml_buffer(), instead of having the
      latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS.  If the dirty bit
      update is the result of KVM emulation (rare for L2), then the GPA in the
      VMCS may be stale and/or hold a completely unrelated GPA.
      
      Fixes: c5f983f6 ("nVMX: Implement emulated Page Modification Logging")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2dbebf7a
    • P
      KVM: LAPIC: ensure APIC map is up to date on concurrent update requests · 44d52717
      Paolo Bonzini committed
      The following race can cause lost map update events:
      
               cpu1                            cpu2
      
                                      apic_map_dirty = true
        ------------------------------------------------------------
                                      kvm_recalculate_apic_map:
                                           pass check
                                               mutex_lock(&kvm->arch.apic_map_lock);
                                               if (!kvm->arch.apic_map_dirty)
                                           and in process of updating map
        -------------------------------------------------------------
          other calls to
             apic_map_dirty = true         might be too late for affected cpu
        -------------------------------------------------------------
                                           apic_map_dirty = false
        -------------------------------------------------------------
          kvm_recalculate_apic_map:
          bail out on
            if (!kvm->arch.apic_map_dirty)
      
      To fix it, record the beginning of an update of the APIC map in
      apic_map_dirty.  If another APIC map change switches apic_map_dirty
      back to DIRTY during the update, kvm_recalculate_apic_map should not
      make it CLEAN, and the other caller will go through the slow path.
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      44d52717
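The fix can be sketched in userspace C11 as a three-state flag, assuming the intermediate state is called `MAP_UPDATE_IN_PROGRESS` (the commit message names only DIRTY and CLEAN). The `during_update` callback is a test hook invented for this sketch to simulate a concurrent writer; the mutex that serializes updaters in real code is omitted.

```c
#include <stdatomic.h>
#include <stddef.h>

enum { MAP_CLEAN, MAP_UPDATE_IN_PROGRESS, MAP_DIRTY };

struct apic_map_sketch {
    atomic_int dirty;
    int rebuilds;  /* how many times the map was recomputed */
};

/* Any APIC change that invalidates the map. */
static void mark_dirty(struct apic_map_sketch *m)
{
    atomic_store(&m->dirty, MAP_DIRTY);
}

static void recalculate_map(struct apic_map_sketch *m,
                            void (*during_update)(struct apic_map_sketch *))
{
    int expected = MAP_DIRTY;

    /* Fast path: nothing changed since the last rebuild. */
    if (atomic_load(&m->dirty) == MAP_CLEAN)
        return;

    /* Record the start of the update: DIRTY -> UPDATE_IN_PROGRESS. */
    if (!atomic_compare_exchange_strong(&m->dirty, &expected, MAP_UPDATE_IN_PROGRESS) &&
        expected == MAP_CLEAN)
        return;  /* someone else finished the rebuild in the meantime */

    m->rebuilds++;
    if (during_update)
        during_update(m);  /* simulated concurrent change */

    /* Only go back to CLEAN if nobody re-dirtied the map during the update. */
    expected = MAP_UPDATE_IN_PROGRESS;
    atomic_compare_exchange_strong(&m->dirty, &expected, MAP_CLEAN);
}
```

If a change lands mid-update, the final compare-and-exchange fails and the flag stays DIRTY, so the next caller takes the slow path instead of bailing out on a stale map.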
  8. 12 Jun, 2020 1 commit
  9. 09 Jun, 2020 1 commit
  10. 03 Jun, 2020 1 commit
    • C
      mm: remove the pgprot argument to __vmalloc · 88dca4ca
      Christoph Hellwig committed
      The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
      Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Wei Liu <wei.liu@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88dca4ca
  11. 01 Jun, 2020 8 commits
    • J
      x86/kvm/hyper-v: Add support for synthetic debugger interface · f97f5a56
      Jon Doron committed
      Add support for Hyper-V synthetic debugger (syndbg) interface.
      The syndbg interface uses MSRs to emulate a way to send and receive
      packet data.
      
      The debug transport dll (kdvm/kdnet) will identify if Hyper-V is enabled
      and if it supports the synthetic debugger interface it will attempt to
      use it, instead of trying to initialize a network adapter.
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Jon Doron <arilou@gmail.com>
      Message-Id: <20200529134543.1127440-4-arilou@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f97f5a56
    • L
      KVM: x86/pmu: Support full width counting · 27461da3
      Like Xu committed
      Intel CPUs have a new alternative MSR range (starting from MSR_IA32_PMC0)
      for GP counters that allows writing the full counter width. Enable this
      range from a new capability bit (IA32_PERF_CAPABILITIES.FW_WRITE[bit 13]).
      
      The guest would query CPUID to get the counter width, and sign-extend
      the counter values as needed. The traditional MSRs are always limited to
      32 bits, even though the counter internally is larger (48 or 57 bits).
      
      When the new capability is set, use the alternative range, which does not
      have these restrictions. This lowers the overhead of perf stat slightly
      because it has to take fewer interrupts to accumulate the counter value.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Message-Id: <20200529074347.124619-3-like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      27461da3
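The difference between the two MSR ranges can be illustrated with a small model. It assumes that a write through the traditional range takes only the low 32 bits and sign-extends them to the counter width, while a full-width write keeps every bit up to the counter width; the function names are hypothetical and this is not KVM's PMU code.

```c
#include <stdint.h>

/* Mask covering a counter of the given width (e.g. 48 bits). */
static uint64_t counter_mask(unsigned width_bits)
{
    return width_bits >= 64 ? UINT64_MAX : ((UINT64_C(1) << width_bits) - 1);
}

/*
 * Assumed legacy behaviour: only bits 31:0 of the written value are used,
 * sign-extended to the counter width. The unsigned-to-signed cast relies on
 * the usual two's-complement conversion.
 */
static uint64_t legacy_write(uint64_t value, unsigned width_bits)
{
    int64_t extended = (int64_t)(int32_t)(uint32_t)value;
    return (uint64_t)extended & counter_mask(width_bits);
}

/* Full-width write: the value is taken as-is, truncated to the counter width. */
static uint64_t full_width_write(uint64_t value, unsigned width_bits)
{
    return value & counter_mask(width_bits);
}
```

With a 48-bit counter, writing 0x123456789ABC through the legacy path in this model leaves only 0x56789ABC in the counter, whereas the full-width path preserves the whole value.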
    • V
      KVM: x86: acknowledgment mechanism for async pf page ready notifications · 557a961a
      Vitaly Kuznetsov committed
      If two page ready notifications happen back to back the second one is not
      delivered and the only mechanism we currently have is
      kvm_check_async_pf_completion() check in vcpu_run() loop. The check will
      only be performed with the next vmexit when it happens and in some cases
      it may take a while. With interrupt based page ready notification delivery
      the situation is even worse: unlike exceptions, interrupts are not handled
      immediately so we must check if the slot is empty. This is slow and
      unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate
      the fact that the slot is free and host should check its notification
      queue. Mandate using it for interrupt based 'page ready' APF event
      delivery.
      
      As kvm_check_async_pf_completion() is going away from vcpu_run() we need
      a way to communicate the fact that vcpu->async_pf.done queue has
      transitioned from empty to non-empty state. Introduce
      kvm_arch_async_page_present_queued() and KVM_REQ_APF_READY to do the job.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200525144125.143875-7-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      557a961a
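A toy model of the slot-and-acknowledge handshake, under stated assumptions: the struct layouts, the fixed-size queue (no overflow handling) and all function names are invented for the sketch, a token of 0 is taken to mean "slot free", and the call to `host_try_deliver()` from the guest handler stands in for the write to MSR_KVM_ASYNC_PF_ACK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared slot the guest reads; token 0 is assumed to mean "free". */
struct apf_slot_sketch {
    uint32_t token;
};

struct apf_host_sketch {
    struct apf_slot_sketch *slot;
    uint32_t queue[8];       /* pending 'page ready' tokens, no overflow check */
    unsigned head, tail;
    unsigned interrupts_sent;
};

/* Host: a page became ready, remember its token. */
static void host_queue_ready(struct apf_host_sketch *h, uint32_t token)
{
    h->queue[h->tail++ % 8] = token;
}

/* Host: deliver at most one token, and only if the guest has emptied the slot. */
static bool host_try_deliver(struct apf_host_sketch *h)
{
    if (h->slot->token != 0 || h->head == h->tail)
        return false;
    h->slot->token = h->queue[h->head++ % 8];
    h->interrupts_sent++;
    return true;
}

/* Guest interrupt handler: read the token, free the slot, then acknowledge. */
static uint32_t guest_handle_ready(struct apf_host_sketch *h)
{
    uint32_t token = h->slot->token;
    h->slot->token = 0;
    host_try_deliver(h);  /* models the ACK prompting the host to check its queue */
    return token;
}
```

The acknowledgment lets the host deliver the second back-to-back notification as soon as the slot is free, rather than waiting for some unrelated vmexit.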
    • V
      KVM: x86: interrupt based APF 'page ready' event delivery · 2635b5c4
      Vitaly Kuznetsov committed
      Concerns were expressed around APF delivery via synthetic #PF exception as
      in some cases such delivery may collide with real page fault. For 'page
      ready' notifications we can easily switch to using an interrupt instead.
      Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the legacy one.
      
      One notable difference between the two mechanisms is that interrupt may not
      get handled immediately so whenever we would like to deliver next event
      (regardless of its type) we must be sure the guest had read and cleared
      previous event in the slot.
      
      While at it, get rid of 'type 1/type 2' names for APF events in the
      documentation as they are causing confusion. Use 'page not present'
      and 'page ready' everywhere instead.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200525144125.143875-6-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2635b5c4
    • V
      KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present() · 7c0ade6c
      Vitaly Kuznetsov committed
      An innocent reader of the following x86 KVM code:
      
      bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
      {
              if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
                      return true;
      ...
      
      may get very confused: if APF mechanism is not enabled, why do we report
      that we 'can inject async page present'? In reality, upon injection
      kvm_arch_async_page_present() will check the same condition again and,
      in case APF is disabled, will just drop the item. This is fine as the
      guest which deliberately disabled APF doesn't expect to get any APF
      notifications.
      
      Rename kvm_arch_can_inject_async_page_present() to
      kvm_arch_can_dequeue_async_page_present() to make it clear what we are
      checking: if the item can be dequeued (meaning either injected or just
      dropped).
      
      On s390 kvm_arch_can_inject_async_page_present() always returns 'true' so
      the rename doesn't matter much.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200525144125.143875-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c0ade6c
    • V
      KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info · 68fd66f1
      Vitaly Kuznetsov committed
      Currently, APF mechanism relies on the #PF abuse where the token is being
      passed through CR2. If we switch to using interrupts to deliver page-ready
      notifications we need a different way to pass the data. Extend the existing
      'struct kvm_vcpu_pv_apf_data' with token information for page-ready
      notifications.
      
      While at it, rename 'reason' to 'flags'. This doesn't change the semantics
      as we only have reasons '1' and '2' and these can be treated as bit flags
      but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery
      making 'reason' name misleading.
      
      The newly introduced apf_put_user_ready() temporarily puts both flags and
      token information; this will be changed to put the token only when we
      switch to interrupt based notifications.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200525144125.143875-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      68fd66f1
    • P
      KVM: nSVM: remove HF_HIF_MASK · 08245e6d
      Paolo Bonzini committed
      The L1 flags can be found in the save area of svm->nested.hsave, fish
      it from there so that there is one fewer thing to migrate.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      08245e6d
    • P
      KVM: nSVM: remove HF_VINTR_MASK · e9fd761a
      Paolo Bonzini committed
      Now that the int_ctl field is stored in svm->nested.ctl.int_ctl, we can
      use it instead of vcpu->arch.hflags to check whether L2 is running
      in V_INTR_MASKING mode.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e9fd761a
  12. 28 May, 2020 3 commits
    • P
      KVM: nSVM: inject exceptions via svm_check_nested_events · 7c86663b
      Paolo Bonzini committed
      This allows exceptions injected by the emulator to be properly delivered
      as vmexits.  The code also becomes simpler, because we can just let all
      L0-intercepted exceptions go through the usual path.  In particular, our
      emulation of the VMX #DB exit qualification is very much simplified,
      because the vmexit injection path can use kvm_deliver_exception_payload
      to update DR6.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c86663b
    • P
      KVM: x86: enable event window in inject_pending_event · c9d40913
      Paolo Bonzini committed
      In case an interrupt arrives after nested.check_events but before the
      call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt
      window even if the interrupt is actually going to be a vmexit.  This is
      useless rather than harmful, but it really complicates reasoning about
      SVM's handling of the VINTR intercept.  We'd like to never bother with
      the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in
      that case there is no interrupt window and we can just exit the nested
      guest whenever we want.
      
      This patch moves the opening of the interrupt window inside
      inject_pending_event.  This consolidates the check for pending
      interrupt/NMI/SMI in one place, and makes KVM's usage of immediate
      exits more consistent, extending it beyond just nested virtualization.
      
      There are two functional changes here.  They only affect corner cases,
      but overall they simplify inject_pending_event.
      
      - re-injection of still-pending events will also use req_immediate_exit
      instead of using interrupt-window intercepts.  This should have no impact
      on performance on Intel since it simply replaces an interrupt-window
      or NMI-window exit with a preemption-timer exit.  On AMD, which has no
      equivalent of the preemption timer, it may incur some overhead but an
      actual effect on performance should only be visible in pathological cases.
      
      - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true
      if an interrupt, NMI or SMI is blocked by nested_run_pending.  This
      makes sense because entering the VM will allow it to make progress
      and deliver the event.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c9d40913
    • S
      KVM: x86: Take an unsigned 32-bit int for has_emulated_msr()'s index · cb97c2d6
      Sean Christopherson committed
      Take a u32 for the index in has_emulated_msr() to match hardware, which
      treats MSR indices as unsigned 32-bit values.  Functionally, taking a
      signed int doesn't cause problems with the current code base, but could
      theoretically cause problems with 32-bit KVM, e.g. if the index were
      checked via a less-than statement, which would evaluate incorrectly for
      MSR indices with bit 31 set.
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200218234012.7110-3-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cb97c2d6
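The theoretical problem mentioned above can be reproduced in a few lines. `MSR_LIMIT_SKETCH` and both helper names are hypothetical; the example only shows how a less-than check behaves for an index with bit 31 set, depending on the parameter's signedness.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound used only for this illustration. */
#define MSR_LIMIT_SKETCH 0x1000u

/* Signed parameter: an index with bit 31 set becomes negative and passes the check. */
static bool below_limit_signed(int index)
{
    return index < (int)MSR_LIMIT_SKETCH;
}

/* Unsigned 32-bit parameter: the same index compares as a large value, as on hardware. */
static bool below_limit_unsigned(uint32_t index)
{
    return index < MSR_LIMIT_SKETCH;
}
```

For an index such as 0xC0000080, the signed variant wrongly reports the index as below the limit, while the u32 variant does not.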
  13. 20 May, 2020 1 commit
  14. 16 May, 2020 4 commits