1. 28 1月, 2020 2 次提交
    • P
      KVM: X86: Drop x86_set_memory_region() · 6a3c623b
      Peter Xu 提交于
      The helper x86_set_memory_region() is only used in vmx_set_tss_addr()
      and kvm_arch_destroy_vm().  Push the lock upper in both cases.  With
      that, drop x86_set_memory_region().
      
      This prepares to allow __x86_set_memory_region() to return a HVA
      mapped, because the HVA will need to be protected by the lock too even
      after __x86_set_memory_region() returns.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6a3c623b
    • J
      kvm/svm: PKU not currently supported · a47970ed
      John Allen 提交于
      Current SVM implementation does not have support for handling PKU. Guests
      running on a host with future AMD cpus that support the feature will read
      garbage from the PKRU register and will hit segmentation faults on boot as
      memory is getting marked as protected that should not be. Ensure that cpuid
      from SVM does not advertise the feature.
      Signed-off-by: NJohn Allen <john.allen@amd.com>
      Cc: stable@vger.kernel.org
      Fixes: 0556cbdc ("x86/pkeys: Don't check if PKRU is zero before writing it")
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a47970ed
  2. 24 1月, 2020 2 次提交
  3. 21 1月, 2020 2 次提交
    • M
      KVM: Fix some writing mistakes · 311497e0
      Miaohe Lin 提交于
      Fix some writing mistakes in the comments.
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      311497e0
    • W
      KVM: VMX: FIXED+PHYSICAL mode single target IPI fastpath · 1e9e2622
      Wanpeng Li 提交于
      ICR and TSCDEADLINE MSRs write cause the main MSRs write vmexits in our
      product observation, multicast IPIs are not as common as unicast IPI like
      RESCHEDULE_VECTOR and CALL_FUNCTION_SINGLE_VECTOR etc.
      
      This patch introduce a mechanism to handle certain performance-critical
      WRMSRs in a very early stage of KVM VMExit handler.
      
      This mechanism is specifically used for accelerating writes to x2APIC ICR
      that attempt to send a virtual IPI with physical destination-mode, fixed
      delivery-mode and single target. Which was found as one of the main causes
      of VMExits for Linux workloads.
      
      The reason this mechanism significantly reduce the latency of such virtual
      IPIs is by sending the physical IPI to the target vCPU in a very early stage
      of KVM VMExit handler, before host interrupts are enabled and before expensive
      operations such as reacquiring KVM’s SRCU lock.
      Latency is reduced even more when KVM is able to use APICv posted-interrupt
      mechanism (which allows to deliver the virtual IPI directly to target vCPU
      without the need to kick it to host).
      
      Testing on Xeon Skylake server:
      
      The virtual IPI latency from sender send to receiver receive reduces
      more than 200+ cpu cycles.
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1e9e2622
  4. 09 1月, 2020 2 次提交
    • S
      KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM · 736c291c
      Sean Christopherson 提交于
      Convert a plethora of parameters and variables in the MMU and page fault
      flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
      
      Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
      addresses.  When TDP is enabled, the fault address is a guest physical
      address and thus can be a 64-bit value, even when both KVM and its guest
      are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
      64-bit field, not a natural width field.
      
      Using a gva_t for the fault address means KVM will incorrectly drop the
      upper 32-bits of the GPA.  Ditto for gva_to_gpa() when it is used to
      translate L2 GPAs to L1 GPAs.
      
      Opportunistically rename variables and parameters to better reflect the
      dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
      "addr" instead of "vaddr" when the address may be either a GVA or an L2
      GPA.  Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
      a confusing "gpa_t gva" declaration; this also sets the stage for a
      future patch to combing nonpaging_page_fault() and tdp_page_fault() with
      minimal churn.
      
      Sprinkle in a few comments to document flows where an address is known
      to be a GVA and thus can be safely truncated to a 32-bit value.  Add
      WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
      document such cases and detect bugs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      736c291c
    • P
      KVM: X86: Use APIC_DEST_* macros properly in kvm_lapic_irq.dest_mode · c96001c5
      Peter Xu 提交于
      We were using either APIC_DEST_PHYSICAL|APIC_DEST_LOGICAL or 0|1 to
      fill in kvm_lapic_irq.dest_mode.  It's fine only because in most cases
      when we check against dest_mode it's against APIC_DEST_PHYSICAL (which
      equals to 0).  However, that's not consistent.  We'll have problem
      when we want to start checking against APIC_DEST_LOGICAL, which does
      not equals to 1.
      
      This patch firstly introduces kvm_lapic_irq_dest_mode() helper to take
      any boolean of destination mode and return the APIC_DEST_* macro.
      Then, it replaces the 0|1 settings of irq.dest_mode with the helper.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c96001c5
  5. 21 11月, 2019 1 次提交
  6. 15 11月, 2019 3 次提交
    • N
      KVM: x86: deliver KVM IOAPIC scan request to target vCPUs · 7ee30bc1
      Nitesh Narayan Lal 提交于
      In IOAPIC fixed delivery mode instead of flushing the scan
      requests to all vCPUs, we should only send the requests to
      vCPUs specified within the destination field.
      
      This patch introduces kvm_get_dest_vcpus_mask() API which
      retrieves an array of target vCPUs by using
      kvm_apic_map_get_dest_lapic() and then based on the
      vcpus_idx, it sets the bit in a bitmap. However, if the above
      fails kvm_get_dest_vcpus_mask() finds the target vCPUs by
      traversing all available vCPUs. Followed by setting the
      bits in the bitmap.
      
      If we had different vCPUs in the previous request for the
      same redirection table entry then bits corresponding to
      these vCPUs are also set. This to done to keep
      ioapic_handled_vectors synchronized.
      
      This bitmap is then eventually passed on to
      kvm_make_vcpus_request_mask() to generate a masked request
      only for the target vCPUs.
      
      This would enable us to reduce the latency overhead on isolated
      vCPUs caused by the IPI to process due to KVM_REQ_IOAPIC_SCAN.
      Suggested-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NNitesh Narayan Lal <nitesh@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7ee30bc1
    • L
      KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC · b35e5548
      Like Xu 提交于
      Currently, a host perf_event is created for a vPMC functionality emulation.
      It’s unpredictable to determine if a disabled perf_event will be reused.
      If they are disabled and are not reused for a considerable period of time,
      those obsolete perf_events would increase host context switch overhead that
      could have been avoided.
      
      If the guest doesn't WRMSR any of the vPMC's MSRs during an entire vcpu
      sched time slice, and its independent enable bit of the vPMC isn't set,
      we can predict that the guest has finished the use of this vPMC, and then
      do request KVM_REQ_PMU in kvm_arch_sched_in and release those perf_events
      in the first call of kvm_pmu_handle_event() after the vcpu is scheduled in.
      
      This lazy mechanism delays the event release time to the beginning of the
      next scheduled time slice if vPMC's MSRs aren't changed during this time
      slice. If guest comes back to use this vPMC in next time slice, a new perf
      event would be re-created via perf_event_create_kernel_counter() as usual.
      Suggested-by: NWei Wang <wei.w.wang@intel.com>
      Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NLike Xu <like.xu@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b35e5548
    • L
      KVM: x86/vPMU: Reuse perf_event to avoid unnecessary pmc_reprogram_counter · a6da0d77
      Like Xu 提交于
      The perf_event_create_kernel_counter() in the pmc_reprogram_counter() is
      a heavyweight and high-frequency operation, especially when host disables
      the watchdog (maximum 21000000 ns) which leads to an unacceptable latency
      of the guest NMI handler. It limits the use of vPMUs in the guest.
      
      When a vPMC is fully enabled, the legacy reprogram_*_counter() would stop
      and release its existing perf_event (if any) every time EVEN in most cases
      almost the same requested perf_event will be created and configured again.
      
      For each vPMC, if the reuqested config ('u64 eventsel' for gp and 'u8 ctrl'
      for fixed) is the same as its current config AND a new sample period based
      on pmc->counter is accepted by host perf interface, the current event could
      be reused safely as a new created one does. Otherwise, do release the
      undesirable perf_event and reprogram a new one as usual.
      
      It's light-weight to call pmc_pause_counter (disable, read and reset event)
      and pmc_resume_counter (recalibrate period and re-enable event) as guest
      expects instead of release-and-create again on any condition. Compared to
      use the filterable event->attr or hw.config, a new 'u64 current_config'
      field is added to save the last original programed config for each vPMC.
      
      Based on this implementation, the number of calls to pmc_reprogram_counter
      is reduced by ~82.5% for a gp sampling event and ~99.9% for a fixed event.
      In the usage of multiplexing perf sampling mode, the average latency of the
      guest NMI handler is reduced from 104923 ns to 48393 ns (~2.16x speed up).
      If host disables watchdog, the minimum latecy of guest NMI handler could be
      speed up at ~3413x (from 20407603 to 5979 ns) and at ~786x in the average.
      Suggested-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NLike Xu <like.xu@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a6da0d77
  7. 05 11月, 2019 1 次提交
  8. 04 11月, 2019 1 次提交
    • P
      kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830
      Paolo Bonzini 提交于
      With some Intel processors, putting the same virtual address in the TLB
      as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
      and cause the processor to issue a machine check resulting in a CPU lockup.
      
      Unfortunately when EPT page tables use huge pages, it is possible for a
      malicious guest to cause this situation.
      
      Add a knob to mark huge pages as non-executable. When the nx_huge_pages
      parameter is enabled (and we are using EPT), all huge pages are marked as
      NX. If the guest attempts to execute in one of those pages, the page is
      broken down into 4K pages, which are then marked executable.
      
      This is not an issue for shadow paging (except nested EPT), because then
      the host is in control of TLB flushes and the problematic situation cannot
      happen.  With nested EPT, again the nested guest can cause problems shadow
      and direct EPT is treated in the same way.
      
      [ tglx: Fixup default to auto and massage wording a bit ]
      Originally-by: NJunaid Shahid <junaids@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      b8e8c830
  9. 23 10月, 2019 1 次提交
    • J
      KVM: nVMX: Don't leak L1 MMIO regions to L2 · 671ddc70
      Jim Mattson 提交于
      If the "virtualize APIC accesses" VM-execution control is set in the
      VMCS, the APIC virtualization hardware is triggered when a page walk
      in VMX non-root mode terminates at a PTE wherein the address of the 4k
      page frame matches the APIC-access address specified in the VMCS. On
      hardware, the APIC-access address may be any valid 4k-aligned physical
      address.
      
      KVM's nVMX implementation enforces the additional constraint that the
      APIC-access address specified in the vmcs12 must be backed by
      a "struct page" in L1. If not, L0 will simply clear the "virtualize
      APIC accesses" VM-execution control in the vmcs02.
      
      The problem with this approach is that the L1 guest has arranged the
      vmcs12 EPT tables--or shadow page tables, if the "enable EPT"
      VM-execution control is clear in the vmcs12--so that the L2 guest
      physical address(es)--or L2 guest linear address(es)--that reference
      the L2 APIC map to the APIC-access address specified in the
      vmcs12. Without the "virtualize APIC accesses" VM-execution control in
      the vmcs02, the APIC accesses in the L2 guest will directly access the
      APIC-access page in L1.
      
      When there is no mapping whatsoever for the APIC-access address in L1,
      the L2 VM just loses the intended APIC virtualization. However, when
      the APIC-access address is mapped to an MMIO region in L1, the L2
      guest gets direct access to the L1 MMIO device. For example, if the
      APIC-access address specified in the vmcs12 is 0xfee00000, then L2
      gets direct access to L1's APIC.
      
      Since this vmcs12 configuration is something that KVM cannot
      faithfully emulate, the appropriate response is to exit to userspace
      with KVM_INTERNAL_ERROR_EMULATION.
      
      Fixes: fe3ef05c ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12")
      Reported-by: NDan Cross <dcross@google.com>
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Reviewed-by: NPeter Shier <pshier@google.com>
      Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      671ddc70
  10. 22 10月, 2019 5 次提交
  11. 27 9月, 2019 1 次提交
    • P
      KVM: x86: assign two bits to track SPTE kinds · 6eeb4ef0
      Paolo Bonzini 提交于
      Currently, we are overloading SPTE_SPECIAL_MASK to mean both
      "A/D bits unavailable" and MMIO, where the difference between the
      two is determined by mio_mask and mmio_value.
      
      However, the next patch will need two bits to distinguish
      availability of A/D bits from write protection.  So, while at
      it give MMIO its own bit pattern, and move the two bits from
      bit 62 to bits 52..53 since Intel is allocating EPT page table
      bits from the top.
      Reviewed-by: NJunaid Shahid <junaids@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6eeb4ef0
  12. 25 9月, 2019 3 次提交
  13. 24 9月, 2019 8 次提交
  14. 14 9月, 2019 1 次提交
    • S
      KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot · 002c5f73
      Sean Christopherson 提交于
      James Harvey reported a livelock that was introduced by commit
      d012a06a ("Revert "KVM: x86/mmu: Zap only the relevant pages when
      removing a memslot"").
      
      The livelock occurs because kvm_mmu_zap_all() as it exists today will
      voluntarily reschedule and drop KVM's mmu_lock, which allows other vCPUs
      to add shadow pages.  With enough vCPUs, kvm_mmu_zap_all() can get stuck
      in an infinite loop as it can never zap all pages before observing lock
      contention or the need to reschedule.  The equivalent of kvm_mmu_zap_all()
      that was in use at the time of the reverted commit (4e103134, "KVM:
      x86/mmu: Zap only the relevant pages when removing a memslot") employed
      a fast invalidate mechanism and was not susceptible to the above livelock.
      
      There are three ways to fix the livelock:
      
      - Reverting the revert (commit d012a06a) is not a viable option as
        the revert is needed to fix a regression that occurs when the guest has
        one or more assigned devices.  It's unlikely we'll root cause the device
        assignment regression soon enough to fix the regression timely.
      
      - Remove the conditional reschedule from kvm_mmu_zap_all().  However, although
        removing the reschedule would be a smaller code change, it's less safe
        in the sense that the resulting kvm_mmu_zap_all() hasn't been used in
        the wild for flushing memslots since the fast invalidate mechanism was
        introduced by commit 6ca18b69 ("KVM: x86: use the fast way to
        invalidate all pages"), back in 2013.
      
      - Reintroduce the fast invalidate mechanism and use it when zapping shadow
        pages in response to a memslot being deleted/moved, which is what this
        patch does.
      
      For all intents and purposes, this is a revert of commit ea145aac
      ("Revert "KVM: MMU: fast invalidate all pages"") and a partial revert of
      commit 7390de1e ("Revert "KVM: x86: use the fast way to invalidate
      all pages""), i.e. restores the behavior of commit 5304b8d3 ("KVM:
      MMU: fast invalidate all pages") and commit 6ca18b69 ("KVM: x86:
      use the fast way to invalidate all pages") respectively.
      
      Fixes: d012a06a ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"")
      Reported-by: NJames Harvey <jamespharvey20@gmail.com>
      Cc: Alex Willamson <alex.williamson@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      002c5f73
  15. 12 9月, 2019 1 次提交
    • L
      KVM: x86: Fix INIT signal handling in various CPU states · 4b9852f4
      Liran Alon 提交于
      Commit cd7764fe ("KVM: x86: latch INITs while in system management mode")
      changed code to latch INIT while vCPU is in SMM and process latched INIT
      when leaving SMM. It left a subtle remark in commit message that similar
      treatment should also be done while vCPU is in VMX non-root-mode.
      
      However, INIT signals should actually be latched in various vCPU states:
      (*) For both Intel and AMD, INIT signals should be latched while vCPU
      is in SMM.
      (*) For Intel, INIT should also be latched while vCPU is in VMX
      operation and later processed when vCPU leaves VMX operation by
      executing VMXOFF.
      (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
      or in guest-mode with intercept defined on INIT signal.
      
      To fix this:
      1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
      can define the various CPU states in which INIT signals should be
      blocked and modify kvm_apic_accept_events() to use it.
      2) Modify vmx_check_nested_events() to check for pending INIT signal
      while vCPU in guest-mode. If so, emualte vmexit on
      EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
      but is currently left as a TODO comment to implement in the future
      because nSVM don't yet implement svm_check_nested_events().
      
      Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI
      activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
      guest (See nested_vmx_setup_ctls_msrs()).
      If and when support for this activity state will be implemented,
      kvm_check_nested_events() would need to avoid emulating vmexit on
      INIT signal in case activity-state is wait-for-SIPI. In addition,
      kvm_apic_accept_events() would need to be modified to avoid discarding
      SIPI in case VMX activity-state is wait-for-SIPI but instead delay
      SIPI processing to vmx_check_nested_events() that would clear
      pending APIC events and emulate vmexit on SIPI.
      Reviewed-by: NJoao Martins <joao.m.martins@oracle.com>
      Co-developed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4b9852f4
  16. 11 9月, 2019 2 次提交
  17. 10 9月, 2019 1 次提交
  18. 22 8月, 2019 2 次提交
  19. 05 8月, 2019 1 次提交