1. 12 5月, 2022 3 次提交
    • S
      KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use · 54275f74
      Sean Christopherson 提交于
      Check for A/D bits being disabled instead of the access tracking mask
      being non-zero when deciding whether or not to attempt to fix a page
      fault vian the fast path.  Originally, the access tracking mask was
      non-zero if and only if A/D bits were disabled by _KVM_ (including not
      being supported by hardware), but that hasn't been true since nVMX was
      fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
      KVM to not use A/D bits while running L2 despite KVM using them while
      running L1.
      
      In other words, don't attempt the fast path just because EPT is enabled.
      
      Note, attempting the fast path for all !PRESENT faults can "fix" a very,
      _VERY_ tiny percentage of faults out of mmu_lock by detecting that the
      fault is spurious, i.e. has been fixed by a different vCPU, but again the
      odds of that happening are vanishingly small.  E.g. booting an 8-vCPU VM
      gets less than 10 successes out of 30k+ faults, and that's likely one of
      the more favorable scenarios.  Disabling dirty logging can likely lead to
      a rash of collisions between vCPUs for some workloads that operate on a
      common set of pages, but penalizing _all_ !PRESENT faults for that one
      case is unlikely to be a net positive, not to mention that that problem
      is best solved by not zapping in the first place.
      
      The number of spurious faults does scale with the number of vCPUs, e.g. a
      255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
      path (again out of 30k), but that's all of 0.2% of faults.  Using legacy
      shadow paging does get more spurious faults, and a few more detected out
      of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
      faults that are reflected into the guest), i.e. the extra detections are
      purely due to the sheer number of faults observed.
      
      On the other hand, getting a "negative" in the fast path takes in the
      neighborhood of 150-250 cycles.  So while it is tempting to keep/extend
      the current behavior, such a change needs to come with hard numbers
      showing that it's actually a win in the grand scheme, or any scheme for
      that matter.
      
      Fixes: 995f00a6 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-5-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      54275f74
    • L
      KVM: VMX: clean up pi_wakeup_handler · 91ab933f
      Li RongQing 提交于
      Passing per_cpu() to list_for_each_entry() causes the macro to be
      evaluated N+1 times for N sleeping vCPUs.  This is a very small
      inefficiency, and the code is cleaner if the address of the per-CPU
      variable is loaded earlier.  Do this for both the list and the spinlock.
      Signed-off-by: NLi RongQing <lirongqing@baidu.com>
      Message-Id: <1649244302-6777-1-git-send-email-lirongqing@baidu.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      91ab933f
    • M
      KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicness · 33fbe6be
      Maxim Levitsky 提交于
      This shows up as a TDP MMU leak when running nested.  Non-working cmpxchg on L0
      relies makes L1 install two different shadow pages under same spte, and one of
      them is leaked.
      
      Fixes: 1c2361f6 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
      Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220512101420.306759-1-mlevitsk@redhat.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      33fbe6be
  2. 03 5月, 2022 7 次提交
    • P
      Merge branch 'kvm-amd-pmu-fixes' into HEAD · 99132883
      Paolo Bonzini 提交于
      99132883
    • S
      kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU · 5a1bde46
      Sandipan Das 提交于
      On some x86 processors, CPUID leaf 0xA provides information
      on Architectural Performance Monitoring features. It
      advertises a PMU version which Qemu uses to determine the
      availability of additional MSRs to manage the PMCs.
      
      Upon receiving a KVM_GET_SUPPORTED_CPUID ioctl request for
      the same, the kernel constructs return values based on the
      x86_pmu_capability irrespective of the vendor.
      
      This leaf and the additional MSRs are not supported on AMD
      and Hygon processors. If AMD PerfMonV2 is detected, the PMU
      version is set to 2 and guest startup breaks because of an
      attempt to access a non-existent MSR. Return zeros to avoid
      this.
      
      Fixes: a6c06ed1 ("KVM: Expose the architectural performance monitoring CPUID leaf")
      Reported-by: NVasant Hegde <vasant.hegde@amd.com>
      Signed-off-by: NSandipan Das <sandipan.das@amd.com>
      Message-Id: <3fef83d9c2b2f7516e8ff50d60851f29a4bcb716.1651058600.git.sandipan.das@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5a1bde46
    • K
      KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id · 5eb84932
      Kyle Huey 提交于
      Zen renumbered some of the performance counters that correspond to the
      well known events in perf_hw_id. This code in KVM was never updated for
      that, so guest that attempt to use counters on Zen that correspond to the
      pre-Zen perf_hw_id values will silently receive the wrong values.
      
      This has been observed in the wild with rr[0] when running in Zen 3
      guests. rr uses the retired conditional branch counter 00d1 which is
      incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND.
      
      [0] https://rr-project.org/Signed-off-by: NKyle Huey <me@kylehuey.com>
      Message-Id: <20220503050136.86298-1-khuey@kylehuey.com>
      Cc: stable@vger.kernel.org
      [Check guest family, not host. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5eb84932
    • P
      Merge branch 'kvm-tdp-mmu-atomicity-fix' into HEAD · 6ea6581f
      Paolo Bonzini 提交于
      We are dropping A/D bits (and W bits) in the TDP MMU.  Even if mmu_lock
      is held for write, as volatile SPTEs can be written by other tasks/vCPUs
      outside of mmu_lock.
      
      Attempting to prove that bug exposed another notable goof, which has been
      lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs
      as volatile, even though KVM never clears WRITABLE outside of MMU lock.
      As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
      update writable SPTEs.
      
      The fix does not seem to have an easily-measurable affect on performance;
      page faults are so slow that wasting even a few hundred cycles is dwarfed
      by the base cost.
      6ea6581f
    • S
      KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits · ba3a6120
      Sean Christopherson 提交于
      Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
      if mmu_lock is held for write, as volatile SPTEs can be written by other
      tasks/vCPUs outside of mmu_lock.  If a vCPU uses the to-be-modified SPTE
      to write a page, the CPU can cache the translation as WRITABLE in the TLB
      despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
      Accessed/Dirty bits and not properly tag the backing page.
      
      Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
      non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
      KVM doesn't consume the Accessed bit of non-leaf SPTEs.
      
      Dropping the Dirty and/or Writable bits is most problematic for dirty
      logging, as doing so can result in a missed TLB flush and eventually a
      missed dirty page.  In the unlikely event that the only dirty page(s) is
      a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
      (based on the Dirty or Writable bit depending on the method) and so not
      update the SPTE and ultimately not flush.  If the SPTE is cached in the
      TLB as writable before it is clobbered, the guest can continue writing
      the associated page without ever taking a write-protect fault.
      
      For most (all?) file back memory, dropping the Dirty bit is a non-issue.
      The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
      bit is effectively ignored because the primary MMU will mark that page
      dirty when the write-protection is lifted, e.g. when KVM faults the page
      back in for write.
      
      The Accessed bit is a complete non-issue.  Aside from being unused for
      non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
      Accessed bit may be dropped anyways.
      
      Lastly, the Writable bit is also problematic as an extension of the Dirty
      bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
      !DIRTY && WRITABLE.  If KVM fixes an MMU-writable, but !WRITABLE, SPTE
      out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
      the SPTE being !WRITABLE when it is checked by KVM.  But that all depends
      on the Dirty bit being problematic in the first place.
      
      Fixes: 2f2fad08 ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Venkatesh Srinivas <venkateshs@google.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-4-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ba3a6120
    • S
      KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits() · 54eb3ef5
      Sean Christopherson 提交于
      Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
      and into its callers.  Well, caller, since only one of its two callers
      doesn't already do the shadow-present check.
      
      Opportunistically move the helper to spte.c/h so that it can be used by
      the TDP MMU, which is also the primary motivation for the shadow-present
      change.  Unlike the legacy MMU, the TDP MMU uses a single path for clear
      leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
      MMU will need to check is_last_spte() prior to calling
      spte_has_volatile_bits(), and calling is_last_spte() without first
      calling is_shadow_present_spte() is at best odd, and at worst a violation
      of KVM's loosely defines SPTE rules.
      
      Note, mmu_spte_clear_track_bits() could likely skip the write entirely
      for SPTEs that are not shadow-present.  Leave that cleanup for a future
      patch to avoid introducing a functional change, and because the
      shadow-present check can likely be moved further up the stack, e.g.
      drop_large_spte() appears to be the only path that doesn't already
      explicitly check for a shadow-present SPTE.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-3-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      54eb3ef5
    • S
      KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D) · 706c9c55
      Sean Christopherson 提交于
      Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
      being volatile (unless they're volatile for other reasons, e.g. A/D bits).
      KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
      out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
      cleared just because the SPTE is MMU-writable.
      
      Rename the wrapper of MMU-writable to be more literal, the previous name
      of spte_can_locklessly_be_made_writable() is wrong and misleading.
      
      Fixes: c7ba5b48 ("KVM: MMU: fast path of handling guest page fault")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-2-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      706c9c55
  3. 02 5月, 2022 3 次提交
  4. 30 4月, 2022 27 次提交