1. 05 8月, 2021 1 次提交
    • P
      KVM: xen: do not use struct gfn_to_hva_cache · 319afe68
      Paolo Bonzini 提交于
      gfn_to_hva_cache is not thread-safe, so it is usually used only within
      a vCPU (whose code is protected by vcpu->mutex).  The Xen interface
      implementation has such a cache in kvm->arch, but it is not really
      used except to store the location of the shared info page.  Replace
      shinfo_set and shinfo_cache with just the value that is passed via
      KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
      initialization value is not zero anymore and therefore kvm_xen_init_vm
      needs to be introduced.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      319afe68
  2. 04 8月, 2021 4 次提交
    • L
      KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces · e79f49c3
      Like Xu 提交于
      Based on our observations, after any vm-exit associated with vPMU, there
      are at least two or more perf interfaces to be called for guest counter
      emulation, such as perf_event_{pause, read_value, period}(), and each one
      will {lock, unlock} the same perf_event_ctx. The frequency of calls becomes
      more severe when guest use counters in a multiplexed manner.
      
      Holding a lock once and completing the KVM request operations in the perf
      context would introduce a set of impractical new interfaces. So we can
      further optimize the vPMU implementation by avoiding repeated calls to
      these interfaces in the KVM context for at least one pattern:
      
      After we call perf_event_pause() once, the event will be disabled and its
      internal count will be reset to 0. So there is no need to pause it again
      or read its value. Once the event is paused, event period will not be
      updated until the next time it's resumed or reprogrammed. And there is
      also no need to call perf_event_period twice for a non-running counter,
      considering the perf_event for a running counter is never paused.
      
      Based on this implementation, for the following common usage of
      sampling 4 events using perf on a 4u8g guest:
      
        echo 0 > /proc/sys/kernel/watchdog
        echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
        echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
        echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
        for i in `seq 1 1 10`
        do
        taskset -c 0 perf record \
        -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
        /root/br_instr a
        done
      
      the average latency of the guest NMI handler is reduced from
      37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
      Also, in addition to collecting more samples, no loss of sampling
      accuracy was observed compared to before the optimization.
      Signed-off-by: NLike Xu <likexu@tencent.com>
      Message-Id: <20210728120705.6855-1-likexu@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      e79f49c3
    • P
      KVM: X86: Optimize zapping rmap · a75b5404
      Peter Xu 提交于
      Using rmap_get_first() and rmap_remove() for zapping a huge rmap list could be
      slow.  The easy way is to travers the rmap list, collecting the a/d bits and
      free the slots along the way.
      
      Provide a pte_list_destroy() and do exactly that.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Message-Id: <20210730220605.26377-1-peterx@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a75b5404
    • P
      KVM: X86: Optimize pte_list_desc with per-array counter · 13236e25
      Peter Xu 提交于
      Add a counter field into pte_list_desc, so as to simplify the add/remove/loop
      logic.  E.g., we don't need to loop over the array any more for most reasons.
      
      This will make more sense after we've switched the array size to be larger
      otherwise the counter will be a waste.
      
      Initially I wanted to store a tail pointer at the head of the array list so we
      don't need to traverse the list at least for pushing new ones (if without the
      counter we traverse both the list and the array).  However that'll need
      slightly more change without a huge lot benefit, e.g., after we grow entry
      numbers per array the list traversing is not so expensive.
      
      So let's be simple but still try to get as much benefit as we can with just
      these extra few lines of changes (not to mention the code looks easier too
      without looping over arrays).
      
      I used the same a test case to fork 500 child and recycle them ("./rmap_fork
      500" [1]), this patch further speeds up the total fork time of about 4%, which
      is a total of 33% of vanilla kernel:
      
              Vanilla:      473.90 (+-5.93%)
              3->15 slots:  366.10 (+-4.94%)
              Add counter:  351.00 (+-3.70%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3Signed-off-by: NPeter Xu <peterx@redhat.com>
      Message-Id: <20210730220602.26327-1-peterx@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      13236e25
    • P
      KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger · dc1cff96
      Peter Xu 提交于
      Currently rmap array element only contains 3 entries.  However for EPT=N there
      could have a lot of guest pages that got tens of even hundreds of rmap entry.
      
      A normal distribution of a 6G guest (even if idle) shows this with rmap count
      statistics:
      
      Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
      Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
      Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
      Level=1G:       32      0       0       0       0       0       0       0       0       0       0
      
      If we do some more fork some pages will grow even larger rmap counts.
      
      This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the general
      use case of EPT=N as we do list reference less and the loops over PTE_LIST_EXT
      will be slightly more efficient; but still not too large so less waste when
      array not full.
      
      It should not affecting EPT=Y since EPT normally only has zero or one rmap
      entry for each page, so no array is even allocated.
      
      With a test case to fork 500 child and recycle them ("./rmap_fork 500" [1]),
      this patch speeds up fork time of about 29%.
      
          Before: 473.90 (+-5.93%)
          After:  366.10 (+-4.94%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3Signed-off-by: NPeter Xu <peterx@redhat.com>
      Message-Id: <20210730220455.26054-6-peterx@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      dc1cff96
  3. 03 8月, 2021 3 次提交
    • H
      KVM: const-ify all relevant uses of struct kvm_memory_slot · 269e9552
      Hamza Mahfooz 提交于
      As alluded to in commit f36f3f28 ("KVM: add "new" argument to
      kvm_arch_commit_memory_region"), a bunch of other places where struct
      kvm_memory_slot is used, needs to be refactored to preserve the
      "const"ness of struct kvm_memory_slot across-the-board.
      Signed-off-by: NHamza Mahfooz <someguy@effective-light.com>
      Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
      [Do not touch body of slot_rmap_walk_init. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      269e9552
    • P
      KVM: Don't take mmu_lock for range invalidation unless necessary · 071064f1
      Paolo Bonzini 提交于
      Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications
      that are unrelated to KVM.  This is possible now that memslot updates are
      blocked from range_start() to range_end(); that ensures that lock elision
      happens in both or none, and therefore that mmu_notifier_count updates
      (which must occur while holding mmu_lock for write) are always paired
      across start->end.
      
      Based on patches originally written by Ben Gardon.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      071064f1
    • P
      KVM: Block memslot updates across range_start() and range_end() · 52ac8b35
      Paolo Bonzini 提交于
      We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
      notifications that are unrelated to KVM.  Because mmu_notifier_count
      must be modified while holding mmu_lock for write, and must always
      be paired across start->end to stay balanced, lock elision must
      happen in both or none.  Therefore, in preparation for this change,
      this patch prevents memslot updates across range_start() and range_end().
      
      Note, technically flag-only memslot updates could be allowed in parallel,
      but stalling a memslot update for a relatively short amount of time is
      not a scalability issue, and this is all more than complex enough.
      
      A long note on the locking: a previous version of the patch used an rwsem
      to block the memslot update while the MMU notifier run, but this resulted
      in the following deadlock involving the pseudo-lock tagged as
      "mmu_notifier_invalidate_range_start".
      
         ======================================================
         WARNING: possible circular locking dependency detected
         5.12.0-rc3+ #6 Tainted: G           OE
         ------------------------------------------------------
         qemu-system-x86/3069 is trying to acquire lock:
         ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
      
         but task is already holding lock:
         ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
      
         which lock already depends on the new lock.
      
      This corresponds to the following MMU notifier logic:
      
          invalidate_range_start
            take pseudo lock
            down_read()           (*)
            release pseudo lock
          invalidate_range_end
            take pseudo lock      (**)
            up_read()
            release pseudo lock
      
      At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
      at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
      
      This could cause a deadlock (ignoring for a second that the pseudo lock
      is not a lock):
      
      - invalidate_range_start waits on down_read(), because the rwsem is
      held by install_new_memslots
      
      - install_new_memslots waits on down_write(), because the rwsem is
      held till (another) invalidate_range_end finishes
      
      - invalidate_range_end sits waits on the pseudo lock, held by
      invalidate_range_start.
      
      Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
      it would change the *shared* rwsem readers into *shared recursive*
      readers), so open-code the wait using a readers count and a
      spinlock.  This also allows handling blockable and non-blockable
      critical section in the same way.
      
      Losing the rwsem fairness does theoretically allow MMU notifiers to
      block install_new_memslots forever.  Note that mm/mmu_notifier.c's own
      retry scheme in mmu_interval_read_begin also uses wait/wake_up
      and is likewise not fair.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      52ac8b35
  4. 02 8月, 2021 32 次提交