1. 20 January 2022, 12 commits
    • KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks · c3e8abf0
      Committed by Sean Christopherson
      Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c3e8abf0
    • KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers · b6d42bad
      Committed by Sean Christopherson
      Unexport switch_to_{hv,sw}_timer() now that common x86 handles the
      transitions.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b6d42bad
    • KVM: VMX: Move preemption timer <=> hrtimer dance to common x86 · 98c25ead
      Committed by Sean Christopherson
      Handle the switch to/from the hypervisor/software timer when a vCPU is
      blocking in common x86 instead of in VMX.  Even though VMX is the only
      user of a hypervisor timer, the logic and all functions involved are
      generic x86 (unless future CPUs do something completely different and
      implement a hypervisor timer that runs regardless of mode).
      
      Handling the switch in common x86 will allow for the elimination of the
      pre/post_block hooks, and also lets KVM switch back to the hypervisor
      timer if and only if it was in use (without additional params).  Add a
      comment explaining why the switch cannot be deferred to kvm_sched_out()
      or kvm_vcpu_block().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      98c25ead
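      A minimal sketch of the ordering described above, assuming a simplified
      blocking helper in common x86 code. The kvm_lapic_* names follow the
      switch_to_{hv,sw}_timer() helpers mentioned in these entries, and
      kvm_lapic_hv_timer_in_use() is assumed here to report whether the
      hypervisor (VMX preemption) timer was active.

        /* Illustrative only, not the actual kvm_vcpu_block() call site. */
        static void vcpu_block_sketch(struct kvm_vcpu *vcpu)
        {
                /*
                 * The hypervisor timer only counts down while the vCPU is in
                 * guest mode, so fall back to the hrtimer-based software timer
                 * for the duration of the block.
                 */
                bool hv_timer = kvm_lapic_hv_timer_in_use(vcpu);

                if (hv_timer)
                        kvm_lapic_switch_to_sw_timer(vcpu);

                kvm_vcpu_block(vcpu);

                /* Switch back if and only if the hypervisor timer was in use. */
                if (hv_timer)
                        kvm_lapic_switch_to_hv_timer(vcpu);
        }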
    • KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx · 12a8eee5
      Committed by Sean Christopherson
      Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and
      rename the list and all associated variables to clarify that it tracks
      the set of vCPUs that need to be poked on a posted interrupt to the wakeup
      vector.  The list is not used to track _all_ vCPUs that are blocking, and
      the term "blocked" can be misleading as it may refer to a blocking
      condition in the host or the guest, whereas the PI wakeup case is
      specifically for the vCPUs that are actively blocking from within the
      guest.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      12a8eee5
    • KVM: VMX: Handle PI descriptor updates during vcpu_put/load · d76fb406
      Committed by Sean Christopherson
      Move the posted interrupt pre/post_block logic into vcpu_put/load
      respectively, using kvm_vcpu_is_blocking() to determine whether or
      not the wakeup handler needs to be set (and unset).  This avoids updating
      the PI descriptor if halt-polling is successful, reduces the number of
      touchpoints for updating the descriptor, and eliminates the confusing
      behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU
      is scheduled back in after preemption.
      
      The downside is that KVM will do the PID update twice if the vCPU is
      preempted after prepare_to_rcuwait() but before schedule(), but that's a
      rare case (and non-existent on !PREEMPT kernels).
      
      The notable wart is the need to send a self-IPI on the wakeup vector if
      an outstanding notification is pending after configuring the wakeup
      vector.  Ideally, KVM would just do a kvm_vcpu_wake_up() in this case,
      but the scheduler doesn't support waking a task from its preemption
      notifier callback, i.e. while the task is right in the middle of
      being scheduled out.
      
      Note, setting the wakeup vector before halt-polling is not necessary:
      once the pending IRQ is recorded in the PIR, kvm_vcpu_has_events()
      will detect this (via kvm_cpu_get_interrupt(), kvm_apic_get_interrupt(),
      apic_has_interrupt_for_ppr() and finally vmx_sync_pir_to_irr()) and
      terminate the polling.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d76fb406
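      A rough sketch of the vcpu_put side of the split described above. The
      pi_* helpers used here are hypothetical stand-ins, not the actual
      vmx/posted_intr.c API; only kvm_vcpu_is_blocking() and
      POSTED_INTR_WAKEUP_VECTOR are taken from the entry itself.

        /* Illustrative stand-ins only. */
        static void vmx_pi_put_sketch(struct kvm_vcpu *vcpu)
        {
                if (!kvm_vcpu_is_blocking(vcpu))
                        return;

                /* Redirect notifications to the wakeup vector while blocking. */
                pi_set_notification_vector(vcpu, POSTED_INTR_WAKEUP_VECTOR); /* hypothetical */
                pi_add_to_wakeup_list(vcpu);                                 /* hypothetical */

                /*
                 * A posted interrupt that raced with the vector change targeted
                 * the old vector; poke the wakeup vector manually so it is not
                 * lost.  Waking the task directly is not possible here, as the
                 * preemption notifier runs mid-schedule (see above).
                 */
                if (pi_notification_pending(vcpu))                           /* hypothetical */
                        apic->send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
        }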
    • KVM: VMX: Reject KVM_RUN if emulation is required with pending exception · fc4fad79
      Committed by Sean Christopherson
      Reject KVM_RUN if emulation is required (because VMX is running without
      unrestricted guest) and an exception is pending, as KVM doesn't support
      emulating exceptions except when emulating real mode via vm86.  The vCPU
      is hosed either way, but letting KVM_RUN proceed triggers a WARN due to
      the impossible condition.  Alternatively, the WARN could be removed, but
      then userspace and/or KVM bugs would result in the vCPU silently running
      in a bad state, which isn't very friendly to users.
      
      Originally, the bug was hit by syzkaller with a nested guest as that
      doesn't require kvm_intel.unrestricted_guest=0.  That particular flavor
      is likely fixed by commit cd0e615c ("KVM: nVMX: Synthesize
      TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to
      trigger the WARN with a non-nested guest, and userspace can likely force
      bad state via ioctls() for a nested guest as well.
      
      Checking for the impossible condition needs to be deferred until KVM_RUN
      because KVM can't force specific ordering between ioctls.  E.g. clearing
      exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting
      it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with
      emulation_required would prevent userspace from queuing an exception and
      then stuffing sregs.  Note, if KVM were to try and detect/prevent the
      condition prior to KVM_RUN, handle_invalid_guest_state() and/or
      handle_emulation_failure() would need to be modified to clear the pending
      exception prior to exiting to userspace.
      
       ------------[ cut here ]------------
       WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel]
       CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279
       Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
       RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel]
       Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
       RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202
       RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006
       RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000
       RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006
       R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000
       FS:  00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0
       Call Trace:
        kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm]
        kvm_vcpu_ioctl+0x279/0x690 [kvm]
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x3b/0xc0
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211228232437.1875318-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fc4fad79
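      A minimal sketch of the kind of check described above, assuming a
      simplified pre-run hook; emulation_required and exception.pending are the
      fields named in the entry, the surrounding function is illustrative.

        /* Illustrative check, not the literal vmx.c change. */
        static int vmx_pre_run_check_sketch(struct kvm_vcpu *vcpu)
        {
                /*
                 * Emulation (i.e. no unrestricted guest) cannot inject
                 * exceptions, so refuse to run a vCPU that is in both states
                 * at once instead of tripping the WARN in
                 * vmx_queue_exception() later.
                 */
                if (to_vmx(vcpu)->emulation_required &&
                    vcpu->arch.exception.pending)
                        return -EINVAL;      /* reject KVM_RUN */

                return 0;
        }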
    • KVM: x86/pmu: Use binary search to check filtered events · 7ff775ac
      Committed by Jim Mattson
      The PMU event filter may contain up to 300 events. Replace the linear
      search in reprogram_gp_counter() with a binary search.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220115052431.447232-2-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7ff775ac
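      A small stand-alone sketch of the idea: keep the filter's event list
      sorted once (e.g. when userspace installs it) so the lookup in the
      reprogramming path becomes a standard bsearch() instead of a scan of up
      to 300 entries. The structure here is illustrative, not the actual
      kvm_pmu_event_filter layout.

        #include <stdint.h>
        #include <stdlib.h>

        struct event_filter {
                uint64_t *events;        /* sorted ascending */
                size_t nevents;
        };

        static int cmp_u64(const void *a, const void *b)
        {
                uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

                return (x > y) - (x < y);
        }

        static int filter_contains(const struct event_filter *f, uint64_t key)
        {
                /* O(log n) per reprogram instead of O(n). */
                return bsearch(&key, f->events, f->nevents, sizeof(key), cmp_u64) != NULL;
        }

      (A one-time sort with the same comparator when the filter is installed
      keeps the hot path cheap.)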
    • KVM: x86/cpuid: Clear XFD for component i if the base feature is missing · e9737468
      Committed by Like Xu
      According to Intel extended feature disable (XFD) spec, the sub-function i
      (i > 1) of CPUID function 0DH enumerates "details for state component i.
      ECX[2] enumerates support for XFD support for this state component."
      
      If KVM does not report the F(XFD) feature (e.g. because CONFIG_X86_64 is not set),
      then the corresponding XFD support for any state component i
      should also be removed. Translate this dependency into KVM terms.
      
      Fixes: 690a757d ("kvm: x86: Add CPUID support for Intel AMX")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220117074531.76925-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e9737468
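      A minimal sketch of the dependency: when KVM does not expose XFD itself,
      strip the per-component XFD bit from every CPUID 0xD sub-leaf with
      index >= 2. The loop and helper are illustrative; kvm_cpuid_entry2 is the
      regular UAPI entry layout.

        /* CPUID.0xD.i:ECX[2] - XFD support for state component i. */
        #define XFD_PER_COMPONENT_BIT  (1u << 2)

        static void clear_xfd_bits_sketch(struct kvm_cpuid_entry2 *entries,
                                          int nent, bool xfd_supported)
        {
                int i;

                if (xfd_supported)
                        return;

                for (i = 0; i < nent; i++) {
                        if (entries[i].function == 0xD && entries[i].index >= 2)
                                entries[i].ecx &= ~XFD_PER_COMPONENT_BIT;
                }
        }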
    • KVM: x86/mmu: Improve TLB flush comment in kvm_mmu_slot_remove_write_access() · 6ff94f27
      Committed by David Matlack
      Rewrite the comment in kvm_mmu_slot_remove_write_access() that explains
      why it is safe to flush TLBs outside of the MMU lock after
      write-protecting SPTEs for dirty logging. The current comment is a long
      run-on sentence that was difficult to understand. In addition it was
      specific to the shadow MMU (mentioning mmu_spte_update()) when the TDP
      MMU has to handle this as well.
      
      The new comment explains:
       - Why the TLB flush is necessary at all.
       - Why it is desirable to do the TLB flush outside of the MMU lock.
       - Why it is safe to do the TLB flush outside of the MMU lock.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-5-dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ff94f27
    • KVM: x86/mmu: Document and enforce MMU-writable and Host-writable invariants · 5f16bcac
      Committed by David Matlack
      SPTEs are tagged with software-only bits to indicate whether they are
      "MMU-writable" and "Host-writable". These bits are used to determine why
      KVM has marked an SPTE as read-only.
      
      Document these bits and their invariants, and enforce the invariants
      with new WARNs in spte_can_locklessly_be_made_writable() to ensure they
      are not accidentally violated in the future.
      
      Opportunistically move DEFAULT_SPTE_{MMU,HOST}_WRITABLE next to
      EPT_SPTE_{MMU,HOST}_WRITABLE since the new documentation applies to
      both.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-4-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5f16bcac
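      A compact sketch of the invariant checks, assuming the two relationships
      implied by the adjacent entries (MMU-writable implies Host-writable, and
      a hardware-writable SPTE implies MMU-writable); the mask parameters stand
      in for the DEFAULT_SPTE_*/EPT_SPTE_* values mentioned above.

        /* Illustrative; the real masks depend on the paging format in use. */
        static bool spte_writable_invariants_ok(u64 spte, u64 mmu_writable_mask,
                                                u64 host_writable_mask,
                                                u64 hw_writable_mask)
        {
                bool mmu_writable  = spte & mmu_writable_mask;
                bool host_writable = spte & host_writable_mask;
                bool hw_writable   = spte & hw_writable_mask;

                /* MMU-writable without Host-writable would be a bug ... */
                if (WARN_ON_ONCE(mmu_writable && !host_writable))
                        return false;
                /* ... as would a writable SPTE that is not MMU-writable. */
                if (WARN_ON_ONCE(hw_writable && !mmu_writable))
                        return false;

                return true;
        }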
    • KVM: x86/mmu: Clear MMU-writable during changed_pte notifier · f082d86e
      Committed by David Matlack
      When handling the changed_pte notifier and the new PTE is read-only,
      clear both the Host-writable and MMU-writable bits in the SPTE. This
      preserves the invariant that MMU-writable is set if-and-only-if
      Host-writable is set.
      
      No functional change intended. Nothing currently relies on the
      aforementioned invariant and technically the changed_pte notifier is
      dead code.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-3-dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f082d86e
    • KVM: x86/mmu: Fix write-protection of PTs mapped by the TDP MMU · 7c8a4742
      Committed by David Matlack
      When the TDP MMU is write-protecting GFNs for page table protection (as
      opposed to for dirty logging, or due to the HVA not being writable), it
      checks if the SPTE is already write-protected and if so skips modifying
      the SPTE and the TLB flush.
      
      This behavior is incorrect because it fails to check if the SPTE
      is write-protected for page table protection, i.e. fails to check
      that MMU-writable is '0'.  If the SPTE was write-protected for dirty
      logging but not page table protection, the SPTE could locklessly be made
      writable, and vCPUs could still be running with writable mappings cached
      in their TLB.
      
      Fix this by only skipping setting the SPTE if the SPTE is already
      write-protected *and* MMU-writable is already clear.  Technically,
      checking only MMU-writable would suffice; a SPTE cannot be writable
      without MMU-writable being set.  But check both to be paranoid and
      because it arguably yields more readable code.
      
      Fixes: 46044f72 ("kvm: x86/mmu: Support write protection for nesting in tdp MMU")
      Cc: stable@vger.kernel.org
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-2-dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c8a4742
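      A simplified sketch of the corrected skip condition in the TDP MMU
      write-protection path; the iterator is omitted and the mask parameter
      stands in for the MMU-writable bit discussed above.

        /*
         * Illustrative: for page-table protection, a SPTE may only be skipped
         * when it is already fully write-protected, i.e. both the hardware
         * writable bit and MMU-writable are clear.  Checking only the hardware
         * bit (the pre-fix behaviour) misses SPTEs that were write-protected
         * for dirty logging and can locklessly become writable again.
         */
        static bool wrprot_needs_update_sketch(u64 spte, u64 mmu_writable_mask)
        {
                return (spte & PT_WRITABLE_MASK) || (spte & mmu_writable_mask);
        }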
  2. 18 January 2022, 5 commits
    • KVM: VMX: switch blocked_vcpu_on_cpu_lock to raw spinlock · 5f02ef74
      Committed by Marcelo Tosatti
      blocked_vcpu_on_cpu_lock is taken from hard interrupt context
      (pi_wakeup_handler), therefore it cannot sleep.
      
      Switch it to a raw spinlock.
      
      Fixes:
      
      [41297.066254] BUG: scheduling while atomic: CPU 0/KVM/635218/0x00010001
      [41297.066323] Preemption disabled at:
      [41297.066324] [<ffffffff902ee47f>] irq_enter_rcu+0xf/0x60
      [41297.066339] Call Trace:
      [41297.066342]  <IRQ>
      [41297.066346]  dump_stack_lvl+0x34/0x44
      [41297.066353]  ? irq_enter_rcu+0xf/0x60
      [41297.066356]  __schedule_bug.cold+0x7d/0x8b
      [41297.066361]  __schedule+0x439/0x5b0
      [41297.066365]  ? task_blocks_on_rt_mutex.constprop.0.isra.0+0x1b0/0x440
      [41297.066369]  schedule_rtlock+0x1e/0x40
      [41297.066371]  rtlock_slowlock_locked+0xf1/0x260
      [41297.066374]  rt_spin_lock+0x3b/0x60
      [41297.066378]  pi_wakeup_handler+0x31/0x90 [kvm_intel]
      [41297.066388]  sysvec_kvm_posted_intr_wakeup_ipi+0x9d/0xd0
      [41297.066392]  </IRQ>
      [41297.066392]  asm_sysvec_kvm_posted_intr_wakeup_ipi+0x12/0x20
      ...
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5f02ef74
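      A minimal sketch of why the lock class matters here: the wakeup handler
      runs in hardirq context, and on PREEMPT_RT a regular spinlock_t becomes a
      sleeping lock, hence the "scheduling while atomic" splat above. The
      definitions, variable and field names below are illustrative, not the
      literal posted_intr.c code.

        static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_sketch);
        static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_lock_sketch);

        static void pi_wakeup_handler_sketch(void)
        {
                int cpu = smp_processor_id();
                struct vcpu_vmx *vmx;

                /* A raw_spinlock_t never sleeps, even on PREEMPT_RT. */
                raw_spin_lock(&per_cpu(wakeup_vcpus_lock_sketch, cpu));
                list_for_each_entry(vmx, &per_cpu(wakeup_vcpus_sketch, cpu),
                                    pi_wakeup_list)
                        if (pi_test_on(&vmx->pi_desc))
                                kvm_vcpu_wake_up(&vmx->vcpu);
                raw_spin_unlock(&per_cpu(wakeup_vcpus_lock_sketch, cpu));
        }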
    • KVM: x86: Making the module parameter of vPMU more common · 4732f244
      Committed by Like Xu
      The new module parameter to control PMU virtualization should apply
      to Intel as well as AMD, for situations where userspace is not trusted.
      If the module parameter allows PMU virtualization, there could be a
      new KVM_CAP or guest CPUID bits whereby userspace can enable/disable
      PMU virtualization on a per-VM basis.
      
      If the module parameter does not allow PMU virtualization, there
      should be no userspace override, since we have no precedent for
      authorizing that kind of override. If it's false, other counter-based
      profiling features (such as LBR including the associated CPUID bits
      if any) will not be exposed.
      
      Change its name from "pmu" to "enable_pmu" as we have temporary
      variables with the same name in our code like "struct kvm_pmu *pmu".
      
      Fixes: b1d66dad ("KVM: x86/svm: Add module param to control PMU virtualization")
      Suggested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220111073823.21885-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4732f244
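      A tiny sketch of the shared knob, assuming it lives in common x86 PMU
      code and is read-only after module load (the usual module_param idiom):

        /* Illustrative declaration of a common, read-only vPMU enable switch. */
        static bool __read_mostly enable_pmu = true;
        module_param(enable_pmu, bool, 0444);

        /* Vendor PMU setup would then simply bail out early: */
        /*     if (!enable_pmu) return;                        */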
    • KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN · c6617c61
      Committed by Vitaly Kuznetsov
      Commit feb627e8 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
      forbade changing CPUID altogether but unfortunately this is not fully
      compatible with existing VMMs. In particular, QEMU reuses vCPU fds for
      CPU hotplug after unplug and it calls KVM_SET_CPUID2. Instead of a full ban,
      check whether the supplied CPUID data is equal to what was previously set.
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Fixes: feb627e8 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-3-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      [Do not call kvm_find_cpuid_entry repeatedly. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c6617c61
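      A rough sketch of the "accept only a no-op update" rule, assuming the
      incoming entries have already been run through the same runtime tweaks as
      the cached ones (see the next entry); kvm_cpuid_entry2 is the UAPI
      layout, the helper itself is illustrative.

        /* Illustrative field-by-field comparison of two CPUID entry arrays. */
        static bool cpuid_update_is_noop_sketch(const struct kvm_cpuid_entry2 *e1,
                                                const struct kvm_cpuid_entry2 *e2,
                                                int nent1, int nent2)
        {
                int i;

                if (nent1 != nent2)
                        return false;

                for (i = 0; i < nent1; i++) {
                        if (e1[i].function != e2[i].function ||
                            e1[i].index    != e2[i].index    ||
                            e1[i].flags    != e2[i].flags    ||
                            e1[i].eax != e2[i].eax || e1[i].ebx != e2[i].ebx ||
                            e1[i].ecx != e2[i].ecx || e1[i].edx != e2[i].edx)
                                return false;
                }

                return true;
        }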
    • KVM: x86: Do runtime CPUID update before updating vcpu->arch.cpuid_entries · ee3a5f9e
      Committed by Vitaly Kuznetsov
      kvm_update_cpuid_runtime() mangles CPUID data coming from userspace
      VMM after updating 'vcpu->arch.cpuid_entries', this makes it
      impossible to compare an update with what was previously
      supplied. Introduce __kvm_update_cpuid_runtime() version which can be
      used to tweak the input before it goes to 'vcpu->arch.cpuid_entries'
      so the upcoming update check can compare tweaked data.
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ee3a5f9e
    • KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event · a2186448
      Committed by Like Xu
      According to CPUID 0x0A.EBX bit vector, the event [7] should be the
      unrealized event "Topdown Slots" instead of the *kernel* generalized
      common hardware event "REF_CPU_CYCLES", so we need to skip the CPUID
      unavailability check in intel_pmc_perf_hw_id() for the last
      REF_CPU_CYCLES event and update the confusing comment.
      
      If the event is marked as unavailable in the Intel guest CPUID
      0AH.EBX leaf, we need to avoid any perf_event creation, whether
      it's a gp or fixed counter. To distinguish whether it is a rejected
      event or an event that needs to be programmed with the PERF_TYPE_RAW type,
      a new special return value of "PERF_COUNT_HW_MAX + 1" is introduced.
      
      Fixes: 62079d8a ("KVM: PMU: add proper support for fixed counter 2")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220105051509.69437-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a2186448
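      A condensed sketch of the two decisions described above, with the
      sentinel meanings assumed as follows: PERF_COUNT_HW_MAX keeps its
      existing "no generic mapping, program as PERF_TYPE_RAW" role, and
      PERF_COUNT_HW_MAX + 1 is the new "rejected, do not create a perf_event"
      value. The table index handling is illustrative.

        static unsigned int arch_event_check_sketch(unsigned int idx,
                                                    u32 available_mask)
        {
                /*
                 * Index 7 of KVM's event table is REF_CPU_CYCLES, while bit 7
                 * of CPUID 0x0A.EBX describes the unimplemented "Topdown
                 * Slots" event, so the availability check is skipped for the
                 * last table entry.
                 */
                if (idx < 7 && !(available_mask & BIT(idx)))
                        return PERF_COUNT_HW_MAX + 1;   /* reject: no perf_event */

                return idx;     /* illustrative; real code returns the mapped id */
        }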
  3. 15 January 2022, 9 commits
  4. 08 January 2022, 4 commits
  5. 07 January 2022, 10 commits
    • KVM: SVM: include CR3 in initial VMSA state for SEV-ES guests · 405329fc
      Committed by Michael Roth
      Normally guests will set up CR3 themselves, but some guests, such as
      kselftests, and potentially CONFIG_PVH guests, rely on being booted
      with paging enabled and CR3 initialized to a pre-allocated page table.
      
      Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest
      VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this
      is too late, since it will have switched over to using the VMSA page
      prior to that point, with the VMSA CR3 copied from the VMCB initial
      CR3 value: 0.
      
      Address this by syncing the CR3 value into the VMCB save area
      immediately when KVM_SET_SREGS* is issued so it will find its way into
      the initial VMSA.
      Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-Id: <20211216171358.61140-10-michael.roth@amd.com>
      [Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the
       new hook. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      405329fc
    • KVM: VMX: Provide vmread version using asm-goto-with-outputs · 907d1393
      Committed by Peter Zijlstra
      Use asm-goto-output for smaller fast path code.
      
      Message-Id: <YbcbbGW2GcMx6KpD@hirez.programming.kicks-ass.net>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      907d1393
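      A stand-alone illustration of the compiler feature being used, assuming
      GCC 11+ or a recent Clang (asm goto with output operands). The
      instruction sequence is a stand-in, not the actual vmread wrapper; the
      point is that the fast path gets the output value and the error path a
      direct branch, with no separate flag and test.

        /* Decrement x; jump straight to the error label when x was zero. */
        static int dec_nonzero(unsigned int x, unsigned int *out)
        {
                asm goto("sub $1, %[v]\n\t"
                         "jc %l[underflow]"      /* borrow => input was zero */
                         : [v] "+r" (x)
                         :
                         : "cc"
                         : underflow);
                *out = x;
                return 0;

        underflow:
                return -1;
        }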
    • KVM: x86: Fix wall clock writes in Xen shared_info not to mark page dirty · 55749769
      Committed by David Woodhouse
      When dirty ring logging is enabled, any dirty logging without an active
      vCPU context will cause a kernel oops. But we've already declared that
      the shared_info page doesn't get dirty tracking anyway, since it would
      be kind of insane to mark it dirty every time we deliver an event channel
      interrupt. Userspace is supposed to just assume it's always dirty any
      time a vCPU can run or event channels are routed.
      
      So stop using the generic kvm_write_wall_clock() and just write directly
      through the gfn_to_pfn_cache that we already have set up.
      
      We can make kvm_write_wall_clock() static in x86.c again now, but let's
      not remove the 'sec_hi_ofs' argument even though it's not used yet. At
      some point we *will* want to use that for KVM guests too.
      
      Fixes: 629b5348 ("KVM: x86/xen: update wallclock region")
      Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-6-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      55749769
    • KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery · 14243b38
      Committed by David Woodhouse
      This adds basic support for delivering 2 level event channels to a guest.
      
      Initially, it only supports delivery via the IRQ routing table, triggered
      by an eventfd. In order to do so, it has a kvm_xen_set_evtchn_fast()
      function which will use the pre-mapped shared_info page if it already
      exists and is still valid, while the slow path through the irqfd_inject
      workqueue will remap the shared_info page if necessary.
      
      It sets the bits in the shared_info page but not the vcpu_info; that is
      deferred to __kvm_xen_has_interrupt() which raises the vector to the
      appropriate vCPU.
      
      Add a 'verbose' mode to xen_shinfo_test while adding test cases for this.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-5-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      14243b38
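      A simplified sketch of 2-level event channel delivery into the
      shared_info page, following the Xen ABI shape (a global pending bitmap, a
      per-port mask bitmap and a per-vCPU selector word); the helper and
      parameter names are illustrative.

        /* Set the port's pending bit; signal the vCPU only if the port is unmasked. */
        static bool deliver_evtchn_sketch(unsigned long *pending, unsigned long *mask,
                                          unsigned long *pending_sel, unsigned int port)
        {
                set_bit(port, pending);

                if (test_bit(port, mask))
                        return false;           /* recorded, but not signalled */

                /* One selector bit covers BITS_PER_LONG ports. */
                set_bit(port / BITS_PER_LONG, pending_sel);
                return true;                    /* caller then kicks the target vCPU */
        }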
    • KVM: x86/xen: Maintain valid mapping of Xen shared_info page · 1cfc9c4b
      Committed by David Woodhouse
      Use the newly reinstated gfn_to_pfn_cache to maintain a kernel mapping
      of the Xen shared_info page so that it can be accessed in atomic context.
      
      Note that we do not participate in dirty tracking for the shared info
      page and we do not explicitly mark it dirty every single time we deliver
      an event channel interrupt. We wouldn't want to do that even if we *did*
      have a valid vCPU context with which to do so.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-4-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1cfc9c4b
    • KVM: Reinstate gfn_to_pfn_cache with invalidation support · 982ed0de
      Committed by David Woodhouse
      This can be used in two modes. There is an atomic mode where the cached
      mapping is accessed while holding the rwlock, and a mode where the
      physical address is used by a vCPU in guest mode.
      
      For the latter case, an invalidation will wake the vCPU with the new
      KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
      caches it still needs to access before entering guest mode again.
      
      Only one vCPU can be targeted by the wake requests; it's simple enough
      to make it wake all vCPUs or even a mask but I don't see a use case for
      that additional complexity right now.
      
      Invalidation happens from the invalidate_range_start MMU notifier, which
      needs to be able to sleep in order to wake the vCPU and wait for it.
      
      This means that revalidation potentially needs to "wait" for the MMU
      operation to complete and the invalidate_range_end notifier to be
      invoked. Like the vCPU when it takes a page fault in that period, we
      just spin — fixing that in a future patch by implementing an actual
      *wait* may be another part of shaving this particularly hirsute yak.
      
      As noted in the comments in the function itself, the only case where
      the invalidate_range_start notifier is expected to be called *without*
      being able to sleep is when the OOM reaper is killing the process. In
      that case, we expect the vCPU threads already to have exited, and thus
      there will be nothing to wake, and no reason to wait. So we clear the
      KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
      if there actually *was* anything to wake up.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-3-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      982ed0de
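      A rough sketch of the atomic-mode usage pattern described above. The
      structure and helpers below are hypothetical, simplified stand-ins for
      the cache API, not its actual names; the point is the lock/check/refresh
      loop.

        struct gpc_sketch {                 /* hypothetical, simplified cache */
                rwlock_t lock;
                void *khva;                 /* kernel mapping of the cached page */
                bool valid;                 /* cleared on MMU notifier invalidation */
        };

        void gpc_refresh(struct gpc_sketch *gpc);   /* hypothetical re-map helper */

        static void read_cached_page_sketch(struct gpc_sketch *gpc, void *dst, size_t len)
        {
                for (;;) {
                        read_lock(&gpc->lock);
                        if (gpc->valid) {
                                memcpy(dst, gpc->khva, len);
                                read_unlock(&gpc->lock);
                                return;
                        }
                        read_unlock(&gpc->lock);

                        /* Re-map outside the rwlock and retry; an invalidation
                         * can race with the refresh, hence the loop. */
                        gpc_refresh(gpc);
                }
        }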
    • KVM: x86: Update vPMCs when retiring branch instructions · 018d70ff
      Committed by Eric Hankland
      When KVM retires a guest branch instruction through emulation,
      increment any vPMCs that are configured to monitor "branch
      instructions retired," and update the sample period of those counters
      so that they will overflow at the right time.
      Signed-off-by: Eric Hankland <ehankland@google.com>
      [jmattson:
        - Split the code to increment "branch instructions retired" into a
          separate commit.
        - Moved/consolidated the calls to kvm_pmu_trigger_event() in the
          emulation of VMLAUNCH/VMRESUME to accommodate the evolution of
          that code.
      ]
      Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20211130074221.93635-7-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      018d70ff
    • KVM: x86: Update vPMCs when retiring instructions · 9cd803d4
      Committed by Eric Hankland
      When KVM retires a guest instruction through emulation, increment any
      vPMCs that are configured to monitor "instructions retired," and
      update the sample period of those counters so that they will overflow
      at the right time.
      Signed-off-by: Eric Hankland <ehankland@google.com>
      [jmattson:
        - Split the code to increment "branch instructions retired" into a
          separate commit.
        - Added 'static' to kvm_pmu_incr_counter() definition.
        - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
          PERF_EVENT_STATE_ACTIVE.
      ]
      Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
      Signed-off-by: NJim Mattson <jmattson@google.com>
      [likexu:
        - Drop checks for pmc->perf_event or event state or event type
        - Increase a counter once its umask bits and the first 8 select bits are matched
        - Rewrite kvm_pmu_incr_counter() with a less invasive approach to the host perf;
        - Rename kvm_pmu_record_event to kvm_pmu_trigger_event;
        - Add counter enable and CPL check for kvm_pmu_trigger_event();
      ]
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20211130074221.93635-6-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9cd803d4
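      A condensed sketch of the counter bump described in the last two entries:
      walk the general-purpose counters and advance any whose unit mask and low
      eight event-select bits match the retired-instructions (or retired-branch)
      encoding. The structures and masks are simplified stand-ins.

        #define SEL_EVENT_MASK   0x00ffULL      /* low 8 event-select bits */
        #define SEL_UMASK_MASK   0xff00ULL      /* unit mask bits          */
        #define SEL_ENABLE       (1ULL << 22)

        struct pmc_sketch {
                u64 eventsel;
                u64 counter;
        };

        static void pmu_trigger_event_sketch(struct pmc_sketch *pmcs, int nr_pmcs,
                                             u64 retired_event)
        {
                u64 want = retired_event & (SEL_EVENT_MASK | SEL_UMASK_MASK);
                int i;

                for (i = 0; i < nr_pmcs; i++) {
                        u64 sel = pmcs[i].eventsel;

                        if (!(sel & SEL_ENABLE))
                                continue;
                        if ((sel & (SEL_EVENT_MASK | SEL_UMASK_MASK)) != want)
                                continue;

                        /*
                         * The real code also re-syncs the perf_event sample
                         * period and raises a PMI on overflow.
                         */
                        pmcs[i].counter++;
                }
        }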
    • KVM: x86/pmu: Add pmc->intr to refactor kvm_perf_overflow{_intr}() · 40ccb96d
      Committed by Like Xu
      Depending on whether intr should be triggered or not, KVM registers
      two different event overflow callbacks in the perf_event context.
      
      The code skeleton of these two functions is very similar, so
      pmc->intr can be stored in the pmc by pmc_reprogram_counter(),
      which yields a smaller instruction footprint and is friendlier to the
      micro-architecture's branch predictor.
      
      __kvm_perf_overflow() can be called in non-NMI contexts,
      and a flag is needed to distinguish the caller context and thus
      avoid a check on kvm_is_in_guest(), otherwise we might get
      warnings from suspicious RCU or check_preemption_disabled().
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20211130074221.93635-5-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      40ccb96d
    • KVM: x86/pmu: Reuse pmc_perf_hw_id() and drop find_fixed_event() · 6ed1298e
      Committed by Like Xu
      Since we set the same semantic event value for the fixed counter in
      pmc->eventsel, returning the perf_hw_id for the fixed counter via
      find_fixed_event() can be painlessly replaced by pmc_perf_hw_id()
      with the help of pmc_is_fixed() check.
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20211130074221.93635-4-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ed1298e