1. 02 Apr 2022: 5 commits
    • KVM: x86: Add wrappers for setting/clearing APICv inhibits · 320af55a
      Sean Christopherson committed
      Add set/clear wrappers for toggling APICv inhibits to make the call sites
      more readable, and opportunistically rename the inner helpers to align
      with the new wrappers and to make them more readable as well.  Invert the
      flag from "activate" to "set"; activate is painfully ambiguous as it's
      not obvious if the inhibit is being activated, or if APICv is being
      activated, in which case the inhibit is being deactivated.
      
      For the functions that take @set, swap the order of the inhibit reason
      and @set so that the call sites are visually similar to those that bounce
      through the wrapper.
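
      As a rough illustration of the wrapper pattern (a sketch only; the
      wrapper and inner-helper names below follow the description above and
      may not match the final code exactly):

        static void kvm_set_apicv_inhibit(struct kvm *kvm,
                                          enum kvm_apicv_inhibit reason)
        {
                kvm_set_or_clear_apicv_inhibit(kvm, reason, true);
        }

        static void kvm_clear_apicv_inhibit(struct kvm *kvm,
                                            enum kvm_apicv_inhibit reason)
        {
                /* @set is false: the inhibit is cleared, not APICv. */
                kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
        }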
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Make APICv inhibit reasons an enum and cleanup naming · 7491b7b2
      Sean Christopherson committed
      Use an enum for the APICv inhibit reasons; there is no meaning behind
      their values and they most definitely are not "unsigned longs".  Rename
      the various params to "reason" for consistency and clarity ("inhibit"
      may be mistaken for a command, i.e. inhibit APICv, instead of the reason
      that is getting toggled/checked).
      
      No functional change intended.
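
      For illustration only, the shape of the change is roughly the
      following; the enumerator names are examples, the real list lives in
      arch/x86/include/asm/kvm_host.h, and apicv_inhibited() is a
      hypothetical helper:

        enum kvm_apicv_inhibit {
                APICV_INHIBIT_REASON_DISABLE,
                APICV_INHIBIT_REASON_HYPERV,
                APICV_INHIBIT_REASON_BLOCKIRQ,
        };

        /* A reason is a bit index, so its numeric value has no meaning. */
        static bool apicv_inhibited(unsigned long inhibit_reasons,
                                    enum kvm_apicv_inhibit reason)
        {
                return !!(inhibit_reasons & BIT(reason));
        }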
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Handle implicit supervisor access with SMAP · 4f4aa80e
      Lai Jiangshan committed
      There are two kinds of implicit supervisor access:
      	implicit supervisor access when CPL = 3
      	implicit supervisor access when CPL < 3
      
      Current permission_fault() handles only the first kind for SMAP.
      
      But if the access is implicit when SMAP is on, data may be neither read
      from nor written to any user-mode address, regardless of the current CPL.
      
      So the second kind should also be supported.
      
      The first kind can be detected via CPL and access mode: if it is a
      supervisor access and CPL = 3, it must be an implicit supervisor access.
      
      But it is not possible to detect the second kind without extra
      information, so this patch adds an artificial PFERR_IMPLICIT_ACCESS
      bit into @access.  This extra information also works for the first
      kind, so the logic is changed to use it for both cases.
      
      The value of PFERR_IMPLICIT_ACCESS is deliberately chosen to be bit 48,
      which is in the most significant 16 bits of the u64 and thus less
      likely to be forced to change by future hardware use of the error-code
      bits.
      
      This patch removes the call to ->get_cpl(), since the access mode is now
      determined by @access.  Not only does this save a function call, it also
      removes confusion when permissions are checked for nested TDP: nested
      TDP should have no SMAP checking, nor should L2's CPL have any bearing
      on it.  The original code works only because the walk is always a user
      walk for NPT, and the SMAP fault bit is never set for EPT in
      update_permission_bitmask().
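
      A condensed sketch of the resulting SMAP check (the real logic lives in
      permission_fault() in arch/x86/kvm/mmu.h; smap_applies() below is a
      hypothetical helper used only to illustrate the idea):

        #define PFERR_IMPLICIT_ACCESS   BIT_ULL(48)

        /*
         * SMAP always applies to an implicit supervisor access, and to an
         * explicit one only when EFLAGS.AC is clear; CPL is not consulted.
         */
        static bool smap_applies(unsigned long rflags, u64 access)
        {
                u64 implicit = access & PFERR_IMPLICIT_ACCESS;

                return ((rflags & X86_EFLAGS_AC) | implicit) != X86_EFLAGS_AC;
        }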
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Change the type of access u32 to u64 · 5b22bbe7
      Lai Jiangshan committed
      Change the type of access u32 to u64 for FNAME(walk_addr) and
      ->gva_to_gpa().
      
      The kinds of accesses are usually combinations of UWX, and VMX/SVM's
      nested paging adds a new factor of access: is it an access for a guest
      page table or for a final guest physical address.
      
      And SMAP adds another factor for supervisor accesses: whether the access
      is explicit or implicit.
      
      So it is better for @access in FNAME(walk_addr) and ->gva_to_gpa() to
      carry all of this information for the walk.
      
      Although @access (u32) has enough bits to encode all of these kinds,
      this patch extends it to u64 (illustrated by the sketch after this list):
      	o Extra bits will be in the higher 32 bits, so that we can
      	  easily obtain the traditional access mode (UWX) by converting
      	  it to u32.
      	o Reuse the value for the access kind defined by SVM's nested
      	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
      	  @error_code in kvm_handle_page_fault().
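
      A minimal sketch of the truncation point above (access_to_uwx() is a
      hypothetical helper; PFERR_GUEST_FINAL_MASK is SVM's existing bit 32
      definition):

        /* The extra access kinds live above bit 31, so truncating to u32
         * recovers just the traditional UWX error-code bits. */
        static u32 access_to_uwx(u64 access)
        {
                return (u32)access;     /* drops e.g. PFERR_GUEST_FINAL_MASK */
        }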
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: propagate alloc_workqueue failure · a1a39128
      Paolo Bonzini committed
      If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
      to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
      kvm_arch_init_vm also has to undo all the initialization, so
      group all the MMU initialization code at the beginning and
      handle cleaning up of kvm_page_track_init.
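
      Roughly, the propagation amounts to the following (a sketch; the real
      code is in the TDP MMU init path and its callers, and the workqueue
      flags shown are illustrative):

        wq = alloc_workqueue("kvm", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
        if (!wq)
                return -ENOMEM;         /* callers can now unwind */

        kvm->arch.tdp_mmu_zap_wq = wq;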
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 21 Mar 2022: 2 commits
    • KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2 · 6d849191
      Oliver Upton committed
      KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not
      advertise the set of quirks which may be disabled to userspace, so it is
      impossible to predict the behavior of KVM. Worse yet,
      KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning
      it fails to reject attempts to set invalid quirk bits.
      
      The only valid workaround for the quirky quirks API is to add a new CAP.
      Actually advertise the set of quirks that can be disabled to userspace
      so it can predict KVM's behavior. Reject values for cap->args[0] that
      contain invalid bits.
      
      Finally, add documentation for the new capability and describe the
      existing quirks.
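
      The gist of the new capability's handling, as a hedged sketch
      (disable_quirks2() is a hypothetical helper; KVM_X86_VALID_QUIRKS
      stands in for the advertised quirk mask):

        static int disable_quirks2(struct kvm *kvm, u64 requested)
        {
                /* Unlike KVM_CAP_DISABLE_QUIRKS, unknown bits are rejected. */
                if (requested & ~KVM_X86_VALID_QUIRKS)
                        return -EINVAL;

                kvm->arch.disabled_quirks |= requested;
                return 0;
        }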
      Signed-off-by: Oliver Upton <oupton@google.com>
      Message-Id: <20220301060351.442881-5-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Require const tsc for RT · 5e17b2ee
      Thomas Gleixner committed
      A non-constant TSC is a nightmare on bare metal already, but with
      virtualization it becomes a complete disaster because the workarounds
      are horrible latency-wise. This is also a prerequisite for running RT in
      a guest on top of an RT host.
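
      The enforcement amounts to something like the following at init time
      (a sketch; exact placement and wording may differ):

        if (IS_ENABLED(CONFIG_PREEMPT_RT) && !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
                pr_err("RT requires X86_FEATURE_CONSTANT_TSC\n");
                return -EOPNOTSUPP;
        }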
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Message-Id: <Yh5eJSG19S2sjZfy@linutronix.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 08 Mar 2022: 1 commit
  4. 02 Mar 2022: 1 commit
    • KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run · 8d25b7be
      Paolo Bonzini committed
      kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two
      places, namely vcpu_run and post_kvm_run_save, and a third is actually
      needed around the call to vcpu->arch.complete_userspace_io to avoid
      the following splat:
      
        WARNING: suspicious RCU usage
        arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage!
        other info that might help us debug this:
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by CPU 28/KVM/370841:
        #0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm]
        Call Trace:
         <TASK>
         dump_stack_lvl+0x59/0x73
         reprogram_fixed_counter+0x15d/0x1a0 [kvm]
         kvm_pmu_trigger_event+0x1a3/0x260 [kvm]
         ? free_moved_vector+0x1b4/0x1e0
         complete_fast_pio_in+0x8a/0xd0 [kvm]
      
      This splat is not at all unexpected, since complete_userspace_io callbacks
      can execute similar code to vmexits.  For example, SVM with nrips=false
      will call into the emulator from svm_skip_emulated_instruction().
      
      While it's tempting to never acquire kvm->srcu for an uninitialized vCPU,
      practically speaking there's no penalty to acquiring kvm->srcu "early"
      as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU.  On
      the other hand, seemingly innocuous helpers like kvm_apic_accept_events()
      and sync_regs() can theoretically reach code that might access
      SRCU-protected data structures, e.g. sync_regs() can trigger a forced
      exit from nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events().
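
      Schematically, the ioctl now brackets everything, including the
      complete_userspace_io callback, in one SRCU read-side critical section
      (a sketch, not the exact diff):

        vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);

        if (unlikely(vcpu->arch.complete_userspace_io)) {
                int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;

                vcpu->arch.complete_userspace_io = NULL;
                r = cui(vcpu);          /* may touch SRCU-protected data */
        }

        srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);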
      Reported-by: Like Xu <likexu@tencent.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 01 Mar 2022: 5 commits
  6. 25 Feb 2022: 5 commits
  7. 19 Feb 2022: 5 commits
  8. 18 Feb 2022: 1 commit
  9. 17 Feb 2022: 3 commits
  10. 12 Feb 2022: 1 commit
    • KVM: SVM: fix race between interrupt delivery and AVIC inhibition · 66fa226c
      Maxim Levitsky committed
      If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
      inhibited, it might read a stale value of vcpu->arch.apicv_active
      which can lead to the target vCPU not noticing the interrupt.
      
      To fix this use load-acquire/store-release so that, if the target vCPU
      is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
      AVIC.  If AVIC has been disabled in the meantime, proceed with the
      KVM_REQ_EVENT-based delivery.
      
      Incomplete IPI vmexit has the same races as svm_deliver_avic_intr, and
      in fact it can be handled in exactly the same way; the only difference
      lies in who has set IRR, whether svm_deliver_interrupt or the processor.
      Therefore, svm_complete_interrupt_delivery can be used to fix incomplete
      IPI vmexits as well.
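
      The ordering relies on a store-release on the inhibiting side pairing
      with a load-acquire on the delivery side, roughly as follows (field
      accesses and the surrounding code are illustrative):

        /* Inhibiting side: publish the new state before kicking the vCPU. */
        smp_store_release(&vcpu->arch.apicv_active, false);
        kvm_make_request(KVM_REQ_EVENT, vcpu);
        kvm_vcpu_kick(vcpu);

        /* Delivery side: if the release is visible, fall back to the
         * KVM_REQ_EVENT-based path instead of using the AVIC doorbell. */
        if (!smp_load_acquire(&vcpu->arch.apicv_active)) {
                kvm_make_request(KVM_REQ_EVENT, vcpu);
                kvm_vcpu_kick(vcpu);
        }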
      Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. 11 Feb 2022: 11 commits
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG · cb00a70b
      David Matlack committed
      When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
      write-protected when dirty logging is enabled on the memslot. Instead
      they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
      the first time and only for the specific sub-region being cleared.
      
      Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
      write-protecting to avoid causing write-protection faults on vCPU
      threads. This also allows userspace to smear the cost of huge page
      splitting across multiple ioctls, rather than splitting the entire
      memslot as is the case when initially-all-set is not used.
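
      From userspace, the affected ioctl is KVM_CLEAR_DIRTY_LOG; a hedged
      usage sketch (slot_id, bitmap and the page counts are example values):

        struct kvm_clear_dirty_log clr = {
                .slot = slot_id,
                .first_page = 0,        /* must be 64-aligned */
                .num_pages = 4096,
                .dirty_bitmap = bitmap, /* bits set here get cleared */
        };

        /* Huge pages covering this sub-region are now split before being
         * write-protected. */
        ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clr);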
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled · a3fe5dbd
      David Matlack committed
      When dirty logging is enabled without initially-all-set, try to split
      all huge pages in the memslot down to 4KB pages so that vCPUs do not
      have to take expensive write-protection faults to split huge pages.
      
      Eager page splitting is best-effort only. This commit only adds support
      for the TDP MMU, and even there splitting may fail due to out-of-memory
      conditions. Failure to split a huge page is fine from a correctness
      standpoint because KVM will always follow up splitting by
      write-protecting any remaining huge pages.
      
      Eager page splitting moves the cost of splitting huge pages off of the
      vCPU threads and onto the thread enabling dirty logging on the memslot.
      This is useful because:
      
       1. Splitting on the vCPU thread interrupts vCPU execution and is
          disruptive to customers whereas splitting on VM ioctl threads can
          run in parallel with vCPU execution.
      
       2. Splitting all huge pages at once is more efficient because it does
          not require performing VM-exit handling or walking the page table for
          every 4KiB page in the memslot, and greatly reduces the amount of
          contention on the mmu_lock.
      
      For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
      per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
      all of their memory after dirty logging is enabled decreased by 95% from
      2.94s to 0.14s.
      
      Eager Page Splitting is over 100x more efficient than the current
      implementation of splitting on fault under the read lock. For example,
      taking the same workload as above, Eager Page Splitting reduced the CPU
      required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
      * 96 vCPU threads) to only 1.55 CPU-seconds.
      
      Eager page splitting does increase the amount of time it takes to enable
      dirty logging, since all huge pages must be split up front. For example, the time
      it took to enable dirty logging in the 96GiB region of the
      aforementioned test increased from 0.001s to 1.55s.
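
      For reference, the operation whose latency grows is the memslot update
      that turns on dirty logging; a hedged userspace sketch (slot_id, gpa,
      size and hva are example values):

        struct kvm_userspace_memory_region region = {
                .slot = slot_id,
                .flags = KVM_MEM_LOG_DIRTY_PAGES,       /* triggers eager splitting */
                .guest_phys_addr = gpa,
                .memory_size = size,
                .userspace_addr = (unsigned long)hva,
        };

        ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);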
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use more verbose names for mem encrypt kvm_x86_ops hooks · 03d004cd
      Sean Christopherson committed
      Use slightly more verbose names for the so called "memory encrypt",
      a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
      super short kvm_x86_ops names and SVM's more verbose, but non-conforming
      names.  This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP()
      to fill svm_x86_ops.
      
      Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
      reflect its true nature, as it really is a full fledged ioctl() of its
      own.  Ideally, the hook would be named confidential_vm_ioctl() or so, as
      the ioctl() is a gateway to more than just memory encryption, and because
      its underlying purpose is to support Confidential VMs, which can be
      provided without memory encryption, e.g. if the TCB of the guest includes
      the host kernel but not host userspace, or by isolation in hardware
      without encrypting memory.  But diverging from KVM_MEMORY_ENCRYPT_OP even
      further is undesirable, and short of creating aliases for all related
      ioctl()s, which introduces a different flavor of divergence, KVM is stuck
      with the nomenclature.
      
      Defer renaming SVM's functions to a future commit as there are additional
      changes needed to make SVM fully conforming and to match reality (looking
      at you, svm_vm_copy_asid_from()).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move get_cs_db_l_bits() helper to SVM · 872e0c53
      Sean Christopherson committed
      Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that
      its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a
      superfluous export from KVM x86.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use static_call() for copy/move encryption context ioctls() · 7ad02ef0
      Sean Christopherson committed
      Define and use static_call()s for .vm_{copy,move}_enc_context_from(),
      mostly so that the op is defined in kvm-x86-ops.h.  This will allow using
      KVM_X86_OP in vendor code to wire up the implementation.  Any performance
      gains eked out by using static_call() are a happy bonus and not the
      primary motivation.
      
      Opportunistically refactor the code to reduce indentation and keep line
      lengths reasonable, and to be consistent when wrapping versus running
      a bit over the 80 char soft limit.
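
      Illustratively, a call site then reads roughly as follows (the argument
      list is a sketch):

        r = static_call(kvm_x86_vm_copy_enc_context_from)(kvm, cap->args[0]);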
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Unexport kvm_x86_ops · dfc4e6ca
      Sean Christopherson committed
      Drop the export of kvm_x86_ops now that it is no longer referenced by SVM or
      VMX.  Disallowing access to kvm_x86_ops is very desirable as it prevents
      vendor code from incorrectly modifying hooks after they have been set by
      kvm_arch_hardware_setup(), and more importantly after each function's
      associated static_call key has been updated.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Rename kvm_x86_ops pointers to align w/ preferred vendor names · e27bc044
      Sean Christopherson committed
      Rename a variety of kvm_x86_op function pointers so that preferred name
      for vendor implementations follows the pattern <vendor>_<function>, e.g.
      rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run().  This will
      allow vendor implementations to be wired up via the KVM_X86_OP macro.
      
      In many cases, VMX and SVM "disagree" on the preferred name, though in
      reality it's VMX and x86 that disagree as SVM blindly prepended _svm to
      the kvm_x86_ops name.  Justification for using the VMX nomenclature:
      
        - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
          event that has already been "set" in e.g. the vIRR.  SVM's relevant
          VMCB field is even named event_inj, and KVM's stat is irq_injections.
      
        - prepare_guest_switch => prepare_switch_to_guest because the former is
          ambiguous, e.g. it could mean switching between multiple guests,
          switching from the guest to host, etc...
      
        - update_pi_irte => pi_update_irte to match the rest of VMX's posted
          interrupt naming scheme, which is vmx_pi_<blah>().
      
        - start_assignment => pi_start_assignment to again follow VMX's posted
          interrupt naming scheme, and to provide context for what bit of code
          might care about an otherwise undescribed "assignment".
      
      The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
      Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
      wrong.  x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
      variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
      appropriate name for the Hyper-V hooks would be flush_remote_tlbs.  Leave
      that change for another time as the Hyper-V hooks always start as NULL,
      i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
      names requires an astounding amount of churn.
      
      VMX and SVM function names are intentionally left as is to minimize the
      diff.  Both VMX and SVM will need to rename even more functions in order
      to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
      inevitable.
      
      No functional change intended.
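
      The <vendor>_<function> convention matters because kvm-x86-ops.h is an
      X-macro list; a hedged sketch of the kind of wiring it enables (this is
      illustrative only, not the actual vmx.c code):

        #define KVM_X86_OP(func) .func = vmx_##func,

        static struct kvm_x86_ops vmx_x86_ops __initdata = {
                KVM_X86_OP(vcpu_run)    /* expands to .vcpu_run = vmx_vcpu_run, */
                KVM_X86_OP(inject_irq)  /* expands to .inject_irq = vmx_inject_irq, */
        };
        #undef KVM_X86_OP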
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Drop export for .tlb_flush_current() static_call key · feee3d9d
      Sean Christopherson committed
      Remove the export of kvm_x86_tlb_flush_current() as there are no longer
      any users outside of common x86 code.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Remove unused "flags" of kvm_pv_kick_cpu_op() · 9d68c6f6
      Jinrong Liang committed
      The "unsigned long flags" parameter of  kvm_pv_kick_cpu_op() is not used,
      so remove it. No functional change intended.
      Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
      Message-Id: <20220125095909.38122-20-cloudliang@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Remove unused "vcpu" of kvm_scale_tsc() · 62711e5a
      Jinrong Liang committed
      The "struct kvm_vcpu *vcpu" parameter of kvm_scale_tsc() is not used,
      so remove it. No functional change intended.
      Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
      Message-Id: <20220125095909.38122-18-cloudliang@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Skip APICv update if APICv is disabled at the module level · f1575642
      Sean Christopherson committed
      Bail from the APICv update paths _before_ taking apicv_update_lock if
      APICv is disabled at the module level.  kvm_request_apicv_update() in
      particular is invoked from multiple paths that can be reached without
      APICv being enabled, e.g. svm_enable_irq_window(), and taking the
      rw_sem for write when APICv is disabled may introduce unnecessary
      contention and stalls.
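
      The bail-out is conceptually just the following (a sketch; enable_apicv
      is the module-level knob and apicv_update_lock is the rw_sem mentioned
      above):

        if (!enable_apicv)
                return;

        down_write(&kvm->arch.apicv_update_lock);
        /* ... update the inhibit reasons ... */
        up_write(&kvm->arch.apicv_update_lock);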
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211208015236.1616697-25-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>