1. 05 Apr, 2022 1 commit
    • KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 1d0e8480
      Committed by Sean Christopherson
      Resolve nx_huge_pages to true/false when kvm.ko is loaded; leaving it as
      -1 is technically undefined behavior when its value is read out by
      param_get_bool(), as boolean values are supposed to be '0' or '1'.
      
      Alternatively, KVM could define a custom getter for the param, but the
      auto value doesn't depend on the vendor module in any way, and printing
      "auto" would be unnecessarily unfriendly to the user.
      
      In addition to fixing the undefined behavior, resolving the auto value
      also fixes the scenario where the auto value resolves to N and no vendor
      module is loaded.  Previously, -1 would result in Y being printed even
      though KVM would ultimately disable the mitigation.
      
      Rename the existing MMU module init/exit helpers to clarify that they're
      invoked with respect to the vendor module, and add comments to document
      why KVM has two separate "module init" flows.
      
        =========================================================================
        UBSAN: invalid-load in kernel/params.c:320:33
        load of value 255 is not a valid value for type '_Bool'
        CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         ubsan_epilogue+0x5/0x40
         __ubsan_handle_load_invalid_value.cold+0x43/0x48
         param_get_bool.cold+0xf/0x14
         param_attr_show+0x55/0x80
         module_attr_show+0x1c/0x30
         sysfs_kf_seq_show+0x93/0xc0
         seq_read_iter+0x11c/0x450
         new_sync_read+0x11b/0x1a0
         vfs_read+0xf0/0x190
         ksys_read+0x5f/0xe0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        =========================================================================
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Cc: stable@vger.kernel.org
      Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220331221359.3912754-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
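The nx_huge_pages fix above can be pictured with a small userspace sketch. It assumes a tri-state integer where -1 means "auto"; the helper names and the cpu_has_itlb_multihit argument are hypothetical stand-ins, not KVM's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Tri-state parameter: -1 means "auto", 0 means off, 1 means on. */
int nx_huge_pages = -1;

/*
 * Resolve "auto" once, at (hypothetical) module load time, so that any
 * later reader only ever sees 0 or 1. cpu_has_itlb_multihit stands in
 * for whatever check decides the default.
 */
void resolve_nx_huge_pages(bool cpu_has_itlb_multihit)
{
	assert(nx_huge_pages >= -1 && nx_huge_pages <= 1);
	if (nx_huge_pages == -1)
		nx_huge_pages = cpu_has_itlb_multihit ? 1 : 0;
}

/* After resolution, the value can safely be reported as a boolean. */
bool nx_huge_pages_enabled(void)
{
	return nx_huge_pages == 1;
}
```

An explicit 0 or 1 chosen by the user is left untouched; only the -1 placeholder is replaced.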
  2. 02 Apr, 2022 9 commits
    • KVM: x86: SVM: fix tsc scaling when the host doesn't support it · 88099313
      Committed by Maxim Levitsky
      It was decided that when TSC scaling is not supported,
      the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0'
      value.
      
      However in this case kvm_max_tsc_scaling_ratio is not set,
      which breaks various assumptions.
      
      Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of
      host support.  For consistency, do the same for VMX.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322172449.235575-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr · ac8d6cad
      Committed by Hou Wenlong
      If an MSR access is rejected by MSR filtering,
      kvm_set_msr()/kvm_get_msr() return KVM_MSR_RET_FILTERED,
      and that return value is only handled properly for rdmsr/wrmsr.
      However, some instruction emulation and state transitions also
      use kvm_set_msr()/kvm_get_msr() to access MSRs, and may trigger
      unexpected results if the access is rejected. E.g. RDPID
      emulation would inject a #UD, but RDPID wouldn't cause an exit
      when RDPID is supported in hardware and ENABLE_RDTSCP is set.
      It would also cause failures when loading MSRs at nested entry/exit.
      Since MSR filtering is based on the MSR bitmap, it is better to only
      do MSR filtering for rdmsr/wrmsr.
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
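One way to picture the split described above is a pair of accessors, where only the guest-initiated rdmsr path consults the filter. This is a toy userspace sketch: the function names, return codes, and the filter itself are assumptions made for illustration, not KVM's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_RET_OK         0
#define MSR_RET_FILTERED   (-1)         /* stand-in for KVM_MSR_RET_FILTERED */
#define FILTERED_MSR_INDEX 0xc0000103u  /* MSR_TSC_AUX, denied by the toy filter */

/* Toy filter: userspace has denied exactly one MSR. */
bool msr_allowed(uint32_t index)
{
	return index != FILTERED_MSR_INDEX;
}

/* Unfiltered accessor, meant for emulation and nested entry/exit paths. */
int get_msr(uint32_t index, uint64_t *val)
{
	assert(val);
	*val = index;	/* dummy value so the sketch has something to return */
	return MSR_RET_OK;
}

/* Filtered accessor, meant only for guest-initiated rdmsr. */
int get_msr_with_filter(uint32_t index, uint64_t *val)
{
	if (!msr_allowed(index))
		return MSR_RET_FILTERED;
	return get_msr(index, val);
}
```

With this shape, internal callers never see the "filtered" return code, so they need no special handling for it.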
    • KVM: x86/emulator: Emulate RDPID only if it is enabled in guest · a836839c
      Committed by Hou Wenlong
      When RDTSCP is supported but RDPID is not supported in the host,
      RDPID emulation is available. However, __kvm_get_msr() would
      only fail when both RDTSCP and RDPID are disabled in the guest, so
      the emulator wouldn't inject a #UD when RDPID is disabled but
      RDTSCP is enabled in the guest.
      
      Fixes: fb6d4d34 ("KVM: x86: emulate RDPID")
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Trace all APICv inhibit changes and capture overall status · 4f4c4a3e
      Committed by Sean Christopherson
      Trace all APICv inhibit changes instead of just those that result in
      APICv being (un)inhibited, and log the current state.  Debugging why
      APICv isn't working is frustrating as it's hard to see why APICv is still
      inhibited, and logging only the first inhibition means unnecessary onion
      peeling.
      
      Opportunistically drop the export of the tracepoint, it is not and should
      not be used by vendor code due to the need to serialize toggling via
      apicv_update_lock.
      
      Note, using the common flow means kvm_apicv_init() switched from atomic
      to non-atomic bitwise operations.  The VM is unreachable at init, so
      non-atomic is perfectly ok.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add wrappers for setting/clearing APICv inhibits · 320af55a
      Committed by Sean Christopherson
      Add set/clear wrappers for toggling APICv inhibits to make the call sites
      more readable, and opportunistically rename the inner helpers to align
      with the new wrappers and to make them more readable as well.  Invert the
      flag from "activate" to "set"; activate is painfully ambiguous as it's
      not obvious if the inhibit is being activated, or if APICv is being
      activated, in which case the inhibit is being deactivated.
      
      For the functions that take @set, swap the order of the inhibit reason
      and @set so that the call sites are visually similar to those that bounce
      through the wrapper.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
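The wrapper pattern described above might look roughly like the following userspace sketch. The struct, the reason names, and the helper names are all hypothetical; the point is only the shape: an inner helper taking (reason, set) and two thin wrappers that make call sites read unambiguously.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inhibit reasons; values only select a bit position. */
enum apicv_inhibit_reason {
	INHIBIT_REASON_DISABLED,
	INHIBIT_REASON_NESTED,
	INHIBIT_REASON_IRQWIN,
	NR_INHIBIT_REASONS,
};

struct vm {
	unsigned long inhibit_reasons;	/* one bit per active reason */
};

/* Inner helper: reason first, then @set, matching the wrapper call sites. */
void set_or_clear_inhibit(struct vm *vm, enum apicv_inhibit_reason reason,
			  bool set)
{
	assert(reason < NR_INHIBIT_REASONS);
	if (set)
		vm->inhibit_reasons |= 1UL << reason;
	else
		vm->inhibit_reasons &= ~(1UL << reason);
}

void set_inhibit(struct vm *vm, enum apicv_inhibit_reason reason)
{
	set_or_clear_inhibit(vm, reason, true);
}

void clear_inhibit(struct vm *vm, enum apicv_inhibit_reason reason)
{
	set_or_clear_inhibit(vm, reason, false);
}

/* APICv is usable only while no reason is set. */
bool apicv_activated(const struct vm *vm)
{
	return vm->inhibit_reasons == 0;
}
```

A call such as set_inhibit(vm, INHIBIT_REASON_NESTED) leaves no doubt that the inhibit is what is being turned on, which is the ambiguity the "activate" flag had.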
    • KVM: x86: Make APICv inhibit reasons an enum and cleanup naming · 7491b7b2
      Committed by Sean Christopherson
      Use an enum for the APICv inhibit reasons, there is no meaning behind
      their values and they most definitely are not "unsigned longs".  Rename
      the various params to "reason" for consistency and clarity (inhibit may
      be confused as a command, i.e. inhibit APICv, instead of the reason that
      is getting toggled/checked).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Handle implicit supervisor access with SMAP · 4f4aa80e
      Committed by Lai Jiangshan
      There are two kinds of implicit supervisor access:
      	implicit supervisor access when CPL = 3
      	implicit supervisor access when CPL < 3
      
      The current permission_fault() handles only the first kind for SMAP.
      
      But if the access is implicit when SMAP is on, data may be neither read
      from nor written to any user-mode address, regardless of the current CPL.
      
      So the second kind should also be supported.
      
      The first kind can be detected via CPL and access mode: if it is a
      supervisor access and CPL = 3, it must be an implicit supervisor access.
      
      But it is not possible to detect the second kind without extra
      information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
      into @access. This extra information also works for the first kind, so
      the logic is changed to use this information for both cases.
      
      The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48,
      which is in the most significant 16 bits of u64 and is less likely to be
      forced to change due to future hardware use.
      
      This patch removes the call to ->get_cpl() because the access mode is
      determined by @access.  Not only does this save a function call, it also
      removes confusion when the permission is checked for nested TDP.  Nested
      TDP shouldn't have SMAP checking, nor should L2's CPL have any bearing
      on it.  The original code works only because the walk is always a user
      walk for NPT and the SMAP fault is not set for EPT in
      update_permission_bitmask.
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
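The SMAP rule described above can be written down as a small predicate. This is a readability sketch, not KVM's permission_fault(): the macro and function names are assumptions, with ACCESS_USER placed at the x86 page-fault U/S bit (bit 2) and ACCESS_EXPLICIT at bit 48 to mirror the commit's choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ACCESS_USER     (1ULL << 2)	/* access performed in user mode */
#define ACCESS_EXPLICIT (1ULL << 48)	/* software-only "explicit access" flag */

/*
 * Returns true if SMAP forbids the data access. Only an explicit
 * supervisor access can be exempted by EFLAGS.AC; an implicit supervisor
 * access to a user page is blocked regardless of AC and CPL.
 */
bool smap_blocks_access(bool smap_enabled, bool page_is_user,
			uint64_t access, bool eflags_ac)
{
	bool supervisor = !(access & ACCESS_USER);
	bool explicit_access = (access & ACCESS_EXPLICIT) != 0;

	if (!smap_enabled || !page_is_user || !supervisor)
		return false;

	return !(explicit_access && eflags_ac);
}
```

Note that the predicate never looks at CPL: everything it needs is carried in the access value, which is the simplification the commit message describes.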
    • KVM: X86: Change the type of access u32 to u64 · 5b22bbe7
      Committed by Lai Jiangshan
      Change the type of access u32 to u64 for FNAME(walk_addr) and
      ->gva_to_gpa().
      
      The kinds of accesses are usually combinations of UWX, and VMX/SVM's
      nested paging adds a new factor of access: is it an access for a guest
      page table or for a final guest physical address.
      
      And SMAP relies on one more factor for supervisor access: explicit or
      implicit.
      
      So it is better for @access in FNAME(walk_addr) and ->gva_to_gpa() to
      include all of this information for the walk.
      
      Although @access(u32) has enough bits to encode all the kinds, this
      patch extends it to u64:
      	o Extra bits will be in the higher 32 bits, so that we can
      	  easily obtain the traditional access mode (UWX) by converting
      	  it to u32.
      	o Reuse the value for the access kind defined by SVM's nested
      	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
      	  @error_code in kvm_handle_page_fault().
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
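The layout argued for above, with extra flags in the upper 32 bits and the traditional bits recoverable by truncation, can be sketched as follows. The macro and function names are hypothetical; the bit positions follow the x86 page-fault error code for W/U/fetch and bits 32/33 for the nested-paging access kinds.

```c
#include <assert.h>
#include <stdint.h>

/* Traditional access bits live in the low 32 bits. */
#define ACCESS_WRITE       (1ULL << 1)
#define ACCESS_USER        (1ULL << 2)
#define ACCESS_FETCH       (1ULL << 4)

/* Extra access kinds live in the high 32 bits. */
#define ACCESS_GUEST_FINAL (1ULL << 32)
#define ACCESS_GUEST_PAGE  (1ULL << 33)

/* Converting to u32 drops every extended flag in one step. */
uint32_t traditional_access(uint64_t access)
{
	return (uint32_t)access;
}

/* The complementary view: only the extended flags. */
uint64_t extended_access(uint64_t access)
{
	return access & ~0xffffffffULL;
}
```

Because the two halves never overlap, code that only understands the traditional bits can keep working on the truncated value.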
    • KVM: MMU: propagate alloc_workqueue failure · a1a39128
      Committed by Paolo Bonzini
      If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
      to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
      kvm_arch_init_vm also has to undo all the initialization, so
      group all the MMU initialization code at the beginning and
      handle cleaning up of kvm_page_track_init.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
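The error-propagation shape described above is the usual C goto-unwind pattern. The sketch below is a userspace illustration with hypothetical names and a wq_alloc_fails flag to simulate the allocation failure; it is not the actual kvm_arch_init_vm().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct vm {
	bool page_track_ready;
	bool mmu_ready;
};

int page_track_init(struct vm *vm)
{
	vm->page_track_ready = true;
	return 0;
}

void page_track_cleanup(struct vm *vm)
{
	vm->page_track_ready = false;
}

/* wq_alloc_fails simulates the workqueue allocation failing. */
int mmu_init_vm(struct vm *vm, bool wq_alloc_fails)
{
	if (wq_alloc_fails)
		return -ENOMEM;
	vm->mmu_ready = true;
	return 0;
}

/* Fallible steps come first; each failure undoes what already succeeded. */
int arch_init_vm(struct vm *vm, bool wq_alloc_fails)
{
	int ret;

	ret = page_track_init(vm);
	if (ret)
		goto out;

	ret = mmu_init_vm(vm, wq_alloc_fails);
	if (ret)
		goto out_page_track;

	return 0;

out_page_track:
	page_track_cleanup(vm);
out:
	return ret;
}
```

Grouping the fallible MMU setup at the start keeps the unwind path short, since little else has been initialized when it can fail.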
  3. 21 Mar, 2022 2 commits
    • KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2 · 6d849191
      Committed by Oliver Upton
      KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not
      advertise the set of quirks which may be disabled to userspace, so it is
      impossible to predict the behavior of KVM. Worse yet,
      KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning
      it fails to reject attempts to set invalid quirk bits.
      
      The only valid workaround for the quirky quirks API is to add a new CAP.
      Actually advertise the set of quirks that can be disabled to userspace
      so it can predict KVM's behavior. Reject values for cap->args[0] that
      contain invalid bits.
      
      Finally, add documentation for the new capability and describe the
      existing quirks.
      Signed-off-by: Oliver Upton <oupton@google.com>
      Message-Id: <20220301060351.442881-5-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
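The two behaviors the new capability adds, advertising the supported quirk bits and rejecting anything else, can be sketched in a few lines. The quirk names, the struct, and the function names here are hypothetical; only the mask check and the -EINVAL rejection reflect what the commit message describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical quirk bits. */
#define QUIRK_A (1ULL << 0)
#define QUIRK_B (1ULL << 1)
#define QUIRK_C (1ULL << 2)
#define SUPPORTED_QUIRKS (QUIRK_A | QUIRK_B | QUIRK_C)

struct vm {
	uint64_t disabled_quirks;
};

/* What a capability check would report: the full set that may be disabled. */
uint64_t supported_quirks(void)
{
	return SUPPORTED_QUIRKS;
}

/* Enabling the capability: reject any bit outside the advertised set. */
int disable_quirks2(struct vm *vm, uint64_t arg)
{
	assert(vm);
	if (arg & ~SUPPORTED_QUIRKS)
		return -EINVAL;
	vm->disabled_quirks = arg;
	return 0;
}
```

Userspace can therefore query the mask first and know in advance exactly which requests will be accepted.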
    • kvm: x86: Require const tsc for RT · 5e17b2ee
      Committed by Thomas Gleixner
      Non-constant TSC is a nightmare on bare metal already, but with
      virtualization it becomes a complete disaster because the workarounds
      are horrible latency-wise. This is also a prerequisite for running RT in
      a guest on top of an RT host.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Message-Id: <Yh5eJSG19S2sjZfy@linutronix.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 08 Mar, 2022 1 commit
  5. 02 Mar, 2022 1 commit
    • KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run · 8d25b7be
      Committed by Paolo Bonzini
      kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two
      places, namely vcpu_run and post_kvm_run_save, and a third is actually
      needed around the call to vcpu->arch.complete_userspace_io to avoid
      the following splat:
      
        WARNING: suspicious RCU usage
        arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage!
        other info that might help us debug this:
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by CPU 28/KVM/370841:
        #0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm]
        Call Trace:
         <TASK>
         dump_stack_lvl+0x59/0x73
         reprogram_fixed_counter+0x15d/0x1a0 [kvm]
         kvm_pmu_trigger_event+0x1a3/0x260 [kvm]
         ? free_moved_vector+0x1b4/0x1e0
         complete_fast_pio_in+0x8a/0xd0 [kvm]
      
      This splat is not at all unexpected, since complete_userspace_io callbacks
      can execute similar code to vmexits.  For example, SVM with nrips=false
      will call into the emulator from svm_skip_emulated_instruction().
      
      While it's tempting to never acquire kvm->srcu for an uninitialized vCPU,
      practically speaking there's no penalty to acquiring kvm->srcu "early"
      as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU.  On
      the other hand, seemingly innocuous helpers like kvm_apic_accept_events()
      and sync_regs() can theoretically reach code that might access
      SRCU-protected data structures, e.g. sync_regs() can trigger forced
      exiting of nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events().
      Reported-by: Like Xu <likexu@tencent.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 01 Mar, 2022 5 commits
  7. 25 Feb, 2022 5 commits
  8. 19 Feb, 2022 5 commits
  9. 18 Feb, 2022 1 commit
  10. 17 Feb, 2022 3 commits
  11. 16 Feb, 2022 1 commit
  12. 12 Feb, 2022 1 commit
    • KVM: SVM: fix race between interrupt delivery and AVIC inhibition · 66fa226c
      Committed by Maxim Levitsky
      If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
      inhibited, it might read a stale value of vcpu->arch.apicv_active
      which can lead to the target vCPU not noticing the interrupt.
      
      To fix this use load-acquire/store-release so that, if the target vCPU
      is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
      AVIC.  If AVIC has been disabled in the meanwhile, proceed with the
      KVM_REQ_EVENT-based delivery.
      
      Incomplete IPI vmexit has the same races as svm_deliver_avic_intr, and
      in fact it can be handled in exactly the same way; the only difference
      lies in who has set IRR, whether svm_deliver_interrupt or the processor.
      Therefore, svm_complete_interrupt_delivery can be used to fix incomplete
      IPI vmexits as well.
      Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
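The load-acquire/store-release pairing mentioned above can be illustrated with C11 atomics. This is a simplified userspace analogy, not the SVM code: the struct fields, the function names, and the two boolean "outcome" fields are assumptions. The target publishes its AVIC state before announcing guest mode with a release store; a sender that observes guest mode with an acquire load is then guaranteed to also observe that state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { OUTSIDE_GUEST_MODE, IN_GUEST_MODE };

struct vcpu {
	atomic_int mode;
	atomic_bool apicv_active;
	bool doorbell_rung;	/* fast path taken */
	bool event_requested;	/* fallback path taken */
};

/* Target side: publish the AVIC state, then announce guest mode. */
void enter_guest(struct vcpu *v, bool apicv_active)
{
	atomic_store_explicit(&v->apicv_active, apicv_active,
			      memory_order_relaxed);
	atomic_store_explicit(&v->mode, IN_GUEST_MODE, memory_order_release);
}

/* Sender side: seeing IN_GUEST_MODE implies seeing the AVIC state too. */
void deliver_interrupt(struct vcpu *v)
{
	bool in_guest = atomic_load_explicit(&v->mode, memory_order_acquire)
			== IN_GUEST_MODE;

	if (in_guest &&
	    atomic_load_explicit(&v->apicv_active, memory_order_relaxed))
		v->doorbell_rung = true;
	else
		v->event_requested = true;	/* request-based delivery */
}
```

If the AVIC was disabled before the target entered the guest, the sender cannot take the fast path by mistake; it falls back to the request-based delivery.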
  13. 11 Feb, 2022 5 commits
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG · cb00a70b
      Committed by David Matlack
      When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
      write-protected when dirty logging is enabled on the memslot. Instead
      they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
      the first time and only for the specific sub-region being cleared.
      
      Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
      write-protecting to avoid causing write-protection faults on vCPU
      threads. This also allows userspace to smear the cost of huge page
      splitting across multiple ioctls, rather than splitting the entire
      memslot as is the case when initially-all-set is not used.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled · a3fe5dbd
      Committed by David Matlack
      When dirty logging is enabled without initially-all-set, try to split
      all huge pages in the memslot down to 4KB pages so that vCPUs do not
      have to take expensive write-protection faults to split huge pages.
      
      Eager page splitting is best-effort only. This commit only adds the
      support for the TDP MMU, and even there splitting may fail due to out
      of memory conditions. Failure to split a huge page is fine from a
      correctness standpoint because KVM will always follow up splitting by
      write-protecting any remaining huge pages.
      
      Eager page splitting moves the cost of splitting huge pages off of the
      vCPU threads and onto the thread enabling dirty logging on the memslot.
      This is useful because:
      
       1. Splitting on the vCPU thread interrupts vCPU execution and is
          disruptive to customers, whereas splitting on VM ioctl threads can
          run in parallel with vCPU execution.
      
       2. Splitting all huge pages at once is more efficient because it does
          not require performing VM-exit handling or walking the page table for
          every 4KiB page in the memslot, and greatly reduces the amount of
          contention on the mmu_lock.
      
      For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
      per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
      all of their memory after dirty logging is enabled decreased by 95% from
      2.94s to 0.14s.
      
      Eager Page Splitting is over 100x more efficient than the current
      implementation of splitting on fault under the read lock. For example,
      taking the same workload as above, Eager Page Splitting reduced the CPU
      required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
      * 96 vCPU threads) to only 1.55 CPU-seconds.
      
      Eager page splitting does increase the amount of time it takes to enable
      dirty logging since it has to split all huge pages. For example, the time
      it took to enable dirty logging in the 96GiB region of the
      aforementioned test increased from 0.001s to 1.55s.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
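The figures quoted above can be rechecked with a few lines of arithmetic. The helper names are hypothetical; the inputs (2.94s, 0.14s, 96 vCPU threads, 1.55 CPU-seconds) are the ones given in the commit message.

```c
#include <assert.h>

/* CPU time spent across all threads for a given wall-clock difference. */
double cpu_seconds(double wall_before, double wall_after, unsigned int threads)
{
	assert(wall_before >= wall_after);
	return (wall_before - wall_after) * threads;
}

/* Relative reduction, in percent. */
double percent_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

With the commit's numbers, cpu_seconds(2.94, 0.14, 96) comes to 268.8, which the message rounds to ~270 CPU-seconds. Dividing by 1.55 gives roughly 173, consistent with "over 100x", and percent_reduction(2.94, 0.14) is a little over 95.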
    • KVM: x86: Use more verbose names for mem encrypt kvm_x86_ops hooks · 03d004cd
      Committed by Sean Christopherson
      Use slightly more verbose names for the so called "memory encrypt",
      a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
      super short kvm_x86_ops names and SVM's more verbose, but non-conforming
      names.  This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP()
      to fill svm_x86_ops.
      
      Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
      reflect its true nature, as it really is a full-fledged ioctl() of its
      own.  Ideally, the hook would be named confidential_vm_ioctl() or so, as
      the ioctl() is a gateway to more than just memory encryption, and because
      its underlying purpose is to support Confidential VMs, which can be
      provided without memory encryption, e.g. if the TCB of the guest includes
      the host kernel but not host userspace, or by isolation in hardware
      without encrypting memory.  But, diverging from KVM_MEMORY_ENCRYPT_OP even
      further is undesirable, and short of creating aliases for all related
      ioctl()s, which introduces a different flavor of divergence, KVM is stuck
      with the nomenclature.
      
      Defer renaming SVM's functions to a future commit as there are additional
      changes needed to make SVM fully conforming and to match reality (looking
      at you, svm_vm_copy_asid_from()).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move get_cs_db_l_bits() helper to SVM · 872e0c53
      Committed by Sean Christopherson
      Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that
      its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a
      superfluous export from KVM x86.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use static_call() for copy/move encryption context ioctls() · 7ad02ef0
      Committed by Sean Christopherson
      Define and use static_call()s for .vm_{copy,move}_enc_context_from(),
      mostly so that the op is defined in kvm-x86-ops.h.  This will allow using
      KVM_X86_OP in vendor code to wire up the implementation.  Any performance
      gains eked out by using static_call() are a happy bonus and not the
      primary motivation.
      
      Opportunistically refactor the code to reduce indentation and keep line
      lengths reasonable, and to be consistent when wrapping versus running
      a bit over the 80 char soft limit.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>