1. 14 April 2022, 3 commits
  2. 12 April 2022, 1 commit
    • KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU · 42dcbe7d
      Committed by Vitaly Kuznetsov
      The following WARN is triggered from kvm_vm_ioctl_set_clock():
       WARNING: CPU: 10 PID: 579353 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 mark_page_dirty_in_slot+0x6c/0x80 [kvm]
       ...
       CPU: 10 PID: 579353 Comm: qemu-system-x86 Tainted: G        W  O      5.16.0.stable #20
       Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
       RIP: 0010:mark_page_dirty_in_slot+0x6c/0x80 [kvm]
       ...
       Call Trace:
        <TASK>
        ? kvm_write_guest+0x114/0x120 [kvm]
        kvm_hv_invalidate_tsc_page+0x9e/0xf0 [kvm]
        kvm_arch_vm_ioctl+0xa26/0xc50 [kvm]
        ? schedule+0x4e/0xc0
        ? __cond_resched+0x1a/0x50
        ? futex_wait+0x166/0x250
        ? __send_signal+0x1f1/0x3d0
        kvm_vm_ioctl+0x747/0xda0 [kvm]
        ...
      
      The WARN was introduced by commit 03c0304a86bc ("KVM: Warn if
      mark_page_dirty() is called without an active vCPU"), but the change
      seems to be correct (unlike the Hyper-V TSC page update mechanism).
      In fact, there's no real need to actually write to guest memory to
      invalidate the TSC page; this can be done by the first vCPU which
      goes through kvm_guest_time_update().
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220407201013.963226-1-vkuznets@redhat.com>
  3. 05 April 2022, 1 commit
    • KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 1d0e8480
      Committed by Sean Christopherson
      Resolve nx_huge_pages to true/false when kvm.ko is loaded. Leaving it
      as -1 is technically undefined behavior when its value is read out by
      param_get_bool(), as boolean values are supposed to be '0' or '1'.
      
      Alternatively, KVM could define a custom getter for the param, but the
      auto value doesn't depend on the vendor module in any way, and printing
      "auto" would be unnecessarily unfriendly to the user.
      
      In addition to fixing the undefined behavior, resolving the auto value
      also fixes the scenario where the auto value resolves to N and no vendor
      module is loaded.  Previously, -1 would result in Y being printed even
      though KVM would ultimately disable the mitigation.
      
      Rename the existing MMU module init/exit helpers to clarify that they're
      invoked with respect to the vendor module, and add comments to document
      why KVM has two separate "module init" flows.
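      A minimal sketch of the resulting flow. get_nx_auto_mode() and
      kvm_mmu_x86_module_init() are the helper names used upstream, but
      treat the bodies as approximate:

        static bool get_nx_auto_mode(void)
        {
                /* "auto" enables the mitigation iff the CPU is affected. */
                return boot_cpu_has_bug(X86_BUG_ITLB_MULTIHIT) &&
                       !cpu_mitigations_off();
        }

        /* kvm.ko module init: resolve "auto" (-1) to a real bool so that
         * param_get_bool() never loads 0xff out of a _Bool. */
        void __init kvm_mmu_x86_module_init(void)
        {
                if (nx_huge_pages == -1)
                        __set_nx_huge_pages(get_nx_auto_mode());
        }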
      
        =========================================================================
        UBSAN: invalid-load in kernel/params.c:320:33
        load of value 255 is not a valid value for type '_Bool'
        CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         ubsan_epilogue+0x5/0x40
         __ubsan_handle_load_invalid_value.cold+0x43/0x48
         param_get_bool.cold+0xf/0x14
         param_attr_show+0x55/0x80
         module_attr_show+0x1c/0x30
         sysfs_kf_seq_show+0x93/0xc0
         seq_read_iter+0x11c/0x450
         new_sync_read+0x11b/0x1a0
         vfs_read+0xf0/0x190
         ksys_read+0x5f/0xe0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        =========================================================================
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Cc: stable@vger.kernel.org
      Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220331221359.3912754-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 02 April 2022, 21 commits
    • KVM: x86: optimize PKU branching in kvm_load_{guest|host}_xsave_state · 945024d7
      Committed by Jon Kohler
      kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
      part of which is managing memory protection key state. The latest
      arch.pkru is updated with a rdpkru, and if that doesn't match the base
      host_pkru (which it doesn't about 70% of the time), we issue a
      __write_pkru.
      
      To improve performance, implement the following optimizations:
       1. Reorder if conditions prior to wrpkru in both
          kvm_load_{guest|host}_xsave_state.
      
           Flip the ordering of the || condition so that XFEATURE_MASK_PKRU
           is checked first; when instrumented in our environment this
           appeared to be always true, and it is less overall work than
           kvm_read_cr4_bits.

           For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru
           ahead one position. When instrumented, I saw this be true roughly
           ~70% of the time, whereas the other conditions were almost always
           true. With this change, we avoid the third condition check ~30%
           of the time.

        2. Wrap the PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS,
           so that if the user compiles out this feature, these branches do
           not exist at all. The reordered check is sketched below.
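      A hedged sketch of the reordered guest-entry check (field names from
      the kernel; surrounding code elided, and the cpu_feature_enabled(PKU)
      test is omitted for brevity):

        #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
                if (vcpu->arch.pkru != vcpu->arch.host_pkru &&   /* ~70% true */
                    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||   /* ~always true */
                     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))      /* costly, last */
                        write_pkru(vcpu->arch.pkru);
        #endif
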
      Signed-off-by: Jon Kohler <jon@nutanix.com>
      Message-Id: <20220324004439.6709-1-jon@nutanix.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: allow per cpu apicv inhibit reasons · d5fa597e
      Committed by Maxim Levitsky
      Add optional callback .vcpu_get_apicv_inhibit_reasons returning
      extra inhibit reasons that prevent APICv from working on this vCPU.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322174050.241850-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Don't snapshot "max" TSC if host TSC is constant · 741e511b
      Committed by Sean Christopherson
      Don't snapshot tsc_khz into max_tsc_khz during KVM initialization if
      the host TSC is constant, in which case the actual TSC frequency will
      never change; capturing the "max" TSC during initialization is then
      unnecessary, and KVM can simply use tsc_khz during VM creation.
      
      On CPUs with constant TSC, but not a hardware-specified TSC frequency,
      snapshotting max_tsc_khz and using that to set a VM's default TSC
      frequency can lead to KVM thinking it needs to manually scale the guest's
      TSC if refining the TSC completes after KVM snapshots tsc_khz.  The
      actual frequency never changes, only the kernel's calculation of what
      that frequency is changes.  On systems without hardware TSC scaling, this
      either puts KVM into "always catchup" mode (extremely inefficient), or
      prevents creating VMs altogether.
      
      Ideally, KVM would not be able to race with TSC refinement, or would have
      a hook into tsc_refine_calibration_work() to get an alert when refinement
      is complete.  Avoiding the race altogether isn't practical as refinement
      takes a relative eternity; it's deliberately put on a work queue outside
      of the normal boot sequence to avoid unnecessarily delaying boot.
      
      Adding a hook is doable, but somewhat gross due to KVM's ability to be
      built as a module.  And if the TSC is constant, which is likely the case
      for every VMX/SVM-capable CPU produced in the last decade, the race can
      be hit if and only if userspace is able to create a VM before TSC
      refinement completes; refinement is slow, but not that slow.
      
      For now, punt on a proper fix, as not taking a snapshot can help some
      use cases and not taking a snapshot is arguably correct irrespective
      of the race with refinement.
      
      [ dwmw2: Rebase on top of KVM-wide default_tsc_khz to ensure that all
               vCPUs get the same frequency even if we hit the race. ]
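      The shape of the change, hedged (the VM-creation fallback follows from
      the text above; the exact diff may differ):

        static void kvm_timer_init(void)
        {
                if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
                        /* frequency can change: keep the snapshot */
                        max_tsc_khz = tsc_khz;
                        /* ... cpufreq notifier registration elided ... */
                }
        }

        /* at VM creation, no snapshot means "use the live calibration" */
        kvm->arch.default_tsc_khz = max_tsc_khz ? max_tsc_khz : tsc_khz;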
      
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Anton Romanov <romanton@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20220225145304.36166-3-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl. · ffbb61d0
      Committed by David Woodhouse
      This sets the default TSC frequency for subsequently created vCPUs.
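      A hypothetical userspace usage, assuming vm_fd came from
      KVM_CREATE_VM; the ioctl takes the frequency in kHz as its argument:

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Set the default TSC frequency once, before creating vCPUs. */
        if (ioctl(vm_fd, KVM_SET_TSC_KHZ, 2000000UL) < 0)   /* 2 GHz */
                perror("KVM_SET_TSC_KHZ");
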
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20220225145304.36166-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND · 661a20fa
      Committed by David Woodhouse
      At the end of the patch series adding this batch of event channel
      acceleration features, finally add the feature bit which advertises
      them and document it all.
      
      For SCHEDOP_poll we need to wake a polling vCPU when a given port
      is triggered, even when it's masked — and we want to implement that
      in the kernel, for efficiency. So we want the kernel to know that it
      has sole ownership of event channel delivery. Thus, we allow
      userspace to make the 'promise' by setting the corresponding feature
      bit in its KVM_XEN_HVM_CONFIG call. As we implement SCHEDOP_poll
      bypass later, we will do so only if that promise has been made by
      userspace.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-16-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID · 942c2490
      Committed by David Woodhouse
      In order to intercept hypercalls such as VCPUOP_set_singleshot_timer, we
      need to be aware of the Xen CPU numbering.
      
      This looks a lot like the Hyper-V handling of vpidx, for obvious reasons.
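      A hedged example of telling KVM a vCPU's Xen vCPU number through the
      existing KVM_XEN_VCPU_SET_ATTR ioctl (xen_vcpu_id is illustrative):

        struct kvm_xen_vcpu_attr va = {
                .type = KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID,
                .u.vcpu_id = xen_vcpu_id,  /* guest-visible Xen vCPU number */
        };
        ioctl(vcpu_fd, KVM_XEN_VCPU_SET_ATTR, &va);
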
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-12-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Support direct injection of event channel events · 35025735
      Committed by David Woodhouse
      This adds a KVM_XEN_HVM_EVTCHN_SEND ioctl which allows direct injection
      of events given an explicit { vcpu, port, priority } in precisely the
      same form that those fields are given in the IRQ routing table.
      
      Userspace is currently able to inject 2-level events purely by setting
      the bits in the shared_info and vcpu_info, but FIFO event channels are
      harder to deal with; we will need the kernel to take sole ownership of
      delivery when we support those.
      
      A patch advertising this feature with a new bit in the KVM_CAP_XEN_HVM
      ioctl will be added in a subsequent patch.
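      A hedged userspace example; the struct is the same
      kvm_irq_routing_xen_evtchn used in the IRQ routing table:

        struct kvm_irq_routing_xen_evtchn evt = {
                .port     = port,        /* event channel port number */
                .vcpu     = vcpu_idx,    /* target vCPU */
                .priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL,
        };
        ioctl(vm_fd, KVM_XEN_HVM_EVTCHN_SEND, &evt);
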
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-9-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_time_info · 69d413cf
      Committed by David Woodhouse
      This switches the final pvclock to kvm_setup_pvclock_pfncache() and now
      the old kvm_setup_pvclock_page() can be removed.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-7-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_info · 7caf9571
      Committed by David Woodhouse
      Currently, the fast path of kvm_xen_set_evtchn_fast() doesn't set the
      index bits in the target vCPU's evtchn_pending_sel, because it only has
      a userspace virtual address with which to do so. It just sets them in
      the kernel, and kvm_xen_has_interrupt() then completes the delivery to
      the actual vcpu_info structure when the vCPU runs.
      
      Using a gfn_to_pfn_cache allows kvm_xen_set_evtchn_fast() to do the full
      delivery in the common case.
      
      Clean up the fallback case too, by moving the deferred delivery out into
      a separate kvm_xen_inject_pending_events() function which isn't ever
      called in atomic contexts as __kvm_xen_has_interrupt() is.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-6-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use gfn_to_pfn_cache for pv_time · 916d3608
      Committed by David Woodhouse
      Add a new kvm_setup_guest_pvclock() which parallels the existing
      kvm_setup_pvclock_page(). The latter will be removed once we convert
      all users to the gfn_to_pfn_cache version.
      
      Using the new cache, we can potentially let kvm_set_guest_paused() set
      the PVCLOCK_GUEST_STOPPED bit directly rather than having to delegate
      to the vCPU via KVM_REQ_CLOCK_UPDATE. But not yet.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-5-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Use gfn_to_pfn_cache for runstate area · a795cd43
      Committed by David Woodhouse
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-4-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Allow userspace to opt out of hypercall patching · f1a9761f
      Committed by Oliver Upton
      KVM handles the VMCALL/VMMCALL instructions very strangely. Even though
      both of these instructions really should #UD when executed on the wrong
      vendor's hardware (i.e. VMCALL on SVM, VMMCALL on VMX), KVM replaces the
      guest's instruction with the appropriate instruction for the vendor.
      Nonetheless, older guest kernels without commit c1118b36 ("x86: kvm:
      use alternatives for VMCALL vs. VMMCALL if kernel text is read-only")
      do not patch in the appropriate instruction using alternatives, likely
      motivating KVM's intervention.
      
      Add a quirk allowing userspace to opt out of hypercall patching. If the
      quirk is disabled, KVM synthesizes a #UD in the guest.
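      A hedged sketch of opting out, using the KVM_CAP_DISABLE_QUIRKS2
      mechanism introduced later in this log (the quirk name matches the
      uapi; error handling elided):

        struct kvm_enable_cap cap = {
                .cap     = KVM_CAP_DISABLE_QUIRKS2,
                .args[0] = KVM_X86_QUIRK_FIX_HYPERCALL_INSN,
        };
        /* after this, a cross-vendor VMCALL/VMMCALL gets a #UD */
        ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
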
      Signed-off-by: Oliver Upton <oupton@google.com>
      Message-Id: <20220316005538.2282772-2-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: SVM: fix tsc scaling when the host doesn't support it · 88099313
      Committed by Maxim Levitsky
      It was decided that when TSC scaling is not supported, the virtual
      MSR_AMD64_TSC_RATIO should still have the default '1.0' value.

      However, in this case kvm_max_tsc_scaling_ratio is not set, which
      breaks various assumptions.
      
      Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of
      host support.  For consistency, do the same for VMX.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322172449.235575-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr · ac8d6cad
      Committed by Hou Wenlong
      If an MSR access is rejected by MSR filtering,
      kvm_set_msr()/kvm_get_msr() return KVM_MSR_RET_FILTERED, and that
      return value is only handled well for rdmsr/wrmsr. However, some
      instruction emulation and state transitions also use
      kvm_set_msr()/kvm_get_msr() for MSR accesses, and may trigger
      unexpected results if the access is rejected. E.g. RDPID emulation
      would inject a #UD, even though RDPID doesn't cause an exit when it
      is supported in hardware and ENABLE_RDTSCP is set. Rejection would
      also cause failures when loading MSRs at nested entry/exit. Since MSR
      filtering is based on the MSR bitmap, it is better to only do MSR
      filtering for rdmsr/wrmsr.
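      A sketch of the resulting split; the _with_filter name is from the
      patch, but treat the body as approximate:

        /* Only the rdmsr/wrmsr emulation paths go through the filter. */
        int kvm_get_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 *data)
        {
                if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ))
                        return KVM_MSR_RET_FILTERED;
                return kvm_get_msr(vcpu, index, data);  /* unfiltered accessor */
        }
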
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/emulator: Emulate RDPID only if it is enabled in guest · a836839c
      Committed by Hou Wenlong
      When RDTSCP is supported but RDPID is not supported on the host,
      RDPID emulation is available. However, __kvm_get_msr() would only
      fail when RDTSCP and RDPID are both disabled in the guest, so the
      emulator wouldn't inject a #UD when RDPID is disabled but RDTSCP is
      enabled in the guest.
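      A hedged sketch of the emulator-side fix; guest_has_rdpid is the
      emulator hook consulted here, with the body approximate:

        static int em_rdpid(struct x86_emulate_ctxt *ctxt)
        {
                u64 tsc_aux = 0;

                if (!ctxt->ops->guest_has_rdpid(ctxt))
                        return emulate_ud(ctxt);   /* RDPID disabled in guest */

                ctxt->ops->get_msr(ctxt, MSR_TSC_AUX, &tsc_aux);
                /* ... store tsc_aux into the destination operand ... */
                return X86EMUL_CONTINUE;
        }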
      
      Fixes: fb6d4d34 ("KVM: x86: emulate RDPID")
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Trace all APICv inhibit changes and capture overall status · 4f4c4a3e
      Committed by Sean Christopherson
      Trace all APICv inhibit changes instead of just those that result in
      APICv being (un)inhibited, and log the current state.  Debugging why
      APICv isn't working is frustrating as it's hard to see why APICv is still
      inhibited, and logging only the first inhibition means unnecessary onion
      peeling.
      
      Opportunistically drop the export of the tracepoint; it is not and
      should not be used by vendor code, due to the need to serialize
      toggling via apicv_update_lock.
      
      Note, using the common flow means kvm_apicv_init() switched from atomic
      to non-atomic bitwise operations.  The VM is unreachable at init, so
      non-atomic is perfectly ok.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add wrappers for setting/clearing APICv inhibits · 320af55a
      Committed by Sean Christopherson
      Add set/clear wrappers for toggling APICv inhibits to make the call sites
      more readable, and opportunistically rename the inner helpers to align
      with the new wrappers and to make them more readable as well.  Invert the
      flag from "activate" to "set"; activate is painfully ambiguous as it's
      not obvious if the inhibit is being activated, or if APICv is being
      activated, in which case the inhibit is being deactivated.
      
      For the functions that take @set, swap the order of the inhibit reason
      and @set so that the call sites are visually similar to those that bounce
      through the wrapper.
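      The wrappers, per the patch (signatures hedged; the enum comes from
      the related patch below):

        void kvm_set_apicv_inhibit(struct kvm *kvm,
                                   enum kvm_apicv_inhibit reason)
        {
                kvm_set_or_clear_apicv_inhibit(kvm, reason, true);
        }

        void kvm_clear_apicv_inhibit(struct kvm *kvm,
                                     enum kvm_apicv_inhibit reason)
        {
                kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
        }
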
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Make APICv inhibit reasons an enum and cleanup naming · 7491b7b2
      Committed by Sean Christopherson
      Use an enum for the APICv inhibit reasons, there is no meaning behind
      their values and they most definitely are not "unsigned longs".  Rename
      the various params to "reason" for consistency and clarity (inhibit may
      be confused as a command, i.e. inhibit APICv, instead of the reason that
      is getting toggled/checked).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Handle implicit supervisor access with SMAP · 4f4aa80e
      Committed by Lai Jiangshan
      There are two kinds of implicit supervisor access:
      	implicit supervisor access when CPL = 3
      	implicit supervisor access when CPL < 3

      Currently, permission_fault() handles only the first kind for SMAP.

      But if the access is implicit when SMAP is on, data may be neither
      read nor written from any user-mode address, regardless of the
      current CPL.

      So the second kind should also be supported.

      The first kind can be detected via CPL and access mode: if it is a
      supervisor access and CPL = 3, it must be an implicit supervisor
      access.

      But it is not possible to detect the second kind without extra
      information, so this patch adds an artificial PFERR_IMPLICIT_ACCESS
      bit into @access. This extra information also works for the first
      kind, so the logic is changed to use it for both cases.

      The value of PFERR_IMPLICIT_ACCESS is deliberately chosen to be bit
      48, which is in the most significant 16 bits of a u64 and less likely
      to be forced to change by future hardware use.

      This patch also removes the call to ->get_cpl(), as the access mode
      is now determined by @access. Not only does this save a function
      call, it also removes confusion when the permission is checked for
      nested TDP. Nested TDP shouldn't have SMAP checking, nor should the
      L2's CPL have any bearing on it. The original code works only because
      the walk is always a user walk for NPT, and the SMAP fault bit is
      never set for EPT in update_permission_bitmask.
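      A short illustration of the new bit (value per the text above; the
      surrounding permission check is elided):

        #define PFERR_IMPLICIT_ACCESS  BIT_ULL(48)

        /* e.g. a CPU-initiated GDT/LDT/IDT fetch on the guest's behalf:
         * tagged implicit, so SMAP applies regardless of the current CPL */
        u64 access = PFERR_IMPLICIT_ACCESS;   /* no PFERR_USER_MASK set */
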
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Change the type of access u32 to u64 · 5b22bbe7
      Committed by Lai Jiangshan
      Change the type of @access from u32 to u64 for FNAME(walk_addr) and
      ->gva_to_gpa().

      The kinds of access are usually combinations of UWX, and VMX/SVM's
      nested paging adds a new factor: whether the access is for a guest
      page table or for the final guest physical address.

      And SMAP adds another factor for supervisor access: whether it is
      explicit or implicit.

      So @access in FNAME(walk_addr) and ->gva_to_gpa() should include all
      of this information to do the walk.

      Although @access as a u32 has enough bits to encode all the kinds,
      this patch extends it to u64 (a short illustration follows the list):
      	o Extra bits will be in the higher 32 bits, so that the
      	  traditional access mode (UWX) can easily be obtained by
      	  converting it to u32.
      	o Reuse the values for the access kinds defined by SVM's nested
      	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
      	  @error_code in kvm_handle_page_fault().
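      A tiny illustration of the encoding choice (masks per the kernel's
      existing PFERR definitions):

        u64 access = PFERR_WRITE_MASK |       /* bit 1: traditional W */
                     PFERR_GUEST_PAGE_MASK;   /* bit 33: nested-paging kind */
        u32 uwx = (u32)access;                /* only the UWX bits survive */
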
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: propagate alloc_workqueue failure · a1a39128
      Committed by Paolo Bonzini
      If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has to be
      propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
      kvm_arch_init_vm also has to undo all the initialization, so group
      all the MMU initialization code at the beginning and handle the
      cleanup of kvm_page_track_init.
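      A hedged sketch of the propagation (return values approximate):

        int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
        {
                struct workqueue_struct *wq;

                if (!tdp_enabled)
                        return 0;

                wq = alloc_workqueue("kvm", WQ_UNBOUND | WQ_MEM_RECLAIM |
                                     WQ_CPU_INTENSIVE, 0);
                if (!wq)
                        return -ENOMEM;   /* bubbles up to kvm_arch_init_vm() */

                kvm->arch.tdp_mmu_zap_wq = wq;
                return 1;
        }
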
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 21 March 2022, 2 commits
    • KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2 · 6d849191
      Committed by Oliver Upton
      KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not
      advertise the set of quirks which may be disabled to userspace, so it is
      impossible to predict the behavior of KVM. Worse yet,
      KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning
      it fails to reject attempts to set invalid quirk bits.
      
      The only valid workaround for the quirky quirks API is to add a new CAP.
      Actually advertise the set of quirks that can be disabled to userspace
      so it can predict KVM's behavior. Reject values for cap->args[0] that
      contain invalid bits.
      
      Finally, add documentation for the new capability and describe the
      existing quirks.
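      A hedged userspace example of the check-then-enable flow
      (KVM_X86_QUIRK_CD_NW_CLEARED stands in for any quirk bit):

        int supported = ioctl(vm_fd, KVM_CHECK_EXTENSION,
                              KVM_CAP_DISABLE_QUIRKS2);
        if (supported & KVM_X86_QUIRK_CD_NW_CLEARED) {
                struct kvm_enable_cap cap = {
                        .cap     = KVM_CAP_DISABLE_QUIRKS2,
                        .args[0] = KVM_X86_QUIRK_CD_NW_CLEARED,
                };
                ioctl(vm_fd, KVM_ENABLE_CAP, &cap);  /* invalid bits: -EINVAL */
        }
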
      Signed-off-by: Oliver Upton <oupton@google.com>
      Message-Id: <20220301060351.442881-5-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Require const tsc for RT · 5e17b2ee
      Committed by Thomas Gleixner
      A non-constant TSC is a nightmare on bare metal already, but with
      virtualization it becomes a complete disaster because the workarounds
      are horrible latency-wise. This is also a prerequisite for running RT
      in a guest on top of an RT host.
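      The check itself is small; a sketch of its shape (placed in KVM's x86
      init path):

        if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
            !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
                pr_err("RT requires X86_FEATURE_CONSTANT_TSC\n");
                return -EOPNOTSUPP;
        }
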
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Message-Id: <Yh5eJSG19S2sjZfy@linutronix.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 08 March 2022, 1 commit
  7. 02 March 2022, 1 commit
    • KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run · 8d25b7be
      Committed by Paolo Bonzini
      kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two
      places, namely vcpu_run and post_kvm_run_save, and a third is actually
      needed around the call to vcpu->arch.complete_userspace_io to avoid
      the following splat:
      
        WARNING: suspicious RCU usage
        arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage!
        other info that might help us debug this:
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by CPU 28/KVM/370841:
        #0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm]
        Call Trace:
         <TASK>
         dump_stack_lvl+0x59/0x73
         reprogram_fixed_counter+0x15d/0x1a0 [kvm]
         kvm_pmu_trigger_event+0x1a3/0x260 [kvm]
         ? free_moved_vector+0x1b4/0x1e0
         complete_fast_pio_in+0x8a/0xd0 [kvm]
      
      This splat is not at all unexpected, since complete_userspace_io callbacks
      can execute similar code to vmexits.  For example, SVM with nrips=false
      will call into the emulator from svm_skip_emulated_instruction().
      
      While it's tempting to never acquire kvm->srcu for an uninitialized vCPU,
      practically speaking there's no penalty to acquiring kvm->srcu "early"
      as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU.  On
      the other hand, seemingly innocuous helpers like kvm_apic_accept_events()
      and sync_regs() can theoretically reach code that might access
      SRCU-protected data structures, e.g. sync_regs() can trigger forced
      existing of nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events().
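      A hedged sketch of the resulting shape of kvm_arch_vcpu_ioctl_run():
      one outer SRCU read-side section instead of several scattered ones:

        vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
        /* ... kvm_apic_accept_events(), sync_regs(), vcpu_run() ... */
        if (vcpu->arch.complete_userspace_io)
                r = vcpu->arch.complete_userspace_io(vcpu);  /* now covered */
        /* ... post_kvm_run_save() ... */
        srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
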
      Reported-by: Like Xu <likexu@tencent.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 01 March 2022, 5 commits
  9. 25 February 2022, 5 commits