1. 12 May 2022, 3 commits
  2. 02 May 2022, 1 commit
  3. 30 April 2022, 5 commits
    • KVM: SVM: Introduce trace point for the slow-path of avic_kic_target_vcpus · 9f084f7c
      Suravee Suthikulpanit committed
      This can help identify potential performance issues when handling
      AVIC incomplete IPIs due to the target vCPU not running.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220420154954.19305-3-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: replace direct_map with root_role.direct · 347a0d0d
      Paolo Bonzini committed
      direct_map is always equal to the direct field of the root page's role
      (see the sketch after this list):
      
      - for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is
      copied from cpu_role.base.direct
      
      - for TDP, it is always true and root_role.direct is also always true
      
      - for shadow TDP, it is always false and root_role.direct is also always
      false
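
      The patch thus boils down to an invariant; a minimal sketch of the idea
      (kvm_mmu_is_direct() is a hypothetical illustration, not part of the patch):

        /* After the patch, "is this MMU direct?" is answered by the root
         * role bit instead of a separate direct_map field.
         */
        static inline bool kvm_mmu_is_direct(struct kvm_vcpu *vcpu)
        {
                return vcpu->arch.mmu->root_role.direct;
        }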
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Clean up and document nested #PF workaround · 6819af75
      Sean Christopherson committed
      Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround
      with an explicit, common workaround in kvm_inject_emulated_page_fault().
      Aside from being a hack, the current approach is brittle and incomplete,
      e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(),
      and nVMX fails to apply the workaround when VMX is intercepting #PF due
      to allow_smaller_maxphyaddr=1.
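
      A hedged sketch of the resulting control flow (helper name hypothetical,
      not the literal diff):

        void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
                                            struct x86_exception *fault)
        {
                /* Apply the #PF => #PF => #DF workaround in the common
                 * path so neither nSVM nor nVMX can forget to hook it.
                 */
                if (is_guest_mode(vcpu))
                        kvm_nested_pf_workaround(vcpu, fault);  /* hypothetical */
                else
                        kvm_queue_exception_e_p(vcpu, PF_VECTOR,
                                                fault->error_code, fault->address);
        }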
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT · d495f942
      Paolo Bonzini committed
      When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
      member that at the time was unused.  Unfortunately this extensibility
      mechanism has several issues:
      
      - x86 is not writing the member, so it would not be possible to use it
        on x86 except for new events
      
      - the member is not aligned to 64 bits, so the definition of the
        uAPI struct is incorrect for 32-bit userspace on a 64-bit kernel.  This
        is a problem for RISC-V, which supports CONFIG_KVM_COMPAT, but
        fortunately usage of flags was only introduced in 5.18.
      
      Since padding has to be introduced anyway, place a new field there
      that tells whether the flags field is valid.  To allow further
      extensibility, change flags to an array of 16 values, and store how many
      of the values are valid.  The availability of the new ndata field
      is tied to a system capability; all architectures are changed to
      fill in the field.
      
      To avoid breaking compilation of userspace that was using the flags
      field, provide a userspace-only union to overlap flags with data[0].
      The new field is placed at the same offset for both 32- and 64-bit
      userspace.
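
      The resulting uAPI layout looks roughly as follows (sketch assembled from
      the description above; include/uapi/linux/kvm.h is authoritative):

        /* KVM_EXIT_SYSTEM_EVENT */
        struct {
                __u32 type;
                __u32 ndata;    /* new: number of valid data[] entries */
                union {
        #ifndef __KERNEL__
                        __u64 flags;    /* userspace-only alias of data[0] */
        #endif
                        __u64 data[16];
                };
        } system_event;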
      
      Cc: Will Deacon <will@kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR · 86931ff7
      Sean Christopherson committed
      Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
      MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
      The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
      guest, possibly with help from userspace, manages to coerce KVM into
      creating a SPTE for an "impossible" gfn, KVM will leak the associated
      shadow pages (page tables):
      
        WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                      kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2d0 [kvm]
         kvm_vm_release+0x1d/0x30 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5b/0x90
         exit_to_user_mode_prepare+0xd2/0xe0
         syscall_exit_to_user_mode+0x1d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      On bare metal, encountering an impossible gpa in the page fault path is
      well and truly impossible, barring CPU bugs, as the CPU will signal #PF
      during the gva=>gpa translation (or a similar failure when stuffing a
      physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
      itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
      of the underlying hardware, in which case the hardware will not fault on
      the illegal-from-KVM's-perspective gpa.
      
      Alternatively, KVM could continue allowing the dodgy behavior and simply
      zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
      a (minor) waste of cycles, and more importantly, KVM can't reasonably
      support impossible memslots when running on bare metal (or with an
      accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
      KVM is running as a guest is not a safe option as the host isn't required
      to announce itself to the guest in any way, e.g. doesn't need to set the
      HYPERVISOR CPUID bit.
      
      A second alternative to disallowing the memslot behavior would be to
      disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
      restriction is undesirable as there are legitimate use cases for doing
      so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
      systems so that VMs can be migrated between hosts with different
      MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
      
      Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
      even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
      any reserved physical address bits).
      
      The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
      The memslot and TDP MMU code want an exclusive value, but the name
      implies the returned value is inclusive, and the MMIO path needs an
      inclusive check.
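
      The helper reads roughly as follows (hedged sketch per the description):

        static inline gfn_t kvm_mmu_max_gfn(void)
        {
                /* Inclusive bound: with TDP the host's MAXPHYADDR limits
                 * the gfn space, with shadow paging any guest.MAXPHYADDR
                 * up to 52 bits is legal.
                 */
                int max_gpa_bits = likely(tdp_enabled) ? shadow_phys_bits : 52;

                return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
        }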
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220428233416.2446833-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 14 April 2022, 5 commits
  5. 12 April 2022, 1 commit
    • KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU · 42dcbe7d
      Vitaly Kuznetsov committed
      The following WARN is triggered from kvm_vm_ioctl_set_clock():
       WARNING: CPU: 10 PID: 579353 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 mark_page_dirty_in_slot+0x6c/0x80 [kvm]
       ...
       CPU: 10 PID: 579353 Comm: qemu-system-x86 Tainted: G        W  O      5.16.0.stable #20
       Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
       RIP: 0010:mark_page_dirty_in_slot+0x6c/0x80 [kvm]
       ...
       Call Trace:
        <TASK>
        ? kvm_write_guest+0x114/0x120 [kvm]
        kvm_hv_invalidate_tsc_page+0x9e/0xf0 [kvm]
        kvm_arch_vm_ioctl+0xa26/0xc50 [kvm]
        ? schedule+0x4e/0xc0
        ? __cond_resched+0x1a/0x50
        ? futex_wait+0x166/0x250
        ? __send_signal+0x1f1/0x3d0
        kvm_vm_ioctl+0x747/0xda0 [kvm]
        ...
      
      The WARN was introduced by commit 03c0304a86bc ("KVM: Warn if
      mark_page_dirty() is called without an active vCPU"), but the change seems
      to be correct (unlike the Hyper-V TSC page update mechanism).  In fact,
      there's no real need to actually write to guest memory to invalidate the
      TSC page; this can be done by the first vCPU that goes through
      kvm_guest_time_update().
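
      A minimal sketch of the idea, reusing the existing TSC-page state machine
      (field and enum names here are assumptions):

        /* ioctl context: no active vCPU, so only mark the page stale. */
        hv->hv_tsc_page_status = HV_TSC_PAGE_HOST_CHANGED;
        kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);

        /* The first vCPU through kvm_guest_time_update() then performs
         * the actual guest-memory write that invalidates the TSC page.
         */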
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220407201013.963226-1-vkuznets@redhat.com>
  6. 05 April 2022, 1 commit
    • KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 1d0e8480
      Sean Christopherson committed
      Resolve nx_huge_pages to true/false when kvm.ko is loaded; leaving it as
      -1 is technically undefined behavior when its value is read out by
      param_get_bool(), as boolean values are supposed to be '0' or '1'.
      
      Alternatively, KVM could define a custom getter for the param, but the
      auto value doesn't depend on the vendor module in any way, and printing
      "auto" would be unnecessarily unfriendly to the user.
      
      In addition to fixing the undefined behavior, resolving the auto value
      also fixes the scenario where the auto value resolves to N and no vendor
      module is loaded.  Previously, -1 would result in Y being printed even
      though KVM would ultimately disable the mitigation.
      
      Rename the existing MMU module init/exit helpers to clarify that they're
      invoked with respect to the vendor module, and add comments to document
      why KVM has two separate "module init" flows.
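
      Concretely, the auto value is resolved once when kvm.ko loads; a sketch
      of the idea (simplified from the description):

        /* Invoked when kvm.ko is loaded, before any vendor module. */
        void __init kvm_mmu_x86_module_init(void)
        {
                /* -1 ("auto") must never be visible to param_get_bool(). */
                if (nx_huge_pages == -1)
                        nx_huge_pages = get_nx_auto_mode();
        }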
      
        =========================================================================
        UBSAN: invalid-load in kernel/params.c:320:33
        load of value 255 is not a valid value for type '_Bool'
        CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         ubsan_epilogue+0x5/0x40
         __ubsan_handle_load_invalid_value.cold+0x43/0x48
         param_get_bool.cold+0xf/0x14
         param_attr_show+0x55/0x80
         module_attr_show+0x1c/0x30
         sysfs_kf_seq_show+0x93/0xc0
         seq_read_iter+0x11c/0x450
         new_sync_read+0x11b/0x1a0
         vfs_read+0xf0/0x190
         ksys_read+0x5f/0xe0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        =========================================================================
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Cc: stable@vger.kernel.org
      Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220331221359.3912754-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 02 April 2022, 21 commits
    • KVM: x86: optimize PKU branching in kvm_load_{guest|host}_xsave_state · 945024d7
      Jon Kohler committed
      kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
      part of which is managing memory protection key state. The latest
      arch.pkru is updated with a rdpkru, and if that doesn't match the base
      host_pkru (which happens about 70% of the time), we issue a __write_pkru.
      
      To improve performance, implement the following optimizations:
       1. Reorder if conditions prior to wrpkru in both
          kvm_load_{guest|host}_xsave_state.
      
          Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
          checked first, which when instrumented in our environment appeared
          to be always true and less overall work than kvm_read_cr4_bits.
      
          For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
          one position. When instrumented, I saw this be true roughly ~70% of
          the time vs the other conditions, which were almost always true.
          With this change, we avoid the 3rd condition check ~30% of the time.
      
       2. Wrap the PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS,
          so that if the user compiles out this feature, these branches are
          not present at all (see the sketch below).
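
      A sketch of the reordered guest-load path (simplified; condition order
      per item 1, ifdef per item 2):

        #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
                if (static_cpu_has(X86_FEATURE_PKU) &&
                    vcpu->arch.pkru != vcpu->arch.host_pkru &&
                    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
                     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
                        write_pkru(vcpu->arch.pkru);
        #endif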
      Signed-off-by: Jon Kohler <jon@nutanix.com>
      Message-Id: <20220324004439.6709-1-jon@nutanix.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: allow per cpu apicv inhibit reasons · d5fa597e
      Maxim Levitsky committed
      Add optional callback .vcpu_get_apicv_inhibit_reasons returning
      extra inhibit reasons that prevent APICv from working on this vCPU.
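
      The hook is a nullable member of kvm_x86_ops; its shape, per the
      description:

        struct kvm_x86_ops {
                /* ... */
                /* Optional: extra per-vCPU APICv inhibit reasons; may be NULL. */
                unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
        };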
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322174050.241850-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Don't snapshot "max" TSC if host TSC is constant · 741e511b
      Sean Christopherson committed
      Don't snapshot tsc_khz into max_tsc_khz during KVM initialization if the
      host TSC is constant, in which case the actual TSC frequency will never
      change and thus capturing the "max" TSC during initialization is
      unnecessary, KVM can simply use tsc_khz during VM creation.
      
      On CPUs with constant TSC, but not a hardware-specified TSC frequency,
      snapshotting max_tsc_khz and using that to set a VM's default TSC
      frequency can lead to KVM thinking it needs to manually scale the guest's
      TSC if refining the TSC completes after KVM snapshots tsc_khz.  The
      actual frequency never changes, only the kernel's calculation of what
      that frequency is changes.  On systems without hardware TSC scaling, this
      either puts KVM into "always catchup" mode (extremely inefficient), or
      prevents creating VMs altogether.
      
      Ideally, KVM would not be able to race with TSC refinement, or would have
      a hook into tsc_refine_calibration_work() to get an alert when refinement
      is complete.  Avoiding the race altogether isn't practical as refinement
      takes a relative eternity; it's deliberately put on a work queue outside
      of the normal boot sequence to avoid unnecessarily delaying boot.
      
      Adding a hook is doable, but somewhat gross due to KVM's ability to be
      built as a module.  And if the TSC is constant, which is likely the case
      for every VMX/SVM-capable CPU produced in the last decade, the race can
      be hit if and only if userspace is able to create a VM before TSC
      refinement completes; refinement is slow, but not that slow.
      
      For now, punt on a proper fix, as not taking a snapshot can help some
      use cases and not taking a snapshot is arguably correct irrespective of
      the race with refinement.
      
      [ dwmw2: Rebase on top of KVM-wide default_tsc_khz to ensure that all
               vCPUs get the same frequency even if we hit the race. ]
      
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Anton Romanov <romanton@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20220225145304.36166-3-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl. · ffbb61d0
      David Woodhouse committed
      This sets the default TSC frequency for subsequently created vCPUs.
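
      Usage mirrors the existing vCPU ioctl, just against the VM fd
      (illustrative):

        /* Set a 2.0 GHz default TSC frequency for vCPUs created later. */
        if (ioctl(vm_fd, KVM_SET_TSC_KHZ, 2000000) < 0)
                perror("KVM_SET_TSC_KHZ");

        long khz = ioctl(vm_fd, KVM_GET_TSC_KHZ, 0);    /* read it back */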
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20220225145304.36166-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND · 661a20fa
      David Woodhouse committed
      At the end of the patch series adding this batch of event channel
      acceleration features, finally add the feature bit which advertises
      them and document it all.
      
      For SCHEDOP_poll we need to wake a polling vCPU when a given port
      is triggered, even when it's masked — and we want to implement that
      in the kernel, for efficiency. So we want the kernel to know that it
      has sole ownership of event channel delivery. Thus, we allow
      userspace to make the 'promise' by setting the corresponding feature
      bit in its KVM_XEN_HVM_CONFIG call. As we implement SCHEDOP_poll
      bypass later, we will do so only if that promise has been made by
      userspace.
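
      Userspace makes the promise via the flags of its KVM_XEN_HVM_CONFIG call
      (illustrative fragment; other fields omitted):

        struct kvm_xen_hvm_config cfg = {
                /* Promise: the kernel has sole ownership of event delivery. */
                .flags = KVM_XEN_HVM_CONFIG_EVTCHN_SEND,
        };
        ioctl(vm_fd, KVM_XEN_HVM_CONFIG, &cfg);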
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-16-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID · 942c2490
      David Woodhouse committed
      In order to intercept hypercalls such as VCPUOP_set_singleshot_timer, we
      need to be aware of the Xen CPU numbering.
      
      This looks a lot like the Hyper-V handling of vpidx, for obvious reasons.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-12-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Support direct injection of event channel events · 35025735
      David Woodhouse committed
      This adds a KVM_XEN_HVM_EVTCHN_SEND ioctl which allows direct injection
      of events given an explicit { vcpu, port, priority } in precisely the
      same form that those fields are given in the IRQ routing table.
      
      Userspace is currently able to inject 2-level events purely by setting
      the bits in the shared_info and vcpu_info, but FIFO event channels are
      harder to deal with; we will need the kernel to take sole ownership of
      delivery when we support those.
      
      A patch advertising this feature with a new bit in the KVM_CAP_XEN_HVM
      ioctl will be added in a subsequent patch.
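
      Injection takes the same triple used in the IRQ routing table
      (illustrative values):

        struct kvm_irq_routing_xen_evtchn evt = {
                .port = 3,      /* guest event channel port */
                .vcpu = 0,      /* target vCPU */
                .priority = KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL,
        };
        ioctl(vm_fd, KVM_XEN_HVM_EVTCHN_SEND, &evt);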
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-9-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_time_info · 69d413cf
      David Woodhouse committed
      This switches the final pvclock to kvm_setup_pvclock_pfncache() and now
      the old kvm_setup_pvclock_page() can be removed.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-7-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_info · 7caf9571
      David Woodhouse committed
      Currently, the fast path of kvm_xen_set_evtchn_fast() doesn't set the
      index bits in the target vCPU's evtchn_pending_sel, because it only has
      a userspace virtual address with which to do so. It just sets them in
      the kernel, and kvm_xen_has_interrupt() then completes the delivery to
      the actual vcpu_info structure when the vCPU runs.
      
      Using a gfn_to_pfn_cache allows kvm_xen_set_evtchn_fast() to do the full
      delivery in the common case.
      
      Clean up the fallback case too, by moving the deferred delivery out into
      a separate kvm_xen_inject_pending_events() function which isn't ever
      called in atomic contexts as __kvm_xen_has_interrupt() is.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-6-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use gfn_to_pfn_cache for pv_time · 916d3608
      David Woodhouse committed
      Add a new kvm_setup_guest_pvclock() which parallels the existing
      kvm_setup_pvclock_page(). The latter will be removed once we convert
      all users to the gfn_to_pfn_cache version.
      
      Using the new cache, we can potentially let kvm_set_guest_paused() set
      the PVCLOCK_GUEST_STOPPED bit directly rather than having to delegate
      to the vCPU via KVM_REQ_CLOCK_UPDATE. But not yet.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-5-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Use gfn_to_pfn_cache for runstate area · a795cd43
      David Woodhouse committed
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-4-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Allow userspace to opt out of hypercall patching · f1a9761f
      Oliver Upton committed
      KVM handles the VMCALL/VMMCALL instructions very strangely. Even though
      both of these instructions really should #UD when executed on the wrong
      vendor's hardware (i.e. VMCALL on SVM, VMMCALL on VMX), KVM replaces the
      guest's instruction with the appropriate instruction for the vendor.
      Nonetheless, older guest kernels without commit c1118b36 ("x86: kvm:
      use alternatives for VMCALL vs. VMMCALL if kernel text is read-only")
      do not patch in the appropriate instruction using alternatives, likely
      motivating KVM's intervention.
      
      Add a quirk allowing userspace to opt out of hypercall patching. If the
      quirk is disabled, KVM synthesizes a #UD in the guest.
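
      With the quirk disabled, the fixup path bails out early; a hedged sketch
      of the control flow (not the literal diff):

        /* In the hypercall-patching path: honor the disabled quirk. */
        if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_FIX_HYPERCALL_INSN))
                return emulate_ud(ctxt);        /* synthesize #UD, don't patch */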
      Signed-off-by: Oliver Upton <oupton@google.com>
      Message-Id: <20220316005538.2282772-2-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: SVM: fix tsc scaling when the host doesn't support it · 88099313
      Maxim Levitsky committed
      It was decided that when TSC scaling is not supported,
      the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0'
      value.
      
      However, in this case kvm_max_tsc_scaling_ratio is not set,
      which breaks various assumptions.
      
      Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of
      host support.  For consistency, do the same for VMX.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322172449.235575-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr · ac8d6cad
      Hou Wenlong committed
      If MSR access is rejected by MSR filtering,
      kvm_set_msr()/kvm_get_msr() return KVM_MSR_RET_FILTERED,
      and the return value is only handled well for rdmsr/wrmsr.
      However, some instruction emulation and state transitions also
      use kvm_set_msr()/kvm_get_msr() for MSR accesses, and may trigger
      unexpected results if the access is rejected.  E.g. RDPID
      emulation would inject a #UD even though RDPID wouldn't cause an
      exit when RDPID is supported in hardware and ENABLE_RDTSCP is set.
      It would also cause failures when loading MSRs at nested entry/exit.
      Since MSR filtering is based on the MSR bitmap, it is better to only
      do MSR filtering for rdmsr/wrmsr.
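
      A sketch of the resulting split, assuming kvm_msr_allowed() as the filter
      check (filtered wrappers for the rdmsr/wrmsr paths, the plain accessors
      elsewhere):

        int kvm_get_msr_with_filter(struct kvm_vcpu *vcpu, u32 index, u64 *data)
        {
                if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ))
                        return KVM_MSR_RET_FILTERED;
                return kvm_get_msr(vcpu, index, data);  /* unfiltered core */
        }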
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/emulator: Emulate RDPID only if it is enabled in guest · a836839c
      Hou Wenlong committed
      When RDTSCP is supported but RDPID is not supported in the host,
      RDPID emulation is available. However, __kvm_get_msr() only
      fails when RDTSCP and RDPID are both disabled in the guest, so
      the emulator wouldn't inject a #UD when RDPID is disabled but
      RDTSCP is enabled in the guest.
      
      Fixes: fb6d4d34 ("KVM: x86: emulate RDPID")
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Trace all APICv inhibit changes and capture overall status · 4f4c4a3e
      Sean Christopherson committed
      Trace all APICv inhibit changes instead of just those that result in
      APICv being (un)inhibited, and log the current state.  Debugging why
      APICv isn't working is frustrating as it's hard to see why APICv is still
      inhibited, and logging only the first inhibition means unnecessary onion
      peeling.
      
      Opportunistically drop the export of the tracepoint, it is not and should
      not be used by vendor code due to the need to serialize toggling via
      apicv_update_lock.
      
      Note, using the common flow means kvm_apicv_init() switched from atomic
      to non-atomic bitwise operations.  The VM is unreachable at init, so
      non-atomic is perfectly ok.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add wrappers for setting/clearing APICv inhibits · 320af55a
      Sean Christopherson committed
      Add set/clear wrappers for toggling APICv inhibits to make the call sites
      more readable, and opportunistically rename the inner helpers to align
      with the new wrappers and to make them more readable as well.  Invert the
      flag from "activate" to "set"; activate is painfully ambiguous as it's
      not obvious if the inhibit is being activated, or if APICv is being
      activated, in which case the inhibit is being deactivated.
      
      For the functions that take @set, swap the order of the inhibit reason
      and @set so that the call sites are visually similar to those that bounce
      through the wrapper.
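
      The wrappers are thin and make intent explicit at the call site (per the
      description):

        static inline void kvm_set_apicv_inhibit(struct kvm *kvm,
                                                 enum kvm_apicv_inhibit reason)
        {
                kvm_set_or_clear_apicv_inhibit(kvm, reason, true);
        }

        static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
                                                   enum kvm_apicv_inhibit reason)
        {
                kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
        }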
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Make APICv inhibit reasons an enum and cleanup naming · 7491b7b2
      Sean Christopherson committed
      Use an enum for the APICv inhibit reasons, there is no meaning behind
      their values and they most definitely are not "unsigned longs".  Rename
      the various params to "reason" for consistency and clarity (inhibit may
      be confused as a command, i.e. inhibit APICv, instead of the reason that
      is getting toggled/checked).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311043517.17027-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Handle implicit supervisor access with SMAP · 4f4aa80e
      Lai Jiangshan committed
      There are two kinds of implicit supervisor access:
      	implicit supervisor access when CPL = 3
      	implicit supervisor access when CPL < 3
      
      The current permission_fault() handles only the first kind for SMAP.
      
      But if the access is implicit when SMAP is on, data may not be read
      from nor written to any user-mode address regardless of the current CPL.
      
      So the second kind should also be supported.
      
      The first kind can be detected via CPL and access mode: if it is a
      supervisor access and CPL = 3, it must be an implicit supervisor access.
      
      But it is not possible to detect the second kind without extra
      information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS
      into @access. This extra information also works for the first kind, so
      the logic is changed to use this information for both cases.
      
      The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48,
      which is in the most significant 16 bits of a u64 and less likely to be
      forced to change should future hardware use those bits.
      
      This patch removes the call to ->get_cpl(), as the access mode is
      determined by @access.  Not only does this avoid a function call, it also
      removes confusion when the permission is checked for nested TDP.  Nested
      TDP shouldn't have SMAP checking, nor should the L2's CPL have any bearing
      on it.  The original code works just because the walk is always a user
      walk for NPT, and the SMAP fault bit is not set for EPT in
      update_permission_bitmask.
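
      In terms of layout, the artificial bit sits above every hardware-defined
      error-code bit (name as quoted above):

        /* KVM-defined, never set by hardware: bit 48, in the upper 16 bits
         * of the u64 access/error-code space.
         */
        #define PFERR_EXPLICIT_ACCESS   BIT_ULL(48)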
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Change the type of access u32 to u64 · 5b22bbe7
      Lai Jiangshan committed
      Change the type of access u32 to u64 for FNAME(walk_addr) and
      ->gva_to_gpa().
      
      The kinds of accesses are usually combinations of UWX, and VMX/SVM's
      nested paging adds a new factor of access: is it an access for a guest
      page table or for a final guest physical address.
      
      And SMAP handling relies on one more factor for supervisor accesses:
      whether the access is explicit or implicit.
      
      So it is better for @access in FNAME(walk_addr) and ->gva_to_gpa() to
      include all of this information for the walk.
      
      Although @access(u32) has enough bits to encode all the kinds, this
      patch extends it to u64:
      	o Extra bits will be in the higher 32 bits, so that we can
      	  easily obtain the traditional access mode (UWX) by converting
      	  it to u32.
      	o Reuse the value for the access kind defined by SVM's nested
      	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
      	  @error_code in kvm_handle_page_fault().
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: propagate alloc_workqueue failure · a1a39128
      Paolo Bonzini committed
      If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
      to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
      kvm_arch_init_vm also has to undo all the initialization, so
      group all the MMU initialization code at the beginning and
      handle cleaning up of kvm_page_track_init.
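
      The shape of the propagation (illustrative; the patch's real point is the
      reordering and the unwinding of kvm_page_track_init):

        int kvm_mmu_init_vm(struct kvm *kvm)
        {
                kvm->arch.tdp_mmu_zap_wq =
                        alloc_workqueue("kvm", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
                if (!kvm->arch.tdp_mmu_zap_wq)
                        return -ENOMEM; /* kvm_arch_init_vm must unwind */

                /* ... remaining MMU initialization ... */
                return 0;
        }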
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 21 March 2022, 2 commits
    • KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2 · 6d849191
      Oliver Upton committed
      KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not
      advertise the set of quirks which may be disabled to userspace, so it is
      impossible to predict the behavior of KVM. Worse yet,
      KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning
      it fails to reject attempts to set invalid quirk bits.
      
      The only valid workaround for the quirky quirks API is to add a new CAP.
      Actually advertise the set of quirks that can be disabled to userspace
      so it can predict KVM's behavior. Reject values for cap->args[0] that
      contain invalid bits.
      
      Finally, add documentation for the new capability and describe the
      existing quirks.
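
      Userspace can now query the supported set before disabling anything
      (illustrative):

        /* Returns the mask of quirks that may be disabled. */
        int supported = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DISABLE_QUIRKS2);

        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_DISABLE_QUIRKS2,
                .args[0] = supported & KVM_X86_QUIRK_LAPIC_MMIO_HOLE,
        };
        ioctl(vm_fd, KVM_ENABLE_CAP, &cap);     /* invalid bits are rejected */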
      Signed-off-by: Oliver Upton <oupton@google.com>
      Message-Id: <20220301060351.442881-5-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Require const tsc for RT · 5e17b2ee
      Thomas Gleixner committed
      A non-constant TSC is a nightmare on bare metal already, but with
      virtualization it becomes a complete disaster because the workarounds
      are horrible latency-wise. That's also a prerequisite for running RT in
      a guest on top of an RT host.
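
      The enforcement amounts to refusing to initialize on RT without a
      constant TSC; a sketch of the check:

        /* In KVM's x86 init path: RT cannot tolerate TSC workarounds. */
        if (IS_ENABLED(CONFIG_PREEMPT_RT) && !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
                pr_err("RT requires X86_FEATURE_CONSTANT_TSC\n");
                return -EOPNOTSUPP;
        }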
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Message-Id: <Yh5eJSG19S2sjZfy@linutronix.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 08 March 2022, 1 commit