1. 07 1月, 2022 11 次提交
    • E
      KVM: x86: Update vPMCs when retiring branch instructions · 018d70ff
      Eric Hankland 提交于
      When KVM retires a guest branch instruction through emulation,
      increment any vPMCs that are configured to monitor "branch
      instructions retired," and update the sample period of those counters
      so that they will overflow at the right time.
      Signed-off-by: NEric Hankland <ehankland@google.com>
      [jmattson:
        - Split the code to increment "branch instructions retired" into a
          separate commit.
        - Moved/consolidated the calls to kvm_pmu_trigger_event() in the
          emulation of VMLAUNCH/VMRESUME to accommodate the evolution of
          that code.
      ]
      Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Message-Id: <20211130074221.93635-7-likexu@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      018d70ff
    • E
      KVM: x86: Update vPMCs when retiring instructions · 9cd803d4
      Eric Hankland 提交于
      When KVM retires a guest instruction through emulation, increment any
      vPMCs that are configured to monitor "instructions retired," and
      update the sample period of those counters so that they will overflow
      at the right time.
      Signed-off-by: NEric Hankland <ehankland@google.com>
      [jmattson:
        - Split the code to increment "branch instructions retired" into a
          separate commit.
        - Added 'static' to kvm_pmu_incr_counter() definition.
        - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
          PERF_EVENT_STATE_ACTIVE.
      ]
      Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
      Signed-off-by: NJim Mattson <jmattson@google.com>
      [likexu:
        - Drop checks for pmc->perf_event or event state or event type
        - Increase a counter once its umask bits and the first 8 select bits are matched
        - Rewrite kvm_pmu_incr_counter() with a less invasive approach to the host perf;
        - Rename kvm_pmu_record_event to kvm_pmu_trigger_event;
        - Add counter enable and CPL check for kvm_pmu_trigger_event();
      ]
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NLike Xu <likexu@tencent.com>
      Message-Id: <20211130074221.93635-6-likexu@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9cd803d4
    • L
      KVM: x86/pmu: Add pmc->intr to refactor kvm_perf_overflow{_intr}() · 40ccb96d
      Like Xu 提交于
      Depending on whether intr should be triggered or not, KVM registers
      two different event overflow callbacks in the perf_event context.
      
      The code skeleton of these two functions is very similar, so
      the pmc->intr can be stored into pmc from pmc_reprogram_counter()
      which provides smaller instructions footprint against the
      u-architecture branch predictor.
      
      The __kvm_perf_overflow() can be called in non-nmi contexts
      and a flag is needed to distinguish the caller context and thus
      avoid a check on kvm_is_in_guest(), otherwise we might get
      warnings from suspicious RCU or check_preemption_disabled().
      Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NLike Xu <likexu@tencent.com>
      Message-Id: <20211130074221.93635-5-likexu@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      40ccb96d
    • L
      KVM: x86/pmu: Reuse pmc_perf_hw_id() and drop find_fixed_event() · 6ed1298e
      Like Xu 提交于
      Since we set the same semantic event value for the fixed counter in
      pmc->eventsel, returning the perf_hw_id for the fixed counter via
      find_fixed_event() can be painlessly replaced by pmc_perf_hw_id()
      with the help of pmc_is_fixed() check.
      Signed-off-by: NLike Xu <likexu@tencent.com>
      Message-Id: <20211130074221.93635-4-likexu@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6ed1298e
    • L
      KVM: x86/pmu: Refactoring find_arch_event() to pmc_perf_hw_id() · 7c174f30
      Like Xu 提交于
      The find_arch_event() returns a "unsigned int" value,
      which is used by the pmc_reprogram_counter() to
      program a PERF_TYPE_HARDWARE type perf_event.
      
      The returned value is actually the kernel defined generic
      perf_hw_id, let's rename it to pmc_perf_hw_id() with simpler
      incoming parameters for better self-explanation.
      Signed-off-by: NLike Xu <likexu@tencent.com>
      Message-Id: <20211130074221.93635-3-likexu@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7c174f30
    • L
      KVM: x86/pmu: Setup pmc->eventsel for fixed PMCs · 76187563
      Like Xu 提交于
      The current pmc->eventsel for fixed counter is underutilised. The
      pmc->eventsel can be setup for all known available fixed counters
      since we have mapping between fixed pmc index and
      the intel_arch_events array.
      
      Either gp or fixed counter, it will simplify the later checks for
      consistency between eventsel and perf_hw_id.
      Signed-off-by: NLike Xu <likexu@tencent.com>
      Message-Id: <20211130074221.93635-2-likexu@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      76187563
    • P
      KVM: x86: avoid out of bounds indices for fixed performance counters · 006a0f06
      Paolo Bonzini 提交于
      Because IceLake has 4 fixed performance counters but KVM only
      supports 3, it is possible for reprogram_fixed_counters to pass
      to reprogram_fixed_counter an index that is out of bounds for the
      fixed_pmc_events array.
      
      Ultimately intel_find_fixed_event, which is the only place that uses
      fixed_pmc_events, handles this correctly because it checks against the
      size of fixed_pmc_events anyway.  Every other place operates on the
      fixed_counters[] array which is sized according to INTEL_PMC_MAX_FIXED.
      However, it is cleaner if the unsupported performance counters are culled
      early on in reprogram_fixed_counters.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      006a0f06
    • L
      KVM: VMX: Mark VCPU_EXREG_CR3 dirty when !CR0_PG -> CR0_PG if EPT + !URG · 5b61178c
      Lai Jiangshan 提交于
      When !CR0_PG -> CR0_PG, vcpu->arch.cr3 becomes active, but GUEST_CR3 is
      still vmx->ept_identity_map_addr if EPT + !URG.  So VCPU_EXREG_CR3 is
      considered to be dirty and GUEST_CR3 needs to be updated in this case.
      Reported-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Suggested-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20211216021938.11752-4-jiangshanlai@gmail.com>
      Fixes: c62c7bd4 ("KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty")
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5b61178c
    • L
      KVM: x86/mmu: Reconstruct shadow page root if the guest PDPTEs is changed · 6b123c3a
      Lai Jiangshan 提交于
      For shadow paging, the page table needs to be reconstructed before the
      coming VMENTER if the guest PDPTEs is changed.
      
      But not all paths that call load_pdptrs() will cause the page tables to be
      reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots()
      are used to launch later reconstruction.
      
      The commit d81135a5("KVM: x86: do not reset mmu if CR0.CD and
      CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
      when changing CR0.CD and CR0.NW.
      
      The commit 21823fbd("KVM: x86: Invalidate all PGDs for the current
      PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
      load_pdptrs() when rewriting the CR3 with the same value.
      
      The commit a91a7c70("KVM: X86: Don't reset mmu context when
      toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
      load_pdptrs() when changing CR4.PGE.
      
      Guests like linux would keep the PDPTEs unchanged for every instance of
      pagetable, so this missing reconstruction has no problem for linux
      guests.
      
      Fixes: d81135a5("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
      Fixes: 21823fbd("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
      Fixes: a91a7c70("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
      Suggested-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20211216021938.11752-3-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6b123c3a
    • L
      KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs() · a9f2705e
      Lai Jiangshan 提交于
      The host CR3 in the vcpu thread can only be changed when scheduling,
      so commit 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
      changed vmx.c to only save it in vmx_prepare_switch_to_guest().
      
      However, it also has to be synced in vmx_sync_vmcs_host_state() when switching VMCS.
      vmx_set_host_fs_gs() is called in both places, so rename it to
      vmx_set_vmcs_host_state() and make it update HOST_CR3.
      
      Fixes: 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
      Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20211216021938.11752-2-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a9f2705e
    • P
      Revert "KVM: X86: Update mmu->pdptrs only when it is changed" · 46cbc040
      Paolo Bonzini 提交于
      This reverts commit 24cd19a2.
      Sean Christopherson reports:
      
      "Commit 24cd19a2 ('KVM: X86: Update mmu->pdptrs only when it is
      changed') breaks nested VMs with EPT in L0 and PAE shadow paging in L2.
      Reproducing is trivial, just disable EPT in L1 and run a VM.  I haven't
      investigating how it breaks things."
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      46cbc040
  2. 22 12月, 2021 1 次提交
    • S
      KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU · fdba608f
      Sean Christopherson 提交于
      Drop a check that guards triggering a posted interrupt on the currently
      running vCPU, and more importantly guards waking the target vCPU if
      triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
      If a vIRQ is delivered from asynchronous context, the target vCPU can be
      the currently running vCPU and can also be blocking, in which case
      skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to
      be a wake event for the vCPU.
      
      The "do nothing" logic when "vcpu == running_vcpu" mostly works only
      because the majority of calls to ->deliver_posted_interrupt(), especially
      when using posted interrupts, come from synchronous KVM context.  But if
      a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ
      and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to
      use posted interrupts, wake events from the device will be delivered to
      KVM from IRQ context, e.g.
      
        vfio_msihandler()
        |
        |-> eventfd_signal()
            |
            |-> ...
                |
                |->  irqfd_wakeup()
                     |
                     |->kvm_arch_set_irq_inatomic()
                        |
                        |-> kvm_irq_delivery_to_apic_fast()
                            |
                            |-> kvm_apic_set_irq()
      
      This also aligns the non-nested and nested usage of triggering posted
      interrupts, and will allow for additional cleanups.
      
      Fixes: 379a3c8e ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath")
      Cc: stable@vger.kernel.org
      Reported-by: NLongpeng (Mike) <longpeng2@huawei.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-18-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fdba608f
  3. 20 12月, 2021 7 次提交
    • S
      KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required · cd0e615c
      Sean Christopherson 提交于
      Synthesize a triple fault if L2 guest state is invalid at the time of
      VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs
      guest state via ioctls(), e.g. KVM_SET_SREGS.  KVM should never emulate
      invalid guest state, since from L1's perspective, it's architecturally
      impossible for L2 to have invalid state while L2 is running in hardware.
      E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit
      or #GP.
      
      Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that
      can trigger this scenario, as nested VM-Enter correctly rejects any
      attempt to enter L2 with invalid state.
      
      RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and
      behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits
      via RSM results in shutdown, i.e. there is precedent for KVM's behavior.
      Following AMD's SMRAM layout is important as AMD's layout saves/restores
      the descriptor cache information, including CS.RPL and SS.RPL, and also
      defines all the fields relevant to invalid guest state as read-only, i.e.
      so long as the vCPU had valid state before the SMI, which is guaranteed
      for L2, RSM will generate valid state unless SMRAM was modified.  Intel's
      layout saves/restores only the selector, which means that scenarios where
      the selector and cached RPL don't match, e.g. conforming code segments,
      would yield invalid guest state.  Intel CPUs fudge around this issued by
      stuffing SS.RPL and CS.RPL on RSM.  Per Intel's SDM on the "Default
      Treatment of RSM", paraphrasing for brevity:
      
        IF internal storage indicates that the [CPU was post-VMXON]
        THEN
           enter VMX operation (root or non-root);
           restore VMX-critical state as defined in Section 34.14.1;
           set to their fixed values any bits in CR0 and CR4 whose values must
           be fixed in VMX operation [unless coming from an unrestricted guest];
           IF RFLAGS.VM = 0 AND (in VMX root operation OR the
              “unrestricted guest” VM-execution control is 0)
           THEN
             CS.RPL := SS.DPL;
             SS.RPL := SS.DPL;
           FI;
           restore current VMCS pointer;
        FI;
      
      Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM
      will sythesize TRIPLE_FAULT in this scenario.  KVM's behavior is allowed
      as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the
      only way for CR0 and/or CR4 to have illegal values is if they were
      modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map"
      section states "modifying these registers will result in unpredictable
      behavior".
      
      KVM's ioctl() behavior is less straightforward.  Because KVM allows
      ioctls() to be executed in any order, rejecting an ioctl() if it would
      result in invalid L2 guest state is not an option as KVM cannot know if
      a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or
      drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE.  Ideally, KVM would
      reject KVM_RUN if L2 contained invalid guest state, but that carries the
      risk of a false positive, e.g. if RSM loaded invalid guest state and KVM
      exited to userspace.  Setting a flag/request to detect such a scenario is
      undesirable because (a) it's extremely unlikely to add value to KVM as a
      whole, and (b) KVM would need to consider ioctl() interactions with such
      a flag, e.g. if userspace migrated the vCPU while the flag were set.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211207193006.120997-3-seanjc@google.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cd0e615c
    • S
      KVM: VMX: Always clear vmx->fail on emulation_required · a80dfc02
      Sean Christopherson 提交于
      Revert a relatively recent change that set vmx->fail if the vCPU is in L2
      and emulation_required is true, as that behavior is completely bogus.
      Setting vmx->fail and synthesizing a VM-Exit is contradictory and wrong:
      
        (a) it's impossible to have both a VM-Fail and VM-Exit
        (b) vmcs.EXIT_REASON is not modified on VM-Fail
        (c) emulation_required refers to guest state and guest state checks are
            always VM-Exits, not VM-Fails.
      
      For KVM specifically, emulation_required is handled before nested exits
      in __vmx_handle_exit(), thus setting vmx->fail has no immediate effect,
      i.e. KVM calls into handle_invalid_guest_state() and vmx->fail is ignored.
      Setting vmx->fail can ultimately result in a WARN in nested_vmx_vmexit()
      firing when tearing down the VM as KVM never expects vmx->fail to be set
      when L2 is active, KVM always reflects those errors into L1.
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 21158 at arch/x86/kvm/vmx/nested.c:4548
                                      nested_vmx_vmexit+0x16bd/0x17e0
                                      arch/x86/kvm/vmx/nested.c:4547
        Modules linked in:
        CPU: 0 PID: 21158 Comm: syz-executor.1 Not tainted 5.16.0-rc3-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547
        Code: <0f> 0b e9 2e f8 ff ff e8 57 b3 5d 00 0f 0b e9 00 f1 ff ff 89 e9 80
        Call Trace:
         vmx_leave_nested arch/x86/kvm/vmx/nested.c:6220 [inline]
         nested_vmx_free_vcpu+0x83/0xc0 arch/x86/kvm/vmx/nested.c:330
         vmx_free_vcpu+0x11f/0x2a0 arch/x86/kvm/vmx/vmx.c:6799
         kvm_arch_vcpu_destroy+0x6b/0x240 arch/x86/kvm/x86.c:10989
         kvm_vcpu_destroy+0x29/0x90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:441
         kvm_free_vcpus arch/x86/kvm/x86.c:11426 [inline]
         kvm_arch_destroy_vm+0x3ef/0x6b0 arch/x86/kvm/x86.c:11545
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1189 [inline]
         kvm_put_kvm+0x751/0xe40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1220
         kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3489
         __fput+0x3fc/0x870 fs/file_table.c:280
         task_work_run+0x146/0x1c0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0x705/0x24f0 kernel/exit.c:832
         do_group_exit+0x168/0x2d0 kernel/exit.c:929
         get_signal+0x1740/0x2120 kernel/signal.c:2852
         arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
         do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c8607e4a ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry")
      Reported-by: syzbot+f1d2136db9c80d4733e8@syzkaller.appspotmail.com
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211207193006.120997-2-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a80dfc02
    • M
      KVM: x86: Always set kvm_run->if_flag · c5063551
      Marc Orr 提交于
      The kvm_run struct's if_flag is a part of the userspace/kernel API. The
      SEV-ES patches failed to set this flag because it's no longer needed by
      QEMU (according to the comment in the source code). However, other
      hypervisors may make use of this flag. Therefore, set the flag for
      guests with encrypted registers (i.e., with guest_state_protected set).
      
      Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
      Signed-off-by: NMarc Orr <marcorr@google.com>
      Message-Id: <20211209155257.128747-1-marcorr@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      c5063551
    • S
      KVM: x86/mmu: Don't advance iterator after restart due to yielding · 3a0f64de
      Sean Christopherson 提交于
      After dropping mmu_lock in the TDP MMU, restart the iterator during
      tdp_iter_next() and do not advance the iterator.  Advancing the iterator
      results in skipping the top-level SPTE and all its children, which is
      fatal if any of the skipped SPTEs were not visited before yielding.
      
      When zapping all SPTEs, i.e. when min_level == root_level, restarting the
      iter and then invoking tdp_iter_next() is always fatal if the current gfn
      has as a valid SPTE, as advancing the iterator results in try_step_side()
      skipping the current gfn, which wasn't visited before yielding.
      
      Sprinkle WARNs on iter->yielded being true in various helpers that are
      often used in conjunction with yielding, and tag the helper with
      __must_check to reduce the probabily of improper usage.
      
      Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
      SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
      the shadow page will be leaked and KVM will WARN accordingly.
      
        WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2a0 [kvm]
         kvm_vcpu_release+0x34/0x60 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5c/0x90
         do_exit+0x364/0xa10
         ? futex_unqueue+0x38/0x60
         do_group_exit+0x33/0xa0
         get_signal+0x155/0x850
         arch_do_signal_or_restart+0xed/0x750
         exit_to_user_mode_prepare+0xc5/0x120
         syscall_exit_to_user_mode+0x1d/0x40
         do_syscall_64+0x48/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
      kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
      marking a struct page as dirty/accessed after it has been put back on the
      free list.  This directly triggers a WARN due to encountering a page with
      page_count() == 0, but it can also lead to data corruption and additional
      errors in the kernel.
      
        WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
        RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
        Call Trace:
         <TASK>
         kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
         __handle_changed_spte+0x92e/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         zap_gfn_range+0x549/0x620 [kvm]
         kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
         mmu_free_root_page+0x219/0x2c0 [kvm]
         kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
         kvm_mmu_unload+0x1c/0xa0 [kvm]
         kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
         kvm_put_kvm+0x3b1/0x8b0 [kvm]
         kvm_vcpu_release+0x4e/0x70 [kvm]
         __fput+0x1f7/0x8c0
         task_work_run+0xf8/0x1a0
         do_exit+0x97b/0x2230
         do_group_exit+0xda/0x2a0
         get_signal+0x3be/0x1e50
         arch_do_signal_or_restart+0x244/0x17f0
         exit_to_user_mode_prepare+0xcb/0x120
         syscall_exit_to_user_mode+0x1d/0x40
         do_syscall_64+0x4d/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Note, the underlying bug existed even before commit 1af4a960 ("KVM:
      x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
      tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
      incorrectly advance past a top-level entry when yielding on a lower-level
      entry.  But with respect to leaking shadow pages, the bug was introduced
      by yielding before processing the current gfn.
      
      Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
      callers could jump to their "retry" label.  The downside of that approach
      is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
      in the loop, and there's no easy way to enfornce that requirement.
      
      Ideally, KVM would handling the cond_resched() fully within the iterator
      macro (the code is actually quite clean) and avoid this entire class of
      bugs, but that is extremely difficult do while also supporting yielding
      after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
      SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
      bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
      may block operations on the SPTE for a significant amount of time.
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 1af4a960 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
      Reported-by: NIgnat Korchagin <ignat@cloudflare.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211214033528.123268-1-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3a0f64de
    • W
      KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all · 9fb12fe5
      Wei Wang 提交于
      The fixed counter 3 is used for the Topdown metrics, which hasn't been
      enabled for KVM guests. Userspace accessing to it will fail as it's not
      included in get_fixed_pmc(). This breaks KVM selftests on ICX+ machines,
      which have this counter.
      
      To reproduce it on ICX+ machines, ./state_test reports:
      ==== Test Assertion Failure ====
      lib/x86_64/processor.c:1078: r == nmsrs
      pid=4564 tid=4564 - Argument list too long
      1  0x000000000040b1b9: vcpu_save_state at processor.c:1077
      2  0x0000000000402478: main at state_test.c:209 (discriminator 6)
      3  0x00007fbe21ed5f92: ?? ??:0
      4  0x000000000040264d: _start at ??:?
       Unexpected result from KVM_GET_MSRS, r: 17 (failed MSR was 0x30c)
      
      With this patch, it works well.
      Signed-off-by: NWei Wang <wei.w.wang@intel.com>
      Message-Id: <20211217124934.32893-1-wei.w.wang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9fb12fe5
    • S
      KVM: x86: Retry page fault if MMU reload is pending and root has no sp · 18c841e1
      Sean Christopherson 提交于
      Play nice with a NULL shadow page when checking for an obsolete root in
      the page fault handler by flagging the page fault as stale if there's no
      shadow page associated with the root and KVM_REQ_MMU_RELOAD is pending.
      Invalidating memslots, which is the only case where _all_ roots need to
      be reloaded, requests all vCPUs to reload their MMUs while holding
      mmu_lock for lock.
      
      The "special" roots, e.g. pae_root when KVM uses PAE paging, are not
      backed by a shadow page.  Running with TDP disabled or with nested NPT
      explodes spectaculary due to dereferencing a NULL shadow page pointer.
      
      Skip the KVM_REQ_MMU_RELOAD check if there is a valid shadow page for the
      root.  Zapping shadow pages in response to guest activity, e.g. when the
      guest frees a PGD, can trigger KVM_REQ_MMU_RELOAD even if the current
      vCPU isn't using the affected root.  I.e. KVM_REQ_MMU_RELOAD can be seen
      with a completely valid root shadow page.  This is a bit of a moot point
      as KVM currently unloads all roots on KVM_REQ_MMU_RELOAD, but that will
      be cleaned up in the future.
      
      Fixes: a955cad8 ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211209060552.2956723-2-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      18c841e1
    • V
      KVM: x86: Drop guest CPUID check for host initiated writes to MSR_IA32_PERF_CAPABILITIES · 1aa2abb3
      Vitaly Kuznetsov 提交于
      The ability to write to MSR_IA32_PERF_CAPABILITIES from the host should
      not depend on guest visible CPUID entries, even if just to allow
      creating/restoring guest MSRs and CPUIDs in any sequence.
      
      Fixes: 27461da3 ("KVM: x86/pmu: Support full width counting")
      Suggested-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211216165213.338923-3-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1aa2abb3
  4. 10 12月, 2021 5 次提交
  5. 09 12月, 2021 3 次提交
  6. 08 12月, 2021 13 次提交
    • V
      KVM: nVMX: Implement Enlightened MSR Bitmap feature · 502d2bf5
      Vitaly Kuznetsov 提交于
      Updating MSR bitmap for L2 is not cheap and rearly needed. TLFS for Hyper-V
      offers 'Enlightened MSR Bitmap' feature which allows L1 hypervisor to
      inform L0 when it changes MSR bitmap, this eliminates the need to examine
      L1's MSR bitmap for L2 every time when 'real' MSR bitmap for L2 gets
      constructed.
      
      Use 'vmx->nested.msr_bitmap_changed' flag to implement the feature.
      
      Note, KVM already uses 'Enlightened MSR bitmap' feature when it runs as a
      nested hypervisor on top of Hyper-V. The newly introduced feature is going
      to be used by Hyper-V guests on KVM.
      
      When the feature is enabled for Win10+WSL2, it shaves off around 700 CPU
      cycles from a nested vmexit cost (tight cpuid loop test).
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211129094704.326635-5-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      502d2bf5
    • V
      KVM: nVMX: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt · ed2a4800
      Vitaly Kuznetsov 提交于
      Introduce a flag to keep track of whether MSR bitmap for L2 needs to be
      rebuilt due to changes in MSR bitmap for L1 or switching to a different
      L2. This information will be used for Enlightened MSR Bitmap feature for
      Hyper-V guests.
      
      Note, setting msr_bitmap_changed to 'true' from set_current_vmptr() is
      not really needed for Enlightened MSR Bitmap as the feature can only
      be used in conjunction with Enlightened VMCS but let's keep tracking
      information complete, it's cheap and in the future similar PV feature can
      easily be implemented for KVM on KVM too.
      
      No functional change intended.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211129094704.326635-4-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ed2a4800
    • V
      KVM: VMX: Introduce vmx_msr_bitmap_l01_changed() helper · b84155c3
      Vitaly Kuznetsov 提交于
      In preparation to enabling 'Enlightened MSR Bitmap' feature for Hyper-V
      guests move MSR bitmap update tracking to a dedicated helper.
      
      Note: vmx_msr_bitmap_l01_changed() is called when MSR bitmap might be
      updated. KVM doesn't check if the bit we're trying to set is already set
      (or the bit it's trying to clear is already cleared). Such situations
      should not be common and a few false positives should not be a problem.
      
      No functional change intended.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211129094704.326635-3-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b84155c3
    • H
      KVM: x86: Exit to userspace if emulation prepared a completion callback · adbfb12d
      Hou Wenlong 提交于
      em_rdmsr() and em_wrmsr() return X86EMUL_IO_NEEDED if MSR accesses
      required an exit to userspace. However, x86_emulate_insn() doesn't return
      X86EMUL_*, so x86_emulate_instruction() doesn't directly act on
      X86EMUL_IO_NEEDED; instead, it looks for other signals to differentiate
      between PIO, MMIO, etc. causing RDMSR/WRMSR emulation not to
      exit to userspace now.
      
      Nevertheless, if the userspace_msr_exit_test testcase in selftests
      is changed to test RDMSR/WRMSR with a forced emulation prefix,
      the test passes.  What happens is that first userspace exit
      information is filled but the userspace exit does not happen.
      Because x86_emulate_instruction() returns 1, the guest retries
      the instruction---but this time RIP has already been adjusted
      past the forced emulation prefix, so the guest executes RDMSR/WRMSR
      and the userspace exit finally happens.
      
      Since the X86EMUL_IO_NEEDED path has provided a complete_userspace_io
      callback, x86_emulate_instruction() can just return 0 if the
      callback is not NULL. Then RDMSR/WRMSR instruction emulation will
      exit to userspace directly, without the RDMSR/WRMSR vmexit.
      
      Fixes: 1ae09954 ("KVM: x86: Allow deflecting unknown MSR accesses to user space")
      Signed-off-by: NHou Wenlong <houwenlong93@linux.alibaba.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <56f9df2ee5c05a81155e2be366c9dc1f7adc8817.1635842679.git.houwenlong93@linux.alibaba.com>
      adbfb12d
    • V
      KVM: nVMX: Don't use Enlightened MSR Bitmap for L3 · 250552b9
      Vitaly Kuznetsov 提交于
      When KVM runs as a nested hypervisor on top of Hyper-V it uses Enlightened
      VMCS and enables Enlightened MSR Bitmap feature for its L1s and L2s (which
      are actually L2s and L3s from Hyper-V's perspective). When MSR bitmap is
      updated, KVM has to reset HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP from
      clean fields to make Hyper-V aware of the change. For KVM's L1s, this is
      done in vmx_disable_intercept_for_msr()/vmx_enable_intercept_for_msr().
      MSR bitmap for L2 is build in nested_vmx_prepare_msr_bitmap() by blending
      MSR bitmap for L1 and L1's idea of MSR bitmap for L2. KVM, however, doesn't
      check if the resulting bitmap is different and never cleans
      HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP in eVMCS02. This is incorrect and
      may result in Hyper-V missing the update.
      
      The issue could've been solved by calling evmcs_touch_msr_bitmap() for
      eVMCS02 from nested_vmx_prepare_msr_bitmap() unconditionally but doing so
      would not give any performance benefits (compared to not using Enlightened
      MSR Bitmap at all). 3-level nesting is also not a very common setup
      nowadays.
      
      Don't enable 'Enlightened MSR Bitmap' feature for KVM's L2s (real L3s) for
      now.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211129094704.326635-2-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      250552b9
    • H
      KVM: x86: Use different callback if msr access comes from the emulator · d2f7d498
      Hou Wenlong 提交于
      If msr access triggers an exit to userspace, the
      complete_userspace_io callback would skip instruction by vendor
      callback for kvm_skip_emulated_instruction(). However, when msr
      access comes from the emulator, e.g. if kvm.force_emulation_prefix
      is enabled and the guest uses rdmsr/wrmsr with kvm prefix,
      VM_EXIT_INSTRUCTION_LEN in vmcs is invalid and
      kvm_emulate_instruction() should be used to skip instruction
      instead.
      
      As Sean noted, unlike the previous case, there's no #UD if
      unrestricted guest is disabled and the guest accesses an MSR in
      Big RM. So the correct way to fix this is to attach a different
      callback when the msr access comes from the emulator.
      Suggested-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NHou Wenlong <houwenlong93@linux.alibaba.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <34208da8f51580a06e45afefac95afea0e3f96e3.1635842679.git.houwenlong93@linux.alibaba.com>
      d2f7d498
    • H
      KVM: x86: Add an emulation type to handle completion of user exits · 906fa904
      Hou Wenlong 提交于
      The next patch would use kvm_emulate_instruction() with
      EMULTYPE_SKIP in complete_userspace_io callback to fix a
      problem in msr access emulation. However, EMULTYPE_SKIP
      only updates RIP, more things like updating interruptibility
      state and injecting single-step #DBs would be done in the
      callback. Since the emulator also does those things after
      x86_emulate_insn(), add a new emulation type to pair with
      EMULTYPE_SKIP to do those things for completion of user exits
      within the emulator.
      Suggested-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NHou Wenlong <houwenlong93@linux.alibaba.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <8f8c8e268b65f31d55c2881a4b30670946ecfa0d.1635842679.git.houwenlong93@linux.alibaba.com>
      906fa904
    • S
      KVM: x86: Handle 32-bit wrap of EIP for EMULTYPE_SKIP with flat code seg · 5e854864
      Sean Christopherson 提交于
      Truncate the new EIP to a 32-bit value when handling EMULTYPE_SKIP as the
      decode phase does not truncate _eip.  Wrapping the 32-bit boundary is
      legal if and only if CS is a flat code segment, but that check is
      implicitly handled in the form of limit checks in the decode phase.
      
      Opportunstically prepare for a future fix by storing the result of any
      truncation in "eip" instead of "_eip".
      
      Fixes: 1957aa63 ("KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfig")
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <093eabb1eab2965201c9b018373baf26ff256d85.1635842679.git.houwenlong93@linux.alibaba.com>
      5e854864
    • L
      KVM: Clear pv eoi pending bit only when it is set · 51b1209c
      Li RongQing 提交于
      merge pv_eoi_get_pending and pv_eoi_clr_pending into a single
      function pv_eoi_test_and_clear_pending, which returns and clear
      the value of the pending bit.
      
      This makes it possible to clear the pending bit only if the guest
      had set it, and otherwise skip the call to pv_eoi_put_user().
      This can save up to 300 nsec on AMD EPYC processors.
      Suggested-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NLi RongQing <lirongqing@baidu.com>
      Message-Id: <1636026974-50555-2-git-send-email-lirongqing@baidu.com>
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      51b1209c
    • L
      KVM: x86: don't print when fail to read/write pv eoi memory · ce5977b1
      Li RongQing 提交于
      If guest gives MSR_KVM_PV_EOI_EN a wrong value, this printk() will
      be trigged, and kernel log is spammed with the useless message
      
      Fixes: 0d88800d ("kvm: x86: ioapic and apic debug macros cleanup")
      Reported-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NLi RongQing <lirongqing@baidu.com>
      Cc: stable@kernel.org
      Message-Id: <1636026974-50555-1-git-send-email-lirongqing@baidu.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ce5977b1
    • L
      KVM: X86: Remove mmu parameter from load_pdptrs() · 2df4a5eb
      Lai Jiangshan 提交于
      It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs,
      and nested NPT treats them like all other non-leaf page table levels
      instead of caching them.
      Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2df4a5eb
    • L
      KVM: X86: Rename gpte_is_8_bytes to has_4_byte_gpte and invert the direction · bb3b394d
      Lai Jiangshan 提交于
      This bit is very close to mean "role.quadrant is not in use", except that
      it is false also when the MMU is mapping guest physical addresses
      directly.  In that case, role.quadrant is indeed not in use, but there
      are no guest PTEs at all.
      
      Changing the name and direction of the bit removes the special case,
      since a guest with paging disabled, or not considering guest paging
      structures as is the case for two-dimensional paging, does not have
      to deal with 4-byte guest PTEs.
      Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20211124122055.64424-10-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bb3b394d
    • L
      KVM: VMX: Use ept_caps_to_lpage_level() in hardware_setup() · f8cd457f
      Lai Jiangshan 提交于
      Using ept_caps_to_lpage_level is simpler.
      Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20211124122055.64424-9-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f8cd457f