1. 28 Jan, 2022 9 commits
    • KVM: eventfd: Fix false positive RCU usage warning · 6a0c6170
      Hou Wenlong committed
      Fix the following false positive warning:
       =============================
       WARNING: suspicious RCU usage
       5.16.0-rc4+ #57 Not tainted
       -----------------------------
       arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       3 locks held by fc_vcpu 0/330:
        #0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm]
        #1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm]
        #2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm]
      
       stack backtrace:
       CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
       Call Trace:
        <TASK>
        dump_stack_lvl+0x44/0x57
        kvm_notify_acked_gsi+0x6b/0x70 [kvm]
        kvm_notify_acked_irq+0x8d/0x180 [kvm]
        kvm_ioapic_update_eoi+0x92/0x240 [kvm]
        kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm]
        handle_apic_eoi_induced+0x3d/0x60 [kvm_intel]
        vmx_handle_exit+0x19c/0x6a0 [kvm_intel]
        vcpu_enter_guest+0x66e/0x1860 [kvm]
        kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm]
        kvm_vcpu_ioctl+0x38a/0x6f0 [kvm]
        __x64_sys_ioctl+0x89/0xc0
        do_syscall_64+0x3a/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu),
      kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact,
      kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it
      a false positive warning. So use hlist_for_each_entry_srcu() instead of
      hlist_for_each_entry_rcu().
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
      Message-Id: <f98bac4f5052bad2c26df9ad50f7019e40434512.1643265976.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use · 6cbbaab6
      Vitaly Kuznetsov committed
      Hyper-V TLFS explicitly forbids VMREAD and VMWRITE instructions when
      Enlightened VMCS interface is in use:
      
      "Any VMREAD or VMWRITE instructions while an enlightened VMCS is
      active is unsupported and can result in unexpected behavior."
      
      Windows 11 + WSL2 seems to ignore this: attempts to VMREAD VMCS field
      0x4404 ("VM-exit interruption information") are observed. Failing
      these attempts with nested_vmx_failInvalid() makes such guests
      unbootable.
      
      Microsoft confirms this is a Hyper-V bug and claims that it'll get fixed
      eventually but for the time being we need a workaround. (Temporarily) allow
      VMREAD to get data from the currently loaded Enlightened VMCS.
      
      Note: VMWRITE instructions remain forbidden; it is not clear how to
      handle them properly, and hopefully they won't ever be needed.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-6-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Implement evmcs_field_offset() suitable for handle_vmread() · 892a42c1
      Vitaly Kuznetsov committed
      In preparation for allowing reads from Enlightened VMCS from
      handle_vmread(), implement evmcs_field_offset() to get the correct
      read offset. get_evmcs_offset(), which is being used by KVM-on-Hyper-V,
      is almost what's needed but a few things need to be adjusted. First,
      WARN_ON() is unacceptable for handle_vmread() as any field can (in
      theory) be supplied by the guest and not all fields are defined in
      eVMCS v1. Second, we need to handle 'holes' in eVMCS (missing fields).
      It also sounds like a good idea to WARN_ON() if such fields are ever
      accessed by KVM-on-Hyper-V.
      
      Implement dedicated evmcs_field_offset() helper.
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-5-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
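The two requirements in the message above (a missing field is a normal result for guest-supplied encodings, but a bug when the host asks for it) can be illustrated with a small lookup sketch. Everything here is hypothetical: the table contents, `demo_field_offset()` and `demo_field_offset_host()` are placeholders for illustration, not the eVMCS v1 layout or the helper added by the commit, and `assert()` merely stands in for a kernel warning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct field_map {
	uint32_t field;		/* field encoding */
	int offset;		/* byte offset in the made-up layout */
};

/* Placeholder table: encodings and offsets are invented for the example. */
static const struct field_map demo_map[] = {
	{ 0x1000, 0 },
	{ 0x1002, 8 },
	{ 0x2000, 16 },
};

/*
 * Return the offset for 'field', or -1 if the layout has no such field.
 * A miss is an ordinary result: the encoding may come from a guest.
 */
static int demo_field_offset(uint32_t field)
{
	for (size_t i = 0; i < sizeof(demo_map) / sizeof(demo_map[0]); i++) {
		if (demo_map[i].field == field)
			return demo_map[i].offset;
	}
	return -1;
}

/* Host-side callers only pass fields they know exist, so a miss is a bug. */
static int demo_field_offset_host(uint32_t field)
{
	int offset = demo_field_offset(field);

	assert(offset >= 0);
	return offset;
}
```

With this split, a guest-driven path can turn the -1 into an instruction failure, while the host-side wrapper keeps the loud check.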
    • KVM: nVMX: Rename vmcs_to_field_offset{,_table} · 2423a4c0
      Vitaly Kuznetsov committed
      vmcs_to_field_offset{,_table} may sound misleading as VMCS is an opaque
      blob which is not supposed to be accessed directly. In fact,
      vmcs_to_field_offset{,_table} are related to KVM defined VMCS12 structure.
      
      Rename vmcs_field_to_offset() to get_vmcs12_field_offset() for clarity.
      
      No functional change intended.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: eVMCS: Filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER · 7a601e2c
      Vitaly Kuznetsov committed
      Enlightened VMCS v1 doesn't have VMX_PREEMPTION_TIMER_VALUE field,
      PIN_BASED_VMX_PREEMPTION_TIMER is also filtered out already so it makes
      sense to filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER too.
      
      Note, none of the currently existing Windows/Hyper-V versions are known
      to enable 'save VMX-preemption timer value' when eVMCS is in use, the
      change is aimed at making the filtering future proof.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • V
      KVM: nVMX: Also filter MSR_IA32_VMX_TRUE_PINBASED_CTLS when eVMCS · f80ae0ef
      Vitaly Kuznetsov 提交于
      Similar to MSR_IA32_VMX_EXIT_CTLS/MSR_IA32_VMX_TRUE_EXIT_CTLS,
      MSR_IA32_VMX_ENTRY_CTLS/MSR_IA32_VMX_TRUE_ENTRY_CTLS pair,
      MSR_IA32_VMX_TRUE_PINBASED_CTLS needs to be filtered the same way
      MSR_IA32_VMX_PINBASED_CTLS is currently filtered as guests may solely rely
      on 'true' MSR data.
      
      Note, none of the currently existing Windows/Hyper-V versions are known
      to stumble upon the unfiltered MSR_IA32_VMX_TRUE_PINBASED_CTLS, the change
      is aimed at making the filtering future proof.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220112170134.1904308-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
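Filtering a control MSR "the same way" for the legacy and the 'true' variant amounts to clearing the same bits from the allowed-1 half (bits 63:32) of both values. The sketch below assumes that layout and uses hypothetical names (`filter_allowed1()`, `filter_pair()`); the mask is whatever the caller considers unsupported and is not taken from KVM's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Clear 'unsupported' control bits from the allowed-1 half (bits 63:32)
 * of a VMX capability MSR value, leaving the low half untouched.
 * Illustrative only: name and mask are assumptions.
 */
static uint64_t filter_allowed1(uint64_t msr_val, uint32_t unsupported)
{
	uint32_t low = (uint32_t)msr_val;
	uint32_t high = (uint32_t)(msr_val >> 32);

	high &= ~unsupported;
	return (uint64_t)low | ((uint64_t)high << 32);
}

/* Apply one mask to both the legacy and the 'true' MSR value. */
static void filter_pair(uint64_t *legacy, uint64_t *true_msr,
			uint32_t unsupported)
{
	assert(legacy != NULL && true_msr != NULL);

	*legacy = filter_allowed1(*legacy, unsupported);
	*true_msr = filter_allowed1(*true_msr, unsupported);
}
```

A guest that consults only the 'true' MSR then sees the same restricted set of controls as one that reads the legacy MSR.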
    • selftests: kvm: check dynamic bits against KVM_X86_XCOMP_GUEST_SUPP · b19c99b9
      Paolo Bonzini committed
      Provide coverage for the new API.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: add system attribute to retrieve full set of supported xsave states · dd6e6312
      Paolo Bonzini committed
      Because KVM_GET_SUPPORTED_CPUID is meant to be passed (by simple-minded
      VMMs) to KVM_SET_CPUID2, it cannot include any dynamic xsave states that
      have not been enabled.  Probing those, for example so that they can be
      passed to ARCH_REQ_XCOMP_GUEST_PERM, requires a new ioctl or arch_prctl.
      The latter is in fact worse, even though that is what the rest of the
      API uses, because it would require supported_xcr0 to be moved from the
      KVM module to the kernel just for this use.  In addition, the value
      would be nonsensical (or an error would have to be returned) until
      the KVM module is loaded in.
      
      Therefore, to limit the growth of system ioctls, add a /dev/kvm
      variant of KVM_{GET,HAS}_DEVICE_ATTR, and implement it in x86
      with just one group (0) and attribute (KVM_X86_XCOMP_GUEST_SUPP).
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add a helper to retrieve userspace address from kvm_device_attr · 56f289a8
      Sean Christopherson committed
      Add a helper to handle converting the u64 userspace address embedded in
      struct kvm_device_attr into a userspace pointer; it's all too easy to
      forget the intermediate "unsigned long" cast as well as the truncation
      check.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
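The cast-and-check pattern that such a helper centralizes can be sketched in plain userspace C. This is an illustration under assumptions: `u64_to_ptr_checked()` and `read_u64_at()` are hypothetical names, not the helper added by the commit, and a NULL or -1 return stands in for whatever error the kernel code reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Turn a 64-bit address carried in a struct into a native pointer.
 * The value goes through a pointer-width integer first, and the round
 * trip is compared so that an address that does not fit is rejected
 * instead of being silently truncated.
 */
static void *u64_to_ptr_checked(uint64_t addr)
{
	uintptr_t native = (uintptr_t)addr;	/* intermediate integer cast */

	if ((uint64_t)native != addr)		/* truncation check */
		return NULL;

	return (void *)native;
}

/* Usage sketch: read one 64-bit value through the checked pointer. */
static int read_u64_at(uint64_t addr, uint64_t *out)
{
	const uint64_t *src = u64_to_ptr_checked(addr);

	assert(out != NULL);
	if (src == NULL)
		return -1;	/* address was zero or did not fit */

	*out = *src;
	return 0;
}
```

On a 64-bit build the truncation branch can never trigger; it matters when pointers are narrower than the 64-bit field.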
  2. 27 Jan, 2022 24 commits
    • selftests: kvm: move vm_xsave_req_perm call to amx_test · dd4516ae
      Paolo Bonzini committed
      There is no need for tests other than amx_test to enable dynamic xsave
      states.  Remove the call to vm_xsave_req_perm from generic code,
      and move it inside the test.  While at it, allow customizing the bit
      that is requested, so that future tests can use it differently.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time · 05a9e065
      Like Xu committed
      XCR0 is reset to 1 by RESET but not INIT and IA32_XSS is zeroed by
      both RESET and INIT. kvm_set_msr_common()'s handling of MSR_IA32_XSS
      also needs to call kvm_update_cpuid_runtime(). In the above cases, the
      size in bytes of the XSAVE area containing all states enabled by XCR0 or
      (XCR0 | IA32_XSS) needs to be updated.
      
      For simplicity and consistency, existing helpers are used to write values
      and call kvm_update_cpuid_runtime(), and it's not exactly a fast path.
      
      Fixes: a554d207 ("KVM: X86: Processor States following Reset or INIT")
      Cc: stable@vger.kernel.org
      Signed-off-by: Like Xu <likexu@tencent.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS · 4c282e51
      Like Xu committed
      Do a runtime CPUID update for a vCPU if MSR_IA32_XSS is written, as the
      size in bytes of the XSAVE area is affected by the states enabled in XSS.
      
      Fixes: 20300099 ("kvm: vmx: add MSR logic for XSAVES")
      Cc: stable@vger.kernel.org
      Signed-off-by: Like Xu <likexu@tencent.com>
      [sean: split out as a separate patch, adjust Fixes tag]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Keep MSR_IA32_XSS unchanged for INIT · be4f3b3f
      Xiaoyao Li committed
      As of version 075, the SDM has been corrected to state that MSR_IA32_XSS is
      reset to zero on power-up and RESET but is left unchanged on INIT.
      
      Fixes: a554d207 ("KVM: X86: Processor States following Reset or INIT")
      Cc: stable@vger.kernel.org
      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
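The RESET versus INIT distinction described in this commit (XSS zeroed on RESET, untouched on INIT), together with the XCR0 behavior stated in the neighboring XCR0/IA32_XSS commit (set to 1 by RESET but not by INIT), can be written down as a tiny state model. The struct and function names are hypothetical and nothing here is KVM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the two registers discussed in the commit messages. */
struct xstate_regs {
	uint64_t xcr0;
	uint64_t xss;
};

/*
 * Apply either an INIT (init_event == true) or a RESET/power-up
 * (init_event == false) to the modelled registers.
 */
static void apply_reset(struct xstate_regs *regs, bool init_event)
{
	assert(regs != NULL);

	if (init_event)
		return;		/* INIT: XCR0 and XSS keep their values */

	regs->xcr0 = 1;		/* RESET: XCR0 goes back to 1 */
	regs->xss = 0;		/* RESET: XSS is zeroed */
}
```

Any cached value derived from these registers, such as the XSAVE area size, would need recomputing only on the RESET path in this model.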
    • KVM: x86: Free kvm_cpuid_entry2 array on post-KVM_RUN KVM_SET_CPUID{,2} · 811f95ff
      Sean Christopherson committed
      Free the "struct kvm_cpuid_entry2" array on successful post-KVM_RUN
      KVM_SET_CPUID{,2} to fix a memory leak; the callers of kvm_set_cpuid()
      free the array only on failure.
      
       BUG: memory leak
       unreferenced object 0xffff88810963a800 (size 2048):
        comm "syz-executor025", pid 3610, jiffies 4294944928 (age 8.080s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 0d 00 00 00  ................
          47 65 6e 75 6e 74 65 6c 69 6e 65 49 00 00 00 00  GenuntelineI....
        backtrace:
          [<ffffffff814948ee>] kmalloc_node include/linux/slab.h:604 [inline]
          [<ffffffff814948ee>] kvmalloc_node+0x3e/0x100 mm/util.c:580
          [<ffffffff814950f2>] kvmalloc include/linux/slab.h:732 [inline]
          [<ffffffff814950f2>] vmemdup_user+0x22/0x100 mm/util.c:199
          [<ffffffff8109f5ff>] kvm_vcpu_ioctl_set_cpuid2+0x8f/0xf0 arch/x86/kvm/cpuid.c:423
          [<ffffffff810711b9>] kvm_arch_vcpu_ioctl+0xb99/0x1e60 arch/x86/kvm/x86.c:5251
          [<ffffffff8103e92d>] kvm_vcpu_ioctl+0x4ad/0x950 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4066
          [<ffffffff815afacc>] vfs_ioctl fs/ioctl.c:51 [inline]
          [<ffffffff815afacc>] __do_sys_ioctl fs/ioctl.c:874 [inline]
          [<ffffffff815afacc>] __se_sys_ioctl fs/ioctl.c:860 [inline]
          [<ffffffff815afacc>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
          [<ffffffff844a3335>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<ffffffff844a3335>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c6617c61 ("KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN")
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+be576ad7655690586eec@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125210445.2053429-1-seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
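The ownership rule behind this leak is easy to get wrong: if the caller frees the array only on failure, every success path in the callee must either keep the array or free it. A minimal userspace model, with hypothetical names (`struct vcpu_model`, `set_entries()`) and plain `int` entries instead of CPUID data:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct vcpu_model {
	bool has_run;	/* stands in for "KVM_RUN has already happened" */
	int *entries;	/* installed array, owned by the model */
	size_t nent;
};

/*
 * Returns 0 on success and -1 on failure.  The caller frees 'e' only
 * when -1 is returned, so each success path must install or free 'e'.
 */
static int set_entries(struct vcpu_model *v, int *e, size_t nent)
{
	assert(v != NULL);

	if (v->has_run) {
		/* After the first run only an identical array is accepted. */
		if (nent != v->nent ||
		    (nent && memcmp(e, v->entries, nent * sizeof(*e)) != 0))
			return -1;	/* caller frees 'e' */

		free(e);		/* success without install: free here */
		return 0;
	}

	free(v->entries);
	v->entries = e;			/* success: the model now owns 'e' */
	v->nent = nent;
	return 0;
}
```

Dropping the `free(e)` on the "identical array" path reproduces the kind of leak the report describes: the call succeeds, nobody owns the buffer, and nobody frees it.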
    • KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02 · d6e656cd
      Sean Christopherson committed
      WARN if KVM attempts to allocate a shadow VMCS for vmcs02.  KVM emulates
      VMCS shadowing but doesn't virtualize it, i.e. KVM should never allocate
      a "real" shadow VMCS for L2.
      
      The previous code WARNed but continued anyway with the allocation,
      presumably in an attempt to avoid NULL pointer dereference.
      However, alloc_vmcs (and hence alloc_shadow_vmcs) can fail, and
      indeed the sole caller does:
      
      	if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu))
      		goto out_shadow_vmcs;
      
      which makes it not a useful attempt.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220527.2093146-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guest · 4cf3d3eb
      Sean Christopherson committed
      Don't skip the vmcall() in l2_guest_code() prior to re-entering L2, doing
      so will result in L2 running to completion, popping '0' off the stack for
      RET, jumping to address '0', and ultimately dying with a triple fault
      shutdown.
      
      It's not at all obvious why the test re-enters L2 and re-executes VMCALL,
      but presumably it serves a purpose.  The VMX path doesn't skip vmcall(),
      and the test can't possibly have passed on SVM, so just do what VMX does.
      
      Fixes: d951b221 ("KVM: selftests: smm_test: Test SMM enter from L2")
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125221725.2101126-1-seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Check .flags in kvm_cpuid_check_equal() too · 033a3ea5
      Vitaly Kuznetsov committed
      kvm_cpuid_check_equal() checks for the (full) equality of the supplied
      CPUID data so .flags need to be checked too.
      Reported-by: Sean Christopherson <seanjc@google.com>
      Fixes: c6617c61 ("KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220126131804.2839410-1-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
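A "full equality" check over CPUID entries has to compare every meaningful field, including the flags. The sketch below uses a simplified stand-in for `struct kvm_cpuid_entry2` (padding omitted) and a hypothetical `cpuid_entries_equal()`; it is not the kernel's `kvm_cpuid_check_equal()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct kvm_cpuid_entry2 (padding omitted). */
struct cpuid_entry {
	uint32_t function;
	uint32_t index;
	uint32_t flags;
	uint32_t eax, ebx, ecx, edx;
};

/* True only if both arrays match in every field of every entry. */
static bool cpuid_entries_equal(const struct cpuid_entry *a,
				const struct cpuid_entry *b, size_t nent)
{
	assert(nent == 0 || (a != NULL && b != NULL));

	for (size_t i = 0; i < nent; i++) {
		if (a[i].function != b[i].function ||
		    a[i].index != b[i].index ||
		    a[i].flags != b[i].flags ||	/* the field that is easy to miss */
		    a[i].eax != b[i].eax || a[i].ebx != b[i].ebx ||
		    a[i].ecx != b[i].ecx || a[i].edx != b[i].edx)
			return false;
	}
	return true;
}
```

Comparing field by field, rather than with `memcmp()` over the whole struct, keeps padding bytes out of the result.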
    • KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078
      Sean Christopherson committed
      Forcibly leave nested virtualization operation if userspace toggles SMM
      state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
      forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
      vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
      vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
      
      Don't attempt to gracefully handle the transition as (a) most transitions
      are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
      sufficient information to handle all transitions, e.g. SVM wants access
      to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
      KVM_SET_NESTED_STATE during state restore as the latter disallows putting
      the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
      being post-VMXON in SMM if SMM is not active.
      
      Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
      due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
      just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
      in an architecturally impossible state.
      
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Modules linked in:
        CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
        Call Trace:
         <TASK>
         kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
         kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
         kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
         kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
         kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
         kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
         kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
         __fput+0x286/0x9f0 fs/file_table.c:311
         task_work_run+0xdd/0x1a0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0xb29/0x2a30 kernel/exit.c:806
         do_group_exit+0xd2/0x2f0 kernel/exit.c:935
         get_signal+0x4b0/0x28c0 kernel/signal.c:2862
         arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
         do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220358.2091737-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: drop unnecessary code in svm_hv_vmcb_dirty_nested_enlightenments() · aa3b39f3
      Vitaly Kuznetsov committed
      Commit 3fa5e8fd ("KVM: SVM: delay svm_vcpu_init_msrpm after
      svm->vmcb is initialized") re-arranged svm_vcpu_init_msrpm() call in
      svm_create_vcpu(), thus making the comment about vmcb being NULL
      obsolete. Drop it.
      
      While at it, drop the superfluous vmcb_is_clean() check: vmcb_mark_dirty()
      is a bit flip, an extra check is unlikely to bring any performance gain.
      Drop now-unneeded vmcb_is_clean() helper as well.
      
      Fixes: 3fa5e8fd ("KVM: SVM: delay svm_vcpu_init_msrpm after svm->vmcb is initialized")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211220152139.418372-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: hyper-v: Enable Enlightened MSR-Bitmap support for real · 38dfa830
      Vitaly Kuznetsov committed
      Commit c4327f15 ("KVM: SVM: hyper-v: Enlightened MSR-Bitmap support")
      introduced enlightened MSR-Bitmap support for KVM-on-Hyper-V but it didn't
      actually enable the support. Similar to enlightened NPT TLB flush and
      direct TLB flush features, the guest (KVM) has to tell L0 (Hyper-V) that
      it's using the feature by setting the appropriate feature bit in VMCB
      control area (sw reserved fields).
      
      Fixes: c4327f15 ("KVM: SVM: hyper-v: Enlightened MSR-Bitmap support")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211220152139.418372-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Don't kill SEV guest if SMAP erratum triggers in usermode · cdf85e0c
      Sean Christopherson committed
      Inject a #GP instead of synthesizing triple fault to try to avoid killing
      the guest if emulation of an SEV guest fails due to encountering the SMAP
      erratum.  The injected #GP may still be fatal to the guest, e.g. if the
      userspace process is providing critical functionality, but KVM should
      make every attempt to keep the guest alive.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Don't apply SEV+SMAP workaround on code fetch or PT access · 3280cc22
      Sean Christopherson committed
      Resume the guest instead of synthesizing a triple fault shutdown if the
      instruction bytes buffer is empty due to the #NPF being on the code fetch
      itself or on a page table access.  The SMAP errata applies if and only if
      the code fetch was successful and ucode's subsequent data read from the
      code page encountered a SMAP violation.  In practice, the guest is likely
      hosed either way, but crashing the guest on a code fetch to emulated MMIO
      is technically wrong according to the behavior described in the APM.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Inject #UD on attempted emulation for SEV guest w/o insn buffer · 04c40f34
      Sean Christopherson committed
      Inject #UD if KVM attempts emulation for an SEV guest without an insn
      buffer and instruction decoding is required.  The previous behavior of
      allowing emulation if there is no insn buffer is undesirable as doing so
      means KVM is reading guest private memory and thus decoding cyphertext,
      i.e. is emulating garbage.  The check was previously necessary as the
      emulation type was not provided, i.e. SVM needed to allow emulation to
      handle completion of emulation after exiting to userspace to handle I/O.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: WARN if KVM attempts emulation on #UD or #GP for SEV guests · 132627c6
      Sean Christopherson committed
      WARN if KVM attempts to emulate in response to #UD or #GP for SEV guests,
      i.e. if KVM intercepts #UD or #GP, as emulation on any fault except #NPF
      is impossible since KVM cannot read guest private memory to get the code
      stream, and the CPU's DecodeAssists feature only provides the instruction
      bytes on #NPF.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-7-seanjc@google.com>
      [Warn on EMULTYPE_TRAP_UD_FORCED according to Liam Merwick's review. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Pass emulation type to can_emulate_instruction() · 4d31d9ef
      Sean Christopherson committed
      Pass the emulation type to kvm_x86_ops.can_emulate_instruction() so that
      a future commit can harden KVM's SEV support to WARN on emulation
      scenarios that should never happen.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Explicitly require DECODEASSISTS to enable SEV support · c532f290
      Sean Christopherson committed
      Add a sanity check on DECODEASSISTS being supported if SEV is supported, as
      KVM cannot read guest private memory and thus relies on the CPU to
      provide the instruction byte stream on #NPF for emulation.  The intent of
      the check is to document the dependency; it should never fail in practice
      as producing hardware that supports SEV but not DECODEASSISTS would be
      non-sensical.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Don't intercept #GP for SEV guests · 0b0be065
      Sean Christopherson committed
      Never intercept #GP for SEV guests as reading SEV guest private memory
      will return cyphertext, i.e. emulating on #GP can't work as intended.
      
      Cc: stable@vger.kernel.org
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: SVM: avoid infinite loop on NPF from bad address" · 31c25585
      Sean Christopherson committed
      Revert a completely broken check on an "invalid" RIP in SVM's workaround
      for the DecodeAssists SMAP errata.  kvm_vcpu_gfn_to_memslot() obviously
      expects a gfn, i.e. operates in the guest physical address space, whereas
      RIP is a virtual (not even linear) address.  The "fix" worked for the
      problematic KVM selftest because the test identity mapped RIP.
      
      Fully revert the hack instead of trying to translate RIP to a GPA, as the
      non-SEV case is now handled earlier, and KVM cannot access guest page
      tables to translate RIP.
      
      This reverts commit e72436bc.
      
      Fixes: e72436bc ("KVM: SVM: avoid infinite loop on NPF from bad address")
      Reported-by: Liam Merwick <liam.merwick@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Never reject emulation due to SMAP errata for !SEV guests · 55467fcd
      Sean Christopherson committed
      Always signal that emulation is possible for !SEV guests regardless of
      whether or not the CPU provided a valid instruction byte stream.  KVM can
      read all guest state (memory and registers) for !SEV guests, i.e. can
      fetch the code stream from memory even if the CPU failed to do so because
      of the SMAP errata.
      
      Fixes: 05d5a486 ("KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)")
      Cc: stable@vger.kernel.org
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: skip eax alignment check for non-SVM instructions · 47c28d43
      Denis Valeev committed
      The bug occurs on a #GP triggered by the VMware backdoor when the eax value
      is unaligned. The eax alignment check should not be applied to non-SVM
      instructions because it causes emulation of those instructions to be
      incorrectly skipped.
      Apply the alignment check only to SVM instructions to fix this.
      
      Fixes: d1cba6c9 ("KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround")
      Signed-off-by: Denis Valeev <lemniscattaden@gmail.com>
      Message-Id: <Yexlhaoe1Fscm59u@q>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/cpuid: Exclude unpermitted xfeatures sizes at KVM_GET_SUPPORTED_CPUID · 1ffce092
      Like Xu committed
      With the help of xstate_get_guest_group_perm(), KVM can exclude unpermitted
      xfeatures in cpuid.0xd.0.eax, in which case the corresponding xfeatures
      sizes should also be matched to the permitted xfeatures.
      
      To fix this inconsistency, the permitted_xcr0 and permitted_xss are defined
      consistently, which implies 'supported' plus certain permissions for this
      task, and it also fixes cpuid.0xd.1.ebx and later leaf-by-leaf queries.
      
      Fixes: 445ecdf7 ("kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220125115223.33707-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Also cancel preemption timer during SET_LAPIC · 35fe7cfb
      Wanpeng Li committed
      The warning below is triggered during guest reboot.
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 1931 at arch/x86/kvm/x86.c:10322 kvm_arch_vcpu_ioctl_run+0x874/0x880 [kvm]
        CPU: 0 PID: 1931 Comm: qemu-system-x86 Tainted: G          I       5.17.0-rc1+ #5
        RIP: 0010:kvm_arch_vcpu_ioctl_run+0x874/0x880 [kvm]
        Call Trace:
         <TASK>
         kvm_vcpu_ioctl+0x279/0x710 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7fd39797350b
      
      This can be triggered by not exposing tsc-deadline mode and doing a reboot in
      the guest. The lapic_shutdown() function, which is called in the sys_reboot
      path, will not disarm the flying timer; it just masks LVTT. lapic_shutdown()
      clears APIC state w/ LVT_MASKED and the timer-mode bit is 0; this can trigger
      a timer-mode switch between tsc-deadline and oneshot/periodic, which can
      result in the preemption timer being cancelled in apic_update_lvtt(). However,
      we can't depend on this when tsc-deadline mode is not exposed and the
      oneshot/periodic modes are emulated by the preemption timer. QEMU will
      synchronise state around reset, so cancel the preemption timer under
      KVM_SET_LAPIC.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1643102220-35667-1-git-send-email-wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      35fe7cfb
    • J
      KVM: VMX: Remove vmcs_config.order · 519669cc
      Jim Mattson authored
      The maximum size of a VMCS (or VMXON region) is 4096. By definition,
      these are order 0 allocations.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20220125004359.147600-1-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      519669cc
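To see why a 4096-byte region is an order 0 allocation, here is a minimal userspace sketch of how a page-allocation order follows from a size. It assumes 4 KiB pages; `sketch_get_order` and `SKETCH_PAGE_SIZE` are hypothetical names for illustration, not the kernel's helpers.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages, as on x86 */

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= size. */
static unsigned int sketch_get_order(size_t size)
{
    unsigned int order = 0;
    size_t span = SKETCH_PAGE_SIZE;

    assert(size > 0);
    while (span < size) {
        span <<= 1;   /* each order doubles the number of pages */
        order++;
    }
    return order;
}
```

With these assumptions, any size up to 4096 bytes maps to order 0, so a field that stores the order for a region of at most 4096 bytes carries no information.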
  3. 25 Jan, 2022 4 commits
    • Q
      KVM/X86: Make kvm_vcpu_reload_apic_access_page() static · d081a343
      Quanfa Fu authored
      Make kvm_vcpu_reload_apic_access_page() static
      as it is no longer invoked directly by vmx
      and it is also no longer exported.
      
      No functional change intended.
      Signed-off-by: Quanfa Fu <quanfafu@gmail.com>
      Message-Id: <20211219091446.174584-1-quanfafu@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d081a343
    • D
      KVM: selftests: Re-enable access_tracking_perf_test · de1956f4
      David Matlack authored
      This selftest was accidentally removed by commit 6a581508
      ("selftest: KVM: Add intra host migration tests"). Add it back.
      
      Fixes: 6a581508 ("selftest: KVM: Add intra host migration tests")
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220120003826.2805036-1-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      de1956f4
    • S
      KVM: VMX: Set vmcs.PENDING_DBG.BS on #DB in STI/MOVSS blocking shadow · b9bed78e
      Sean Christopherson authored
      Set vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS, a.k.a. the pending single-step
      breakpoint flag, when re-injecting a #DB with RFLAGS.TF=1, and STI or
      MOVSS blocking is active.  Setting the flag is necessary to make VM-Entry
      consistency checks happy, as VMX has an invariant that if RFLAGS.TF is
      set and STI/MOVSS blocking is true, then the previous instruction must
      have been STI or MOV/POP, and therefore a single-step #DB must be pending
      since the RFLAGS.TF cannot have been set by the previous instruction,
      i.e. the one instruction delay after setting RFLAGS.TF must have already
      expired.
      
      Normally, the CPU sets vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS appropriately
      when recording guest state as part of a VM-Exit, but #DB VM-Exits
      intentionally do not treat the #DB as "guest state" as interception of
      the #DB effectively makes the #DB host-owned, thus KVM needs to manually
      set PENDING_DBG.BS when forwarding/re-injecting the #DB to the guest.
      
      Note, although this bug can be triggered by guest userspace, doing so
      requires IOPL=3, and guest userspace running with IOPL=3 has full access
      to all I/O ports (from the guest's perspective) and can crash/reboot the
      guest any number of ways.  IOPL=3 is required because STI blocking kicks
      in if and only if RFLAGS.IF is toggled 0=>1, and if CPL>IOPL, STI either
      takes a #GP or modifies RFLAGS.VIF, not RFLAGS.IF.
      
      MOVSS blocking can be initiated by userspace, but can be coincident with
      a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is
      executed in the MOVSS shadow.  MOV DR #GPs at CPL>0, thus MOVSS blocking
      is problematic only for CPL0 (and only if the guest is crazy enough to
      access a DR in a MOVSS shadow).  All other sources of #DBs are either
      suppressed by MOVSS blocking (single-step, code fetch, data, and I/O),
      are mutually exclusive with MOVSS blocking (T-bit task switch), or are
      already handled by KVM (ICEBP, a.k.a. INT1).
      
      This bug was originally found by running tests[1] created for XSA-308[2].
      Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is
      presumably why the Xen bug was deemed to be an exploitable DOS from guest
      userspace.  KVM already handles ICEBP by skipping the ICEBP instruction
      and thus clears MOVSS blocking as a side effect of its "emulation".
      
      [1] http://xenbits.xenproject.org/docs/xtf/xsa-308_2main_8c_source.html
      [2] https://xenbits.xen.org/xsa/advisory-308.html

      Reported-by: David Woodhouse <dwmw2@infradead.org>
      Reported-by: Alexander Graf <graf@amazon.de>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220120000624.655815-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b9bed78e
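The condition described in this commit message can be modelled in a few lines of userspace C. This is a sketch of the rule only, not the actual KVM change: the struct and the `sketch_*` names are hypothetical, while the bit positions follow the architectural layout (RFLAGS.TF is bit 8, STI and MOV-SS blocking are bits 0 and 1 of the interruptibility state, and BS is bit 14 of the pending debug exceptions field).

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EFLAGS_TF       (1u << 8)    /* RFLAGS.TF */
#define SKETCH_INTR_STI        (1u << 0)    /* blocking by STI */
#define SKETCH_INTR_MOVSS      (1u << 1)    /* blocking by MOV SS */
#define SKETCH_PENDING_DBG_BS  (1ull << 14) /* pending single-step */

/* Hypothetical snapshot of the guest state relevant to the check. */
struct sketch_guest {
    uint64_t rflags;
    uint32_t intr_state;
    uint64_t pending_dbg;
};

/*
 * When re-injecting a #DB: if TF is set while an STI/MOVSS shadow is
 * active, mark a single-step #DB as pending so that the state passes
 * the VM-Entry consistency check described above.
 */
static void sketch_fixup_pending_dbg(struct sketch_guest *g)
{
    assert(g);
    if ((g->rflags & SKETCH_EFLAGS_TF) &&
        (g->intr_state & (SKETCH_INTR_STI | SKETCH_INTR_MOVSS)))
        g->pending_dbg |= SKETCH_PENDING_DBG_BS;
}
```

The flag is left untouched when either TF is clear or no blocking shadow is active, since the invariant only applies when both hold.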
    • V
      KVM: x86: Move CPUID.(EAX=0x12,ECX=1) mangling to __kvm_update_cpuid_runtime() · 5c89be1d
      Vitaly Kuznetsov authored
      Full equality check of CPUID data on update (kvm_cpuid_check_equal()) may
      fail for SGX enabled CPUs as CPUID.(EAX=0x12,ECX=1) is currently being
      mangled in kvm_vcpu_after_set_cpuid(). Move it to
      __kvm_update_cpuid_runtime() and split off a cpuid_get_supported_xcr0()
      helper, as the 'vcpu->arch.guest_supported_xcr0' update needs (logically)
      to stay in kvm_vcpu_after_set_cpuid().
      
      Cc: stable@vger.kernel.org
      Fixes: feb627e8 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220124103606.2630588-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5c89be1d
  4. 24 Jan, 2022 3 commits
    • X
      KVM: remove async parameter of hva_to_pfn_remapped() · 1625566e
      Xianting Tian authored
      The async parameter of hva_to_pfn_remapped() is not used, so remove it.
      Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
      Message-Id: <20220124020456.156386-1-xianting.tian@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1625566e
    • P
      x86,kvm/xen: Remove superfluous .fixup usage · adb759e5
      Peter Zijlstra authored
      Commit 14243b38 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and
      event channel delivery") adds superfluous .fixup usage after the whole
      .fixup section was removed in commit e5eefda5 ("x86: Remove .fixup
      section").
      
      Fixes: 14243b38 ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery")
      Reported-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Message-Id: <20220123124219.GH20638@worktop.programming.kicks-ass.net>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      adb759e5
    • S
      KVM: VMX: Zero host's SYSENTER_ESP iff SYSENTER is NOT used · 94fea1d8
      Sean Christopherson authored
      Zero vmcs.HOST_IA32_SYSENTER_ESP when initializing *constant* host state
      if and only if SYSENTER cannot be used, i.e. the kernel is a 64-bit
      kernel and is not emulating 32-bit syscalls.  As the name suggests,
      vmx_set_constant_host_state() is intended for state that is *constant*.
      When SYSENTER is used, SYSENTER_ESP isn't constant because stacks are
      per-CPU, and the VMCS must be updated whenever the vCPU is migrated to a
      new CPU.  The logic in vmx_vcpu_load_vmcs() doesn't differentiate between
      "never loaded" and "loaded on a different CPU", i.e. setting SYSENTER_ESP
      on VMCS load also handles setting correct host state when the VMCS is
      first loaded.
      
      Because a VMCS must be loaded before it is initialized during vCPU RESET,
      zeroing the field in vmx_set_constant_host_state() obliterates the value
      that was written when the VMCS was loaded.  If the vCPU is run before it
      is migrated, the subsequent VM-Exit will zero out MSR_IA32_SYSENTER_ESP,
      leading to a #DF on the next 32-bit syscall.
      
        double fault: 0000 [#1] SMP
        CPU: 0 PID: 990 Comm: stable Not tainted 5.16.0+ #97
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        EIP: entry_SYSENTER_32+0x0/0xe7
        Code: <9c> 50 eb 17 0f 20 d8 a9 00 10 00 00 74 0d 25 ff ef ff ff 0f 22 d8
        EAX: 000000a2 EBX: a8d1300c ECX: a8d13014 EDX: 00000000
        ESI: a8f87000 EDI: a8d13014 EBP: a8d12fc0 ESP: 00000000
        DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00210093
        CR0: 80050033 CR2: fffffffc CR3: 02c3b000 CR4: 00152e90
      
      Fixes: 6ab8a405 ("KVM: VMX: Avoid to rdmsrl(MSR_IA32_SYSENTER_ESP)")
      Cc: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220122015211.1468758-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      94fea1d8
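The ordering problem described above (load writes the per-CPU value, then initialization zeroes it) can be illustrated with a toy model. This is a hedged sketch rather than the VMX code: `sketch_vmcs`, `sketch_load` and `sketch_init_constant` are hypothetical stand-ins for the VMCS field, the load path and the constant-host-state setup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the one VMCS field of interest. */
struct sketch_vmcs {
    uint64_t host_sysenter_esp;
};

/* Models VMCS load: the per-CPU SYSENTER stack is written here. */
static void sketch_load(struct sketch_vmcs *v, uint64_t cpu_stack)
{
    assert(v);
    v->host_sysenter_esp = cpu_stack;
}

/*
 * Models constant-host-state setup, which runs after the load.
 * Zero the field only when SYSENTER cannot be used; otherwise the
 * value written at load time must survive.
 */
static void sketch_init_constant(struct sketch_vmcs *v, bool sysenter_usable)
{
    assert(v);
    if (!sysenter_usable)
        v->host_sysenter_esp = 0;
}
```

In this model, zeroing unconditionally in `sketch_init_constant()` would discard the value written by `sketch_load()`, which corresponds to the stale zero that leads to the #DF in the report above.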