1. 08 June 2022, 5 commits
    • 
      KVM: x86: Introduce "struct kvm_caps" to track misc caps/settings · 938c8745
      Committed by Sean Christopherson
      Add kvm_caps to hold a variety of capabilities and defaults that aren't
      handled by kvm_cpu_caps because they aren't CPUID bits, in order to reduce
      the amount of boilerplate code required to add a new feature.  The vast
      majority (all?) of the caps interact with vendor code and are written
      only during initialization, i.e. should be tagged __read_mostly, declared
      extern in x86.h, and exported.
      
      No functional change intended.
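      
      A minimal sketch of such a container (the field names here are
      illustrative assumptions, not the exact upstream layout):
      
        /* x86.h (sketch): misc caps/settings that aren't CPUID bits */
        struct kvm_caps {
                bool has_tsc_control;   /* can the guest's TSC rate be scaled? */
                u32  max_guest_tsc_khz; /* maximum allowed guest TSC rate */
                u64  supported_xcr0;    /* XCR0 bits KVM can expose to guests */
        };
        extern struct kvm_caps kvm_caps;
      
        /* x86.c (sketch): written only during initialization */
        struct kvm_caps kvm_caps __read_mostly;
        EXPORT_SYMBOL_GPL(kvm_caps);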
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      938c8745
    • 
      KVM: nSVM: Transparently handle L1 -> L2 NMI re-injection · 159fc6fa
      Committed by Maciej S. Szmigiero
      An NMI that L1 wants to inject into its L2 should be directly re-injected,
      without causing L0 side effects like engaging NMI blocking for L1.
      
      It's also worth noting that in this case it is L1's responsibility
      to track the NMI window status for its L2 guest.
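      
      A hedged sketch of the resulting injection path; the nmi_l1_to_l2 flag
      name is an assumption for illustration (set when the NMI being
      (re)injected originated in vmcb12's EVENTINJ field):
      
        static void svm_inject_nmi(struct kvm_vcpu *vcpu)
        {
                struct vcpu_svm *svm = to_svm(vcpu);
      
                svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
      
                /* L1 -> L2 re-injection: L1 tracks the NMI window for its L2,
                 * so don't engage L0-level NMI blocking. */
                if (svm->nmi_l1_to_l2)
                        return;
      
                vcpu->arch.hflags |= HF_NMI_MASK;
                ++vcpu->stat.nmi_injections;
        }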
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <f894d13501cd48157b3069a4b4c7369575ddb60e.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      159fc6fa
    • 
      KVM: SVM: Re-inject INTn instead of retrying the insn on "failure" · 7e5b5ef8
      Committed by Sean Christopherson
      Re-inject INTn software interrupts instead of retrying the instruction if
      the CPU encountered an intercepted exception while vectoring the INTn,
      e.g. if KVM intercepted a #PF when utilizing shadow paging.  Retrying the
      instruction is architecturally wrong, e.g. it will result in a spurious #DB
      if there's a code breakpoint on the INTn, and lack of re-injection also
      breaks nested virtualization, e.g. if L1 injects a software interrupt and
      vectoring the injected interrupt encounters an exception that is
      intercepted by L0 but not L1.
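      
      A sketch of the completion-path change; svm_soft_int_rip_known() is a
      hypothetical helper standing in for the check that the next RIP could
      be recovered for the interrupted INTn:
      
        static void svm_complete_soft_interrupt(struct kvm_vcpu *vcpu, u8 vector)
        {
                struct vcpu_svm *svm = to_svm(vcpu);
      
                /*
                 * Re-inject the INTn rather than retrying it: retrying raises
                 * a spurious #DB on a code breakpoint and loses L1-injected
                 * events that hit an exception intercepted only by L0.
                 */
                if (svm_soft_int_rip_known(svm))
                        kvm_queue_interrupt(vcpu, vector, /*soft=*/true);
                /* else: drop the event and retry the insn as a last resort */
        }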
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <1654ad502f860948e4f2d57b8bd881d67301f785.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7e5b5ef8
    • 
      KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction · 6ef88d6e
      Committed by Sean Christopherson
      Re-inject INT3/INTO instead of retrying the instruction if the CPU
      encountered an intercepted exception while vectoring the software
      exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
      shadow paging.  Retrying the instruction is architecturally wrong, e.g.
      will result in a spurious #DB if there's a code breakpoint on the INT3/INTO,
      and lack of re-injection also breaks nested virtualization, e.g. if L1
      injects a software exception and vectoring the injected exception
      encounters an exception that is intercepted by L0 but not L1.
      
      Due to, ahem, deficiencies in the SVM architecture, acquiring the next
      RIP may require flowing through the emulator even if NRIPS is supported,
      as the CPU clears next_rip if the VM-Exit is due to an exception other
      than "exceptions caused by the INT3, INTO, and BOUND instructions".  To
      deal with this, "skip" the instruction to calculate next_rip (if it's
      not already known), and then unwind the RIP write and any side effects
      (RFLAGS updates).
      
      Save the computed next_rip and use it to re-stuff next_rip if injection
      doesn't complete.  This allows KVM to do the right thing if next_rip was
      known prior to injection, e.g. if L1 injects a soft event into L2, and
      there is no backing INTn instruction, e.g. if L1 is injecting an
      arbitrary event.
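      
      A hedged sketch of the "skip, record, unwind" flow; the soft_int_*
      field names follow the description above and may not match the
      upstream identifiers exactly:
      
        static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
        {
                struct vcpu_svm *svm = to_svm(vcpu);
                unsigned long old_rip = kvm_rip_read(vcpu);
      
                /* "Skip" the INT3/INTO to compute next_rip... */
                if (!svm_skip_emulated_instruction(vcpu))
                        return -EIO;
      
                /* ...record it so injection can re-stuff next_rip on failure... */
                svm->soft_int_injected = true;
                svm->soft_int_old_rip  = old_rip;
                svm->soft_int_next_rip = kvm_rip_read(vcpu);
                svm->vmcb->control.next_rip = svm->soft_int_next_rip;
      
                /* ...then unwind the RIP write (and any RFLAGS side effects). */
                kvm_rip_write(vcpu, old_rip);
                return 0;
        }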
      
      Note, it's impossible to guarantee architectural correctness given SVM's
      architectural flaws.  E.g. if the guest executes INTn (no KVM injection),
      an exit occurs while vectoring the INTn, and the guest modifies the code
      stream while the exit is being handled, KVM will compute the incorrect
      next_rip due to "skipping" the wrong instruction.  A future enhancement
      to make this less awful would be for KVM to detect that the decoded
      instruction is not the correct INTn and drop the to-be-injected soft
      event (retrying is a lesser evil compared to shoving the wrong RIP on the
      exception stack).
      Reported-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <65cb88deab40bc1649d509194864312a89bbe02e.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ef88d6e
    • 
      KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 · 00f08d99
      Committed by Maciej S. Szmigiero
      The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
      The value of this field (instead of the saved guest RIP) is used by the CPU
      for the return address pushed on the stack when injecting a software
      interrupt or an INT3/INTO exception.
      
      Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
      loading a nested state and NRIPS is exposed to L1.  If NRIPS is supported
      in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
      vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
      saves RIP on the stack as-is).
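      
      A sketch of the resulting sync in nested_vmcb02_prepare_control()
      (hedged; vmcb12_rip stands for the L2 RIP taken from vmcb12):
      
        if (svm->nrips_enabled)                    /* NRIPS exposed to L1 */
                vmcb02->control.next_rip = svm->nested.ctl.next_rip;
        else if (boot_cpu_has(X86_FEATURE_NRIPS))  /* hw NRIPS, hidden from L1 */
                vmcb02->control.next_rip = vmcb12_rip; /* emulate a !NRIPS CPU */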
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <c2e0a3d78db3ae30530f11d4e9254b452a89f42b.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      00f08d99
  2. 07 June 2022, 1 commit
    • 
      KVM: SVM: fix tsc scaling cache logic · 11d39e8c
      Committed by Maxim Levitsky
      SVM uses a per-CPU variable to cache the current value of the TSC scaling
      multiplier MSR on each CPU.
      
      Commit 1ab9287a
      ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
      broke this caching logic.
      
      Refactor the code so that all TSC scaling multiplier writes go through
      a single function that checks and updates the cache, as sketched below.
      
      This fixes the following scenario:
      
      1. A CPU runs a guest with some TSC scaling ratio.
      
      2. A new guest with a different TSC scaling ratio starts on this CPU
         and terminates almost immediately.
      
         This ensures that the short-running guest set the TSC scaling ratio
         just once, when it was set via KVM_SET_TSC_KHZ; due to the bug, the
         per-CPU cache is not updated.
      
      3. The original guest continues to run; because the (stale) cache matches
         its ratio, it doesn't restore the MSR to its own value, and thus
         continues to run with the wrong TSC scaling ratio.
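      
      A minimal sketch of the single write path, assuming a per-CPU
      current_tsc_ratio cache variable as described above:
      
        static void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
        {
                preempt_disable();
                if (multiplier != __this_cpu_read(current_tsc_ratio)) {
                        wrmsrl(MSR_AMD64_TSC_RATIO, multiplier);
                        __this_cpu_write(current_tsc_ratio, multiplier);
                }
                preempt_enable();
        }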
      
      Fixes: 1ab9287a ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      11d39e8c
  3. 30 April 2022, 1 commit
  4. 14 April 2022, 1 commit
    • 
      KVM: x86: Drop WARNs that assert a triple fault never "escapes" from L2 · 45846661
      Committed by Sean Christopherson
      Remove WARNs that sanity check that KVM never lets a triple fault for L2
      escape and incorrectly end up in L1.  In normal operation, the sanity
      check is perfectly valid, but it incorrectly assumes that it's impossible
      for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through
      KVM_RUN (which guarantees kvm_check_nested_state() will see and handle
      the triple fault).
      
      The WARN can currently be triggered if userspace injects a machine check
      while L2 is active and CR4.MCE=0.  And a future fix to allow save/restore
      of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't
      lost on migration, will make it trivially easy for userspace to trigger
      the WARN.
      
      Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is
      tempting, but wrong, especially if/when the request is saved/restored,
      e.g. if userspace restores events (including a triple fault) and then
      restores nested state (which may forcibly leave guest mode).  Ignoring
      the fact that KVM doesn't currently provide the necessary APIs, it's
      userspace's responsibility to manage pending events during save/restore.
      
        ------------[ cut here ]------------
        WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
        Call Trace:
         <TASK>
         vmx_leave_nested+0x30/0x40 [kvm_intel]
         vmx_set_nested_state+0xca/0x3e0 [kvm_intel]
         kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm]
         kvm_vcpu_ioctl+0x4b9/0x660 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        ---[ end trace 0000000000000000 ]---
      
      Fixes: cb6a32c2 ("KVM: x86: Handle triple fault in L2 without killing L1")
      Cc: stable@vger.kernel.org
      Cc: Chenyi Qiang <chenyi.qiang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220407002315.78092-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      45846661
  5. 02 April 2022, 7 commits
  6. 25 February 2022, 1 commit
    • 
      KVM: x86/mmu: load new PGD after the shadow MMU is initialized · 3cffc89d
      Committed by Paolo Bonzini
      Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
      shadow_root_level anymore, pull the PGD load after the initialization of
      the shadow MMUs.
      
      Besides being more intuitive, this enables future simplifications
      and optimizations because it is no longer necessary to compute the
      role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
      attempting to use a cached PGD to avoid having to figure out the new role.
      With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing,
      and avoid unloading all the cached roots.
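      
      A hedged sketch of the new ordering, modeled on the
      nested_{vmx,svm}_load_cr3 pattern the message refers to (simplified,
      not the verbatim diff):
      
        vcpu->arch.cr3 = cr3;
        kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
      
        /* Initialize the shadow MMU (and compute its role) first... */
        kvm_init_mmu(vcpu);
      
        /* ...then load the PGD, which may now reuse a cached root. */
        if (!nested_ept)
                kvm_mmu_new_pgd(vcpu, cr3);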
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3cffc89d
  7. 11 February 2022, 2 commits
    • 
      KVM: nSVM: Implement Enlightened MSR-Bitmap feature · 66c03a92
      Committed by Vitaly Kuznetsov
      Similar to nVMX commit 502d2bf5 ("KVM: nVMX: Implement Enlightened MSR
      Bitmap feature"), add support for the feature for nSVM (Hyper-V on KVM).
      
      Notable differences from nVMX implementation:
      - As the feature uses SW reserved fields in the VMCB control area, KVM needs
        to make sure it's dealing with a Hyper-V guest (kvm_hv_hypercall_enabled()).
      
      - 'msrpm_base_pa' needs to always be overwritten in
        nested_svm_vmrun_msrpm(), even when the update is skipped.  As an
        optimization, nested_vmcb02_prepare_control() copies it from VMCB01,
        so when the MSR-Bitmap feature for L2 is disabled nothing needs to be done.
      
      - 'struct vmcb_ctrl_area_cached' needs to be extended with the clean
        fields/SW reserved data, and __nested_copy_vmcb_control_to_cache() needs
        to copy it so that nested_svm_vmrun_msrpm() can use it later.
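      
      A hedged sketch of the resulting skip check in nested_svm_vmrun_msrpm();
      the field and bit names follow the nVMX counterpart and are assumptions
      here:
      
        if (!svm->nested.force_msr_bitmap_recalc &&
            kvm_hv_hypercall_enabled(vcpu) &&
            hve->hv_enlightenments_control.msr_bitmap &&
            (svm->nested.ctl.clean & BIT(HV_VMCB_NESTED_ENLIGHTENMENTS)))
                goto set_msrpm_base_pa; /* L2 bitmap unchanged, reuse merged copy */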
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220202095100.129834-5-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      66c03a92
    • 
      KVM: nSVM: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt · 73c25546
      Committed by Vitaly Kuznetsov
      Similar to nVMX commit ed2a4800 ("KVM: nVMX: Track whether changes in
      L0 require MSR bitmap for L2 to be rebuilt"), introduce a flag to keep
      track of whether the MSR bitmap for L2 needs to be rebuilt, due to changes
      in the MSR bitmap for L1 or switching to a different L2 (see the sketch
      below).
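      
      A hedged sketch of the flag's life cycle (the call sites are
      illustrative):
      
        /* L1's MSR bitmap changed, or KVM switched to a different L2:
         * the merged bitmap for L2 is now stale. */
        svm->nested.force_msr_bitmap_recalc = true;
      
        /* nested_svm_vmrun_msrpm(): clear after rebuilding the merged bitmap. */
        svm->nested.force_msr_bitmap_recalc = false;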
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220202095100.129834-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      73c25546
  8. 09 February 2022, 1 commit
  9. 27 January 2022, 1 commit
    • 
      KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078
      Committed by Sean Christopherson
      Forcibly leave nested virtualization operation if userspace toggles SMM
      state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
      forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
      vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
      vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
      
      Don't attempt to gracefully handle the transition as (a) most transitions
      are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
      sufficient information to handle all transitions, e.g. SVM wants access
      to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
      KVM_SET_NESTED_STATE during state restore as the latter disallows putting
      the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
      being post-VMXON in SMM if SMM is not active.
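      
      A hedged sketch of the fix in kvm_vcpu_ioctl_x86_set_vcpu_events(),
      reduced to the SMM-toggle case (leave_nested being the vendor-neutral
      "forcibly leave nested mode" hook):
      
        if (events->flags & KVM_VCPUEVENT_VALID_SMM) {
                /* Toggling SMM state forcibly exits nested virtualization. */
                if (!!(vcpu->arch.hflags & HF_SMM_MASK) != events->smi.smm)
                        kvm_x86_ops.nested_ops->leave_nested(vcpu);
        }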
      
      Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
      due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
      just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
      in an architecturally impossible state.
      
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Modules linked in:
        CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
        Call Trace:
         <TASK>
         kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
         kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
         kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
         kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
         kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
         kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
         kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
         __fput+0x286/0x9f0 fs/file_table.c:311
         task_work_run+0xdd/0x1a0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0xb29/0x2a30 kernel/exit.c:806
         do_group_exit+0xd2/0x2f0 kernel/exit.c:935
         get_signal+0x4b0/0x28c0 kernel/signal.c:2862
         arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
         do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220358.2091737-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f7e57078
  10. 08 December 2021, 9 commits
  11. 01 October 2021, 2 commits
  12. 30 September 2021, 1 commit
  13. 23 September 2021, 1 commit
  14. 22 September 2021, 1 commit
  15. 16 August 2021, 2 commits
  16. 02 August 2021, 1 commit
    • 
      KVM: nSVM: remove useless kvm_clear_*_queue · db105fab
      Committed by Paolo Bonzini
      For an event to be in injected state when nested_svm_vmrun executes,
      it must have come from exitintinfo when svm_complete_interrupts ran:
      
        vcpu_enter_guest
         static_call(kvm_x86_run) -> svm_vcpu_run
          svm_complete_interrupts
           // now the event went from "exitintinfo" to "injected"
         static_call(kvm_x86_handle_exit) -> handle_exit
          svm_invoke_exit_handler
            vmrun_interception
             nested_svm_vmrun
      
      However, no event could have been in exitintinfo before a VMRUN
      vmexit.  The code in svm.c is a bit more permissive than the one
      in vmx.c:
      
              if (is_external_interrupt(svm->vmcb->control.exit_int_info) &&
                  exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR &&
                  exit_code != SVM_EXIT_NPF && exit_code != SVM_EXIT_TASK_SWITCH &&
                  exit_code != SVM_EXIT_INTR && exit_code != SVM_EXIT_NMI)
      
      but in any case, a VMRUN instruction would not even start to execute
      during an attempted event delivery.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      db105fab
  17. 28 July 2021, 1 commit
  18. 26 July 2021, 2 commits