• S
    KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail · f35e2a68
    Sean Christopherson 提交于
    [ Upstream commit bd18bffca35397214ae68d85cf7203aca25c3c1d ]
    
    A VMEnter that VMFails (as opposed to VMExits) does not touch host
    state beyond registers that are explicitly noted in the VMFail path,
    e.g. EFLAGS.  Host state does not need to be loaded because VMFail
    is only signaled for consistency checks that occur before the CPU
    starts to load guest state, i.e. there is no need to restore any
    state as nothing has been modified.  But in the case where a VMFail
    is detected by hardware and not by KVM (due to deferring consistency
    checks to hardware), KVM has already loaded some amount of guest
    state.  Luckily, "loaded" only means loaded to KVM's software model,
    i.e. vmcs01 has not been modified.  So, unwind our software model to
    the pre-VMEntry host state.
    
    Not restoring host state in this VMFail path leads to a variety of
    failures because we end up with stale data in vcpu->arch, e.g. CR0,
    CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
    significant delta in the stale data is all but guaranteed to crash
    L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
    
    An alternative to this "soft" reload would be to load host state from
    vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
    wildly inconsistent with respect to the VMX architecture, e.g. an L1
    VMM with separate VMExit and VMFail paths would explode.
    
    Note that this approach does not mean KVM is 100% accurate with
    respect to VMX hardware behavior, even at an architectural level
    (the exact order of consistency checks is microarchitecture specific).
    But 100% emulation accuracy isn't the goal (with this patch), rather
    the goal is to be consistent in the information delivered to L1, e.g.
    a VMExit should not fall-through VMENTER, and a VMFail should not jump
    to HOST_RIP.
    
    This technically reverts commit "5af41573 (KVM: nVMX: Fix mmu
    context after VMLAUNCH/VMRESUME failure)", but retains the core
    aspects of that patch, just in an open coded form due to the need to
    pull state from vmcs01 instead of vmcs12.  Restoring host state
    resolves a variety of issues introduced by commit "4f350c6d
    (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
    which remedied the incorrect behavior of treating VMFail like VMExit
    but in doing so neglected to restore arch state that had been modified
    prior to attempting nested VMEnter.
    
    A sample failure that occurs due to stale vcpu.arch state is a fault
    of some form while emulating an LGDT (due to emulated UMIP) from L1
    after a failed VMEntry to L3, in this case when running the KVM unit
    test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
    case due to a stale arch.cr4.UMIP.
    
    L1:
      BUG: unable to handle kernel paging request at ffffc90000663b9e
      PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
      Oops: 0009 [#1] SMP
      CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:native_load_gdt+0x0/0x10
    
      ...
    
      Call Trace:
       load_fixmap_gdt+0x22/0x30
       __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
       vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
       nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
       vmx_handle_exit+0x246/0x15a0 [kvm_intel]
       kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
       kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
       do_vfs_ioctl+0x9f/0x600
       ksys_ioctl+0x66/0x70
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x4f/0x100
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    L0:
      WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
      ...
      CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
      Hardware name: Intel Corporation Kabylake Client platform/KBL S
      RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
    
      ...
    
      Call Trace:
       kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
       kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
       do_vfs_ioctl+0x9f/0x5e0
       ksys_ioctl+0x66/0x70
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x49/0xf0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
    
    Fixes: 5af41573 (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
    Fixes: 4f350c6d (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
    Cc: Jim Mattson <jmattson@google.com>
    Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Radim KrÄmář <rkrcmar@redhat.com>
    Cc: Wanpeng Li <wanpeng.li@hotmail.com>
    Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
    Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: NSasha Levin <sashal@kernel.org>
    f35e2a68
vmx.c 409.7 KB