1. 16 4月, 2019 6 次提交
    • S
      KVM: x86: clear SMM flags before loading state while leaving SMM · 9ec19493
      Sean Christopherson 提交于
      RSM emulation is currently broken on VMX when the interrupted guest has
      CR4.VMXE=1.  Stop dancing around the issue of HF_SMM_MASK being set when
      loading SMSTATE into architectural state, e.g. by toggling it for
      problematic flows, and simply clear HF_SMM_MASK prior to loading
      architectural state (from SMRAM save state area).
      Reported-by: NJon Doron <arilou@gmail.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Fixes: 5bea5123 ("KVM: VMX: check nested state and CR4.VMXE against SMM")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Tested-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9ec19493
    • S
      KVM: x86: Load SMRAM in a single shot when leaving SMM · ed19321f
      Sean Christopherson 提交于
      RSM emulation is currently broken on VMX when the interrupted guest has
      CR4.VMXE=1.  Rather than dance around the issue of HF_SMM_MASK being set
      when loading SMSTATE into architectural state, ideally RSM emulation
      itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM
      architectural state.
      
      Ostensibly, the only motivation for having HF_SMM_MASK set throughout
      the loading of state from the SMRAM save state area is so that the
      memory accesses from GET_SMSTATE() are tagged with role.smm.  Load
      all of the SMRAM save state area from guest memory at the beginning of
      RSM emulation, and load state from the buffer instead of reading guest
      memory one-by-one.
      
      This paves the way for clearing HF_SMM_MASK prior to loading state,
      and also aligns RSM with the enter_smm() behavior, which fills a
      buffer and writes SMRAM save state in a single go.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ed19321f
    • W
      x86/kvm: move kvm_load/put_guest_xcr0 into atomic context · 1811d979
      WANG Chao 提交于
      guest xcr0 could leak into host when MCE happens in guest mode. Because
      do_machine_check() could schedule out at a few places.
      
      For example:
      
      kvm_load_guest_xcr0
      ...
      kvm_x86_ops->run(vcpu) {
        vmx_vcpu_run
          vmx_complete_atomic_exit
            kvm_machine_check
              do_machine_check
                do_memory_failure
                  memory_failure
                    lock_page
      
      In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
      out, host cpu has guest xcr0 loaded (0xff).
      
      In __switch_to {
           switch_fpu_finish
             copy_kernel_to_fpregs
               XRSTORS
      
      If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will
      generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in
      and tries to reinitialize fpu by restoring init fpu state. Same story as
      last #GP, except we get DOUBLE FAULT this time.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NWANG Chao <chao.wang@ucloud.cn>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1811d979
    • V
      KVM: x86: svm: make sure NMI is injected after nmi_singlestep · 99c22179
      Vitaly Kuznetsov 提交于
      I noticed that apic test from kvm-unit-tests always hangs on my EPYC 7401P,
      the hanging test nmi-after-sti is trying to deliver 30000 NMIs and tracing
      shows that we're sometimes able to deliver a few but never all.
      
      When we're trying to inject an NMI we may fail to do so immediately for
      various reasons, however, we still need to inject it so enable_nmi_window()
      arms nmi_singlestep mode. #DB occurs as expected, but we're not checking
      for pending NMIs before entering the guest and unless there's a different
      event to process, the NMI will never get delivered.
      
      Make KVM_REQ_EVENT request on the vCPU from db_interception() to make sure
      pending NMIs are checked and possibly injected.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      99c22179
    • S
      svm/avic: Fix invalidate logical APIC id entry · e44e3eac
      Suthikulpanit, Suravee 提交于
      Only clear the valid bit when invalidate logical APIC id entry.
      The current logic clear the valid bit, but also set the rest of
      the bits (including reserved bits) to 1.
      
      Fixes: 98d90582 ('svm: Fix AVIC DFR and LDR handling')
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e44e3eac
    • S
      Revert "svm: Fix AVIC incomplete IPI emulation" · 4a58038b
      Suthikulpanit, Suravee 提交于
      This reverts commit bb218fbc.
      
      As Oren Twaig pointed out the old discussion:
      
        https://patchwork.kernel.org/patch/8292231/
      
      that the change coud potentially cause an extra IPI to be sent to
      the destination vcpu because the AVIC hardware already set the IRR bit
      before the incomplete IPI #VMEXIT with id=1 (target vcpu is not running).
      Since writting to ICR and ICR2 will also set the IRR. If something triggers
      the destination vcpu to get scheduled before the emulation finishes, then
      this could result in an additional IPI.
      
      Also, the issue mentioned in the commit bb218fbc was misdiagnosed.
      
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: NOren Twaig <oren@scalemp.com>
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4a58038b
  2. 06 4月, 2019 2 次提交
  3. 29 3月, 2019 1 次提交
    • S
      KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation) · 05d5a486
      Singh, Brijesh 提交于
      Errata#1096:
      
      On a nested data page fault when CR.SMAP=1 and the guest data read
      generates a SMAP violation, GuestInstrBytes field of the VMCB on a
      VMEXIT will incorrectly return 0h instead the correct guest
      instruction bytes .
      
      Recommend Workaround:
      
      To determine what instruction the guest was executing the hypervisor
      will have to decode the instruction at the instruction pointer.
      
      The recommended workaround can not be implemented for the SEV
      guest because guest memory is encrypted with the guest specific key,
      and instruction decoder will not be able to decode the instruction
      bytes. If we hit this errata in the SEV guest then log the message
      and request a guest shutdown.
      Reported-by: NVenkatesh Srinivas <venkateshs@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      05d5a486
  4. 21 2月, 2019 3 次提交
    • B
      kvm: svm: Add memcg accounting to KVM allocations · 1ec69647
      Ben Gardon 提交于
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      Signed-off-by: NBen Gardon <bgardon@google.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1ec69647
    • S
      svm: Fix improper check when deactivate AVIC · c57cd3c8
      Suthikulpanit, Suravee 提交于
      The function svm_refresh_apicv_exec_ctrl() always returning prematurely
      as kvm_vcpu_apicv_active() always return false when calling from
      the function arch/x86/kvm/x86.c:kvm_vcpu_deactivate_apicv().
      This is because the apicv_active is set to false just before calling
      refresh_apicv_exec_ctrl().
      
      Also, we need to mark VMCB_AVIC bit as dirty instead of VMCB_INTR.
      
      So, fix svm_refresh_apicv_exec_ctrl() to properly deactivate AVIC.
      
      Fixes: 67034bb9 ('KVM: SVM: Add irqchip_split() checks before enabling AVIC')
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c57cd3c8
    • S
      svm: Fix AVIC DFR and LDR handling · 98d90582
      Suthikulpanit, Suravee 提交于
      Current SVM AVIC driver makes two incorrect assumptions:
        1. APIC LDR register cannot be zero
        2. APIC DFR for all vCPUs must be the same
      
      LDR=0 means the local APIC does not support logical destination mode.
      Therefore, the driver should mark any previously assigned logical APIC ID
      table entry as invalid, and return success.  Also, DFR is specific to
      a particular local APIC, and can be different among all vCPUs
      (as observed on Windows 10).
      
      These incorrect assumptions cause Windows 10 and FreeBSD VMs to fail
      to boot with AVIC enabled. So, instead of flush the whole logical APIC ID
      table, handle DFR and LDR for each vCPU independently.
      
      Fixes: 18f40c53 ('svm: Add VMEXIT handlers for AVIC')
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: NJulian Stecklina <jsteckli@amazon.de>
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      98d90582
  5. 26 1月, 2019 4 次提交
    • G
      KVM: x86: Mark expected switch fall-throughs · b2869f28
      Gustavo A. R. Silva 提交于
      In preparation to enabling -Wimplicit-fallthrough, mark switch
      cases where we are expecting to fall through.
      
      This patch fixes the following warnings:
      
      arch/x86/kvm/lapic.c:1037:27: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/lapic.c:1876:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/hyperv.c:1637:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/svm.c:4396:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/mmu.c:4372:36: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/x86.c:3835:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/x86.c:7938:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/vmx/vmx.c:2015:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/vmx/vmx.c:1773:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      
      Warning level 3 was used: -Wimplicit-fallthrough=3
      
      This patch is part of the ongoing efforts to enabling -Wimplicit-fallthrough.
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b2869f28
    • V
      KVM: nSVM: clear events pending from svm_complete_interrupts() when exiting to L1 · 619ad846
      Vitaly Kuznetsov 提交于
      kvm-unit-tests' eventinj "NMI failing on IDT" test results in NMI being
      delivered to the host (L1) when it's running nested. The problem seems to
      be: svm_complete_interrupts() raises 'nmi_injected' flag but later we
      decide to reflect EXIT_NPF to L1. The flag remains pending and we do NMI
      injection upon entry so it got delivered to L1 instead of L2.
      
      It seems that VMX code solves the same issue in prepare_vmcs12(), this was
      introduced with code refactoring in commit 5f3d5799 ("KVM: nVMX: Rework
      event injection and recovery").
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      619ad846
    • S
      svm: Fix AVIC incomplete IPI emulation · bb218fbc
      Suravee Suthikulpanit 提交于
      In case of incomplete IPI with invalid interrupt type, the current
      SVM driver does not properly emulate the IPI, and fails to boot
      FreeBSD guests with multiple vcpus when enabling AVIC.
      
      Fix this by update APIC ICR high/low registers, which also
      emulate sending the IPI.
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bb218fbc
    • S
      svm: Add warning message for AVIC IPI invalid target · 37ef0c44
      Suravee Suthikulpanit 提交于
      Print warning message when IPI target ID is invalid due to one of
      the following reasons:
        * In logical mode: cluster > max_cluster (64)
        * In physical mode: target > max_physical (512)
        * Address is not present in the physical or logical ID tables
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      37ef0c44
  6. 12 1月, 2019 1 次提交
  7. 21 12月, 2018 4 次提交
  8. 20 12月, 2018 1 次提交
    • V
      KVM: x86: nSVM: fix switch to guest mmu · 3cf85f9f
      Vitaly Kuznetsov 提交于
      Recent optimizations in MMU code broke nested SVM with NPT in L1
      completely: when we do nested_svm_{,un}init_mmu_context() we want
      to switch from TDP MMU to shadow MMU, both init_kvm_tdp_mmu() and
      kvm_init_shadow_mmu() check if re-configuration is needed by looking
      at cache source data. The data, however, doesn't change - it's only
      the type of the MMU which changes. We end up not re-initializing
      guest MMU as shadow and everything goes off the rails.
      
      The issue could have been fixed by putting MMU type into extended MMU
      role but this is not really needed. We can just split root and guest MMUs
      the exact same way we did for nVMX, their types never change in the
      lifetime of a vCPU.
      
      There is still room for improvement: currently, we reset all MMU roots
      when switching from L1 to L2 and back and this is not needed.
      
      Fixes: 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3cf85f9f
  9. 15 12月, 2018 3 次提交
  10. 27 11月, 2018 4 次提交
  11. 18 10月, 2018 2 次提交
  12. 17 10月, 2018 2 次提交
  13. 10 10月, 2018 1 次提交
    • P
      KVM: x86: support CONFIG_KVM_AMD=y with CONFIG_CRYPTO_DEV_CCP_DD=m · 853c1109
      Paolo Bonzini 提交于
      SEV requires access to the AMD cryptographic device APIs, and this
      does not work when KVM is builtin and the crypto driver is a module.
      Actually the Kconfig conditions for CONFIG_KVM_AMD_SEV try to disable
      SEV in that case, but it does not work because the actual crypto
      calls are not culled, only sev_hardware_setup() is.
      
      This patch adds two CONFIG_KVM_AMD_SEV checks that gate all the remaining
      SEV code; it fixes this particular configuration, and drops 5 KiB of
      code when CONFIG_KVM_AMD_SEV=n.
      Reported-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      853c1109
  14. 20 9月, 2018 2 次提交
    • S
      KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c
      Sean Christopherson 提交于
      A VMX preemption timer value of '0' is guaranteed to cause a VMExit
      prior to the CPU executing any instructions in the guest.  Use the
      preemption timer (if it's supported) to trigger immediate VMExit
      in place of the current method of sending a self-IPI.  This ensures
      that pending VMExit injection to L1 occurs prior to executing any
      instructions in the guest (regardless of nesting level).
      
      When deferring VMExit injection, KVM generates an immediate VMExit
      from the (possibly nested) guest by sending itself an IPI.  Because
      hardware interrupts are blocked prior to VMEnter and are unblocked
      (in hardware) after VMEnter, this results in taking a VMExit(INTR)
      before any guest instruction is executed.  But, as this approach
      relies on the IPI being received before VMEnter executes, it only
      works as intended when KVM is running as L0.  Because there are no
      architectural guarantees regarding when IPIs are delivered, when
      running nested the INTR may "arrive" long after L2 is running e.g.
      L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
      
      For the most part, this unintended delay is not an issue since the
      events being injected to L1 also do not have architectural guarantees
      regarding their timing.  The notable exception is the VMX preemption
      timer[1], which is architecturally guaranteed to cause a VMExit prior
      to executing any instructions in the guest if the timer value is '0'
      at VMEnter.  Specifically, the delay in injecting the VMExit causes
      the preemption timer KVM unit test to fail when run in a nested guest.
      
      Note: this approach is viable even on CPUs with a broken preemption
      timer, as broken in this context only means the timer counts at the
      wrong rate.  There are no known errata affecting timer value of '0'.
      
      [1] I/O SMIs also have guarantees on when they arrive, but I have
          no idea if/how those are emulated in KVM.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d264ee0c
    • A
      KVM: SVM: Switch to bitmap_zalloc() · a101c9d6
      Andy Shevchenko 提交于
      Switch to bitmap_zalloc() to show clearly what we are allocating.
      Besides that it returns pointer of bitmap type instead of opaque void *.
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a101c9d6
  15. 30 8月, 2018 3 次提交
  16. 22 8月, 2018 1 次提交
    • T
      KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled · 024d83ca
      Thomas Gleixner 提交于
      Mikhail reported the following lockdep splat:
      
      WARNING: possible irq lock inversion dependency detected
      CPU 0/KVM/10284 just changed the state of lock:
        000000000d538a88 (&st->lock){+...}, at:
        speculative_store_bypass_update+0x10b/0x170
      
      but this lock was taken by another, HARDIRQ-safe lock
      in the past:
      
      (&(&sighand->siglock)->rlock){-.-.}
      
         and interrupts could create inverse lock ordering between them.
      
      Possible interrupt unsafe locking scenario:
      
          CPU0                    CPU1
          ----                    ----
         lock(&st->lock);
                                 local_irq_disable();
                                 lock(&(&sighand->siglock)->rlock);
                                 lock(&st->lock);
          <Interrupt>
           lock(&(&sighand->siglock)->rlock);
           *** DEADLOCK ***
      
      The code path which connects those locks is:
      
         speculative_store_bypass_update()
         ssb_prctl_set()
         do_seccomp()
         do_syscall_64()
      
      In svm_vcpu_run() speculative_store_bypass_update() is called with
      interupts enabled via x86_virt_spec_ctrl_set_guest/host().
      
      This is actually a false positive, because GIF=0 so interrupts are
      disabled even if IF=1; however, we can easily move the invocations of
      x86_virt_spec_ctrl_set_guest/host() into the interrupt disabled region to
      cure it, and it's a good idea to keep the GIF=0/IF=1 area as small
      and self-contained as possible.
      
      Fixes: 1f50ddb4 ("x86/speculation: Handle HT correctly on AMD")
      Reported-by: NMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: kvm@vger.kernel.org
      Cc: x86@kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      024d83ca