1. 05 5月, 2019 1 次提交
  2. 27 4月, 2019 3 次提交
    • S
      Revert "svm: Fix AVIC incomplete IPI emulation" · 3e1b3e4d
      Suthikulpanit, Suravee 提交于
      commit 4a58038b9e420276157785afa0a0bbb4b9bc2265 upstream.
      
      This reverts commit bb218fbcfaaa3b115d4cd7a43c0ca164f3a96e57.
      
      As Oren Twaig pointed out the old discussion:
      
        https://patchwork.kernel.org/patch/8292231/
      
      that the change coud potentially cause an extra IPI to be sent to
      the destination vcpu because the AVIC hardware already set the IRR bit
      before the incomplete IPI #VMEXIT with id=1 (target vcpu is not running).
      Since writting to ICR and ICR2 will also set the IRR. If something triggers
      the destination vcpu to get scheduled before the emulation finishes, then
      this could result in an additional IPI.
      
      Also, the issue mentioned in the commit bb218fbcfaaa was misdiagnosed.
      
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: NOren Twaig <oren@scalemp.com>
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3e1b3e4d
    • V
      KVM: x86: svm: make sure NMI is injected after nmi_singlestep · c9e34935
      Vitaly Kuznetsov 提交于
      commit 99c221796a810055974b54c02e8f53297e48d146 upstream.
      
      I noticed that apic test from kvm-unit-tests always hangs on my EPYC 7401P,
      the hanging test nmi-after-sti is trying to deliver 30000 NMIs and tracing
      shows that we're sometimes able to deliver a few but never all.
      
      When we're trying to inject an NMI we may fail to do so immediately for
      various reasons, however, we still need to inject it so enable_nmi_window()
      arms nmi_singlestep mode. #DB occurs as expected, but we're not checking
      for pending NMIs before entering the guest and unless there's a different
      event to process, the NMI will never get delivered.
      
      Make KVM_REQ_EVENT request on the vCPU from db_interception() to make sure
      pending NMIs are checked and possibly injected.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9e34935
    • S
      KVM: x86: Don't clear EFER during SMM transitions for 32-bit vCPU · 18cf09a8
      Sean Christopherson 提交于
      commit 8f4dc2e77cdfaf7e644ef29693fa229db29ee1de upstream.
      
      Neither AMD nor Intel CPUs have an EFER field in the legacy SMRAM save
      state area, i.e. don't save/restore EFER across SMM transitions.  KVM
      somewhat models this, e.g. doesn't clear EFER on entry to SMM if the
      guest doesn't support long mode.  But during RSM, KVM unconditionally
      clears EFER so that it can get back to pure 32-bit mode in order to
      start loading CRs with their actual non-SMM values.
      
      Clear EFER only when it will be written when loading the non-SMM state
      so as to preserve bits that can theoretically be set on 32-bit vCPUs,
      e.g. KVM always emulates EFER_SCE.
      
      And because CR4.PAE is cleared only to play nice with EFER, wrap that
      code in the long mode check as well.  Note, this may result in a
      compiler warning about cr4 being consumed uninitialized.  Re-read CR4
      even though it's technically unnecessary, as doing so allows for more
      readable code and RSM emulation is not a performance critical path.
      
      Fixes: 660a5d51 ("KVM: x86: save/load state on SMM switch")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18cf09a8
  3. 20 4月, 2019 1 次提交
    • S
      KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail · f35e2a68
      Sean Christopherson 提交于
      [ Upstream commit bd18bffca35397214ae68d85cf7203aca25c3c1d ]
      
      A VMEnter that VMFails (as opposed to VMExits) does not touch host
      state beyond registers that are explicitly noted in the VMFail path,
      e.g. EFLAGS.  Host state does not need to be loaded because VMFail
      is only signaled for consistency checks that occur before the CPU
      starts to load guest state, i.e. there is no need to restore any
      state as nothing has been modified.  But in the case where a VMFail
      is detected by hardware and not by KVM (due to deferring consistency
      checks to hardware), KVM has already loaded some amount of guest
      state.  Luckily, "loaded" only means loaded to KVM's software model,
      i.e. vmcs01 has not been modified.  So, unwind our software model to
      the pre-VMEntry host state.
      
      Not restoring host state in this VMFail path leads to a variety of
      failures because we end up with stale data in vcpu->arch, e.g. CR0,
      CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
      significant delta in the stale data is all but guaranteed to crash
      L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
      
      An alternative to this "soft" reload would be to load host state from
      vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
      wildly inconsistent with respect to the VMX architecture, e.g. an L1
      VMM with separate VMExit and VMFail paths would explode.
      
      Note that this approach does not mean KVM is 100% accurate with
      respect to VMX hardware behavior, even at an architectural level
      (the exact order of consistency checks is microarchitecture specific).
      But 100% emulation accuracy isn't the goal (with this patch), rather
      the goal is to be consistent in the information delivered to L1, e.g.
      a VMExit should not fall-through VMENTER, and a VMFail should not jump
      to HOST_RIP.
      
      This technically reverts commit "5af41573 (KVM: nVMX: Fix mmu
      context after VMLAUNCH/VMRESUME failure)", but retains the core
      aspects of that patch, just in an open coded form due to the need to
      pull state from vmcs01 instead of vmcs12.  Restoring host state
      resolves a variety of issues introduced by commit "4f350c6d
      (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
      which remedied the incorrect behavior of treating VMFail like VMExit
      but in doing so neglected to restore arch state that had been modified
      prior to attempting nested VMEnter.
      
      A sample failure that occurs due to stale vcpu.arch state is a fault
      of some form while emulating an LGDT (due to emulated UMIP) from L1
      after a failed VMEntry to L3, in this case when running the KVM unit
      test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
      case due to a stale arch.cr4.UMIP.
      
      L1:
        BUG: unable to handle kernel paging request at ffffc90000663b9e
        PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
        Oops: 0009 [#1] SMP
        CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:native_load_gdt+0x0/0x10
      
        ...
      
        Call Trace:
         load_fixmap_gdt+0x22/0x30
         __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
         vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
         nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
         vmx_handle_exit+0x246/0x15a0 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
         kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
         do_vfs_ioctl+0x9f/0x600
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x4f/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      L0:
        WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
        ...
        CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
        Hardware name: Intel Corporation Kabylake Client platform/KBL S
        RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
      
        ...
      
        Call Trace:
         kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
         kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
         do_vfs_ioctl+0x9f/0x5e0
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x49/0xf0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 5af41573 (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
      Fixes: 4f350c6d (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim KrÄmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f35e2a68
  4. 17 4月, 2019 4 次提交
  5. 03 4月, 2019 2 次提交
    • S
      KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 3a18eaba
      Sean Christopherson 提交于
      commit 0cf9135b773bf32fba9dd8e6699c1b331ee4b749 upstream.
      
      The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
      
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: NXiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a18eaba
    • S
      KVM: x86: update %rip after emulating IO · b9733a74
      Sean Christopherson 提交于
      commit 45def77ebf79e2e8942b89ed79294d97ce914fa0 upstream.
      
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corruping %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks becase re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to executing to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single step exits to userspace for are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison has negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcenment suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9733a74
  6. 24 3月, 2019 6 次提交
    • S
      KVM: nVMX: Ignore limit checks on VMX instructions using flat segments · 5ffb710b
      Sean Christopherson 提交于
      commit 34333cc6c2cb021662fd32e24e618d1b86de95bf upstream.
      
      Regarding segments with a limit==0xffffffff, the SDM officially states:
      
          When the effective limit is FFFFFFFFH (4 GBytes), these accesses may
          or may not cause the indicated exceptions.  Behavior is
          implementation-specific and may vary from one execution to another.
      
      In practice, all CPUs that support VMX ignore limit checks for "flat
      segments", i.e. an expand-up data or code segment with base=0 and
      limit=0xffffffff.  This is subtly different than wrapping the effective
      address calculation based on the address size, as the flat segment
      behavior also applies to accesses that would wrap the 4g boundary, e.g.
      a 4-byte access starting at 0xffffffff will access linear addresses
      0xffffffff, 0x0, 0x1 and 0x2.
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ffb710b
    • S
      KVM: nVMX: Apply addr size mask to effective address for VMX instructions · 29b515c2
      Sean Christopherson 提交于
      commit 8570f9e881e3fde98801bb3a47eef84dd934d405 upstream.
      
      The address size of an instruction affects the effective address, not
      the virtual/linear address.  The final address may still be truncated,
      e.g. to 32-bits outside of long mode, but that happens irrespective of
      the address size, e.g. a 32-bit address size can yield a 64-bit virtual
      address when using FS/GS with a non-zero base.
      
      Fixes: 064aea77 ("KVM: nVMX: Decoding memory operands of VMX instructions")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      29b515c2
    • S
      KVM: nVMX: Sign extend displacements of VMX instr's mem operands · 9ce0ffeb
      Sean Christopherson 提交于
      commit 946c522b603f281195af1df91837a1d4d1eb3bc9 upstream.
      
      The VMCS.EXIT_QUALIFCATION field reports the displacements of memory
      operands for various instructions, including VMX instructions, as a
      naturally sized unsigned value, but masks the value by the addr size,
      e.g. given a ModRM encoded as -0x28(%ebp), the -0x28 displacement is
      reported as 0xffffffd8 for a 32-bit address size.  Despite some weird
      wording regarding sign extension, the SDM explicitly states that bits
      beyond the instructions address size are undefined:
      
          In all cases, bits of this field beyond the instruction’s address
          size are undefined.
      
      Failure to sign extend the displacement results in KVM incorrectly
      treating a negative displacement as a large positive displacement when
      the address size of the VMX instruction is smaller than KVM's native
      size, e.g. a 32-bit address size on a 64-bit KVM.
      
      The very original decoding, added by commit 064aea77 ("KVM: nVMX:
      Decoding memory operands of VMX instructions"), sort of modeled sign
      extension by truncating the final virtual/linear address for a 32-bit
      address size.  I.e. it messed up the effective address but made it work
      by adjusting the final address.
      
      When segmentation checks were added, the truncation logic was kept
      as-is and no sign extension logic was introduced.  In other words, it
      kept calculating the wrong effective address while mostly generating
      the correct virtual/linear address.  As the effective address is what's
      used in the segment limit checks, this results in KVM incorreclty
      injecting #GP/#SS faults due to non-existent segment violations when
      a nested VMM uses negative displacements with an address size smaller
      than KVM's native address size.
      
      Using the -0x28(%ebp) example, an EBP value of 0x1000 will result in
      KVM using 0x100000fd8 as the effective address when checking for a
      segment limit violation.  This causes a 100% failure rate when running
      a 32-bit KVM build as L1 on top of a 64-bit KVM L0.
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9ce0ffeb
    • S
      KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux · c235af5a
      Sean Christopherson 提交于
      commit ddfd1730fd829743e41213e32ccc8b4aa6dc8325 upstream.
      
      When installing new memslots, KVM sets bit 0 of the generation number to
      indicate that an update is in-progress.  Until the update is complete,
      there are no guarantees as to whether a vCPU will see the old or the new
      memslots.  Explicity prevent caching MMIO accesses so as to avoid using
      an access cached from the old memslots after the new memslots have been
      installed.
      
      Note that it is unclear whether or not disabling caching during the
      update window is strictly necessary as there is no definitive
      documentation as to what ordering guarantees KVM provides with respect
      to updating memslots.  That being said, the MMIO spte code does not
      allow reusing sptes created while an update is in-progress, and the
      associated documentation explicitly states:
      
          We do not want to use an MMIO sptes created with an odd generation
          number, ...  If KVM is unlucky and creates an MMIO spte while the
          low bit is 1, the next access to the spte will always be a cache miss.
      
      At the very least, disabling the per-vCPU MMIO cache during updates will
      make its behavior consistent with the MMIO spte behavior and
      documentation.
      
      Fixes: 56f17dd3 ("kvm: x86: fix stale mmio cache bug")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c235af5a
    • S
      KVM: x86/mmu: Detect MMIO generation wrap in any address space · 656e9e5d
      Sean Christopherson 提交于
      commit e1359e2beb8b0a1188abc997273acbaedc8ee791 upstream.
      
      The check to detect a wrap of the MMIO generation explicitly looks for a
      generation number of zero.  Now that unique memslots generation numbers
      are assigned to each address space, only address space 0 will get a
      generation number of exactly zero when wrapping.  E.g. when address
      space 1 goes from 0x7fffe to 0x80002, the MMIO generation number will
      wrap to 0x2.  Adjust the MMIO generation to strip the address space
      modifier prior to checking for a wrap.
      
      Fixes: 4bd518f1 ("KVM: use separate generations for each address space")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      656e9e5d
    • S
      KVM: Call kvm_arch_memslots_updated() before updating memslots · 23ad135a
      Sean Christopherson 提交于
      commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.
      
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      23ad135a
  7. 06 3月, 2019 2 次提交
  8. 27 2月, 2019 1 次提交
  9. 20 2月, 2019 3 次提交
  10. 13 2月, 2019 4 次提交
    • J
      cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM · 97a7fa90
      Josh Poimboeuf 提交于
      commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream.
      
      With the following commit:
      
        73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      
      ... the hotplug code attempted to detect when SMT was disabled by BIOS,
      in which case it reported SMT as permanently disabled.  However, that
      code broke a virt hotplug scenario, where the guest is booted with only
      primary CPU threads, and a sibling is brought online later.
      
      The problem is that there doesn't seem to be a way to reliably
      distinguish between the HW "SMT disabled by BIOS" case and the virt
      "sibling not yet brought online" case.  So the above-mentioned commit
      was a bit misguided, as it permanently disabled SMT for both cases,
      preventing future virt sibling hotplugs.
      
      Going back and reviewing the original problems which were attempted to
      be solved by that commit, when SMT was disabled in BIOS:
      
        1) /sys/devices/system/cpu/smt/control showed "on" instead of
           "notsupported"; and
      
        2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
      
      I'd propose that we instead consider #1 above to not actually be a
      problem.  Because, at least in the virt case, it's possible that SMT
      wasn't disabled by BIOS and a sibling thread could be brought online
      later.  So it makes sense to just always default the smt control to "on"
      to allow for that possibility (assuming cpuid indicates that the CPU
      supports SMT).
      
      The real problem is #2, which has a simple fix: change vmx_vm_init() to
      query the actual current SMT state -- i.e., whether any siblings are
      currently online -- instead of looking at the SMT "control" sysfs value.
      
      So fix it by:
      
        a) reverting the original "fix" and its followup fix:
      
           73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
           bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation")
      
           and
      
        b) changing vmx_vm_init() to query the actual current SMT state --
           instead of the sysfs control value -- to determine whether the L1TF
           warning is needed.  This also requires the 'sched_smt_present'
           variable to exported, instead of 'cpu_smt_control'.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: NIgor Mammedov <imammedo@redhat.com>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      97a7fa90
    • P
      KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221) · 236fd677
      Peter Shier 提交于
      commit ecec76885bcfe3294685dc363fd1273df0d5d65f upstream.
      
      Bugzilla: 1671904
      
      There are multiple code paths where an hrtimer may have been started to
      emulate an L1 VMX preemption timer that can result in a call to free_nested
      without an intervening L2 exit where the hrtimer is normally
      cancelled. Unconditionally cancel in free_nested to cover all cases.
      
      Embargoed until Feb 7th 2019.
      Signed-off-by: NPeter Shier <pshier@google.com>
      Reported-by: NJim Mattson <jmattson@google.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Reported-by: NFelix Wilhelm <fwilhelm@google.com>
      Cc: stable@kernel.org
      Message-Id: <20181011184646.154065-1-pshier@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      236fd677
    • P
      KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222) · 5a45d372
      Paolo Bonzini 提交于
      commit 353c0956a618a07ba4bbe7ad00ff29fe70e8412a upstream.
      
      Bugzilla: 1671930
      
      Emulation of certain instructions (VMXON, VMCLEAR, VMPTRLD, VMWRITE with
      memory operand, INVEPT, INVVPID) can incorrectly inject a page fault
      when passed an operand that points to an MMIO address.  The page fault
      will use uninitialized kernel stack memory as the CR2 and error code.
      
      The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR
      exit to userspace; however, it is not an easy fix, so for now just
      ensure that the error code and CR2 are zero.
      
      Embargoed until Feb 7th 2019.
      Reported-by: NFelix Wilhelm <fwilhelm@google.com>
      Cc: stable@kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a45d372
    • V
      KVM: x86: svm: report MSR_IA32_MCG_EXT_CTL as unsupported · 84a4572b
      Vitaly Kuznetsov 提交于
      [ Upstream commit e87555e550cef4941579cd879759a7c0dee24e68 ]
      
      AMD doesn't seem to implement MSR_IA32_MCG_EXT_CTL and svm code in kvm
      knows nothing about it, however, this MSR is among emulated_msrs and
      thus returned with KVM_GET_MSR_INDEX_LIST. The consequent KVM_GET_MSRS,
      of course, fails.
      
      Report the MSR as unsupported to not confuse userspace.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      84a4572b
  11. 31 1月, 2019 4 次提交
  12. 10 1月, 2019 1 次提交
  13. 29 12月, 2018 3 次提交
    • C
      KVM: Fix UAF in nested posted interrupt processing · 1972ca04
      Cfir Cohen 提交于
      commit c2dd5146e9fe1f22c77c1b011adf84eea0245806 upstream.
      
      nested_get_vmcs12_pages() processes the posted_intr address in vmcs12. It
      caches the kmap()ed page object and pointer, however, it doesn't handle
      errors correctly: it's possible to cache a valid pointer, then release
      the page and later dereference the dangling pointer.
      
      I was able to reproduce with the following steps:
      
      1. Call vmlaunch with valid posted_intr_desc_addr but an invalid
      MSR_EFER. This causes nested_get_vmcs12_pages() to cache the kmap()ed
      pi_desc_page and pi_desc. Later the invalid EFER value fails
      check_vmentry_postreqs() which fails the first vmlaunch.
      
      2. Call vmlanuch with a valid EFER but an invalid posted_intr_desc_addr
      (I set it to 2G - 0x80). The second time we call nested_get_vmcs12_pages
      pi_desc_page is unmapped and released and pi_desc_page is set to NULL
      (the "shouldn't happen" clause). Due to the invalid
      posted_intr_desc_addr, kvm_vcpu_gpa_to_page() fails and
      nested_get_vmcs12_pages() returns. It doesn't return an error value so
      vmlaunch proceeds. Note that at this time we have a dangling pointer in
      vmx->nested.pi_desc and POSTED_INTR_DESC_ADDR in L0's vmcs.
      
      3. Issue an IPI in L2 guest code. This triggers a call to
      vmx_complete_nested_posted_interrupt() and pi_test_and_clear_on() which
      dereferences the dangling pointer.
      
      Vulnerable code requires nested and enable_apicv variables to be set to
      true. The host CPU must also support posted interrupts.
      
      Fixes: 5e2f30b7 "KVM: nVMX: get rid of nested_get_page()"
      Cc: stable@vger.kernel.org
      Reviewed-by: NAndy Honig <ahonig@google.com>
      Signed-off-by: NCfir Cohen <cfir@google.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1972ca04
    • E
      kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs · 229468c6
      Eduardo Habkost 提交于
      commit 0e1b869fff60c81b510c2d00602d778f8f59dd9a upstream.
      
      Some guests OSes (including Windows 10) write to MSR 0xc001102c
      on some cases (possibly while trying to apply a CPU errata).
      Make KVM ignore reads and writes to that MSR, so the guest won't
      crash.
      
      The MSR is documented as "Execution Unit Configuration (EX_CFG)",
      at AMD's "BIOS and Kernel Developer's Guide (BKDG) for AMD Family
      15h Models 00h-0Fh Processors".
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      229468c6
    • W
      KVM: X86: Fix NULL deref in vcpu_scan_ioapic · 76281d12
      Wanpeng Li 提交于
      commit dcbd3e49c2f0b2c2d8a321507ff8f3de4af76d7c upstream.
      
      Reported by syzkaller:
      
          CPU: 1 PID: 5962 Comm: syz-executor118 Not tainted 4.20.0-rc6+ #374
          Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
          RIP: 0010:kvm_apic_hw_enabled arch/x86/kvm/lapic.h:169 [inline]
          RIP: 0010:vcpu_scan_ioapic arch/x86/kvm/x86.c:7449 [inline]
          RIP: 0010:vcpu_enter_guest arch/x86/kvm/x86.c:7602 [inline]
          RIP: 0010:vcpu_run arch/x86/kvm/x86.c:7874 [inline]
          RIP: 0010:kvm_arch_vcpu_ioctl_run+0x5296/0x7320 arch/x86/kvm/x86.c:8074
          Call Trace:
      	 kvm_vcpu_ioctl+0x5c8/0x1150 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2596
      	 vfs_ioctl fs/ioctl.c:46 [inline]
      	 file_ioctl fs/ioctl.c:509 [inline]
      	 do_vfs_ioctl+0x1de/0x1790 fs/ioctl.c:696
      	 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:713
      	 __do_sys_ioctl fs/ioctl.c:720 [inline]
      	 __se_sys_ioctl fs/ioctl.c:718 [inline]
      	 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
      	 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that the testcase writes hyperv synic HV_X64_MSR_SINT14 msr
      and triggers scan ioapic logic to load synic vectors into EOI exit bitmap.
      However, irqchip is not initialized by this simple testcase, ioapic/apic
      objects should not be accessed.
      
      This patch fixes it by also considering whether or not apic is present.
      
      Reported-by: syzbot+39810e6c400efadfef71@syzkaller.appspotmail.com
      Cc: stable@vger.kernel.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      76281d12
  14. 17 12月, 2018 3 次提交
    • Y
      x86/kvm/vmx: fix old-style function declaration · bf1b47f3
      Yi Wang 提交于
      [ Upstream commit 1e4329ee2c52692ea42cc677fb2133519718b34a ]
      
      The inline keyword which is not at the beginning of the function
      declaration may trigger the following build warnings, so let's fix it:
      
      arch/x86/kvm/vmx.c:1309:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      arch/x86/kvm/vmx.c:5947:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      arch/x86/kvm/vmx.c:5985:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      arch/x86/kvm/vmx.c:6023:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      Signed-off-by: NYi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      bf1b47f3
    • Y
      KVM: x86: fix empty-body warnings · d6b1692d
      Yi Wang 提交于
      [ Upstream commit 354cb410d87314e2eda344feea84809e4261570a ]
      
      We get the following warnings about empty statements when building
      with 'W=1':
      
      arch/x86/kvm/lapic.c:632:53: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1907:42: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1936:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1975:44: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      
      Rework the debug helper macro to get rid of these warnings.
      Signed-off-by: NYi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d6b1692d
    • L
      KVM: VMX: Update shared MSRs to be saved/restored on MSR_EFER.LMA changes · 3c7670d5
      Liran Alon 提交于
      [ Upstream commit f48b4711dd6e1cf282f9dfd159c14a305909c97c ]
      
      When guest transitions from/to long-mode by modifying MSR_EFER.LMA,
      the list of shared MSRs to be saved/restored on guest<->host
      transitions is updated (See vmx_set_efer() call to setup_msrs()).
      
      On every entry to guest, vcpu_enter_guest() calls
      vmx_prepare_switch_to_guest(). This function should also take care
      of setting the shared MSRs to be saved/restored. However, the
      function does nothing in case we are already running with loaded
      guest state (vmx->loaded_cpu_state != NULL).
      
      This means that even when guest modifies MSR_EFER.LMA which results
      in updating the list of shared MSRs, it isn't being taken into account
      by vmx_prepare_switch_to_guest() because it happens while we are
      running with loaded guest state.
      
      To fix above mentioned issue, add a flag to mark that the list of
      shared MSRs has been updated and modify vmx_prepare_switch_to_guest()
      to set shared MSRs when running with host state *OR* list of shared
      MSRs has been updated.
      
      Note that this issue was mistakenly introduced by commit
      678e315e ("KVM: vmx: add dedicated utility to access guest's
      kernel_gs_base") because previously vmx_set_efer() always called
      vmx_load_host_state() which resulted in vmx_prepare_switch_to_guest() to
      set shared MSRs.
      
      Fixes: 678e315e ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base")
      Reported-by: NEyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: NMihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: NLiam Merwick <liam.merwick@oracle.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3c7670d5
  15. 08 12月, 2018 1 次提交
  16. 06 12月, 2018 1 次提交