1. 17 Apr 2019 (4 commits)
  2. 03 Apr 2019 (2 commits)
    • KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 3a18eaba
      Committed by Sean Christopherson
      commit 0cf9135b773bf32fba9dd8e6699c1b331ee4b749 upstream.
      
      The CPUID flag ARCH_CAPABILITIES is unconditionally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
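
      A minimal sketch of the common-code read handler this describes; the
      switch placement and surrounding code are assumptions, not quoted from
      the patch:

          case MSR_IA32_ARCH_CAPABILITIES:
                  /* Reject guest reads if the CPUID bit was not exposed. */
                  if (!msr_info->host_initiated &&
                      !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
                          return 1;
                  msr_info->data = vcpu->arch.arch_capabilities;
                  break;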
      
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a18eaba
    • KVM: x86: update %rip after emulating IO · b9733a74
      Committed by Sean Christopherson
      commit 45def77ebf79e2e8942b89ed79294d97ce914fa0 upstream.
      
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corrupting %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling, to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks because re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to exiting to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single-step exits to userspace are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison have negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcement suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
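
      A rough sketch of the snapshot-and-compare approach; the field path is
      an approximation, while kvm_get_linear_rip(), kvm_is_linear_rip() and
      kvm_skip_emulated_instruction() are existing KVM helpers:

          /* When setting up the exit to userspace, remember where the
           * PIO instruction lives. */
          vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);

          /* On completion, skip the instruction only if RIP still points
           * at the same PIO instruction, i.e. userspace did not reset or
           * otherwise redirect the vCPU in the meantime. */
          static int complete_fast_pio_out(struct kvm_vcpu *vcpu)
          {
                  if (!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip))
                          return 1;
                  return kvm_skip_emulated_instruction(vcpu);
          }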
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9733a74
  3. 24 Mar 2019 (6 commits)
    • KVM: nVMX: Ignore limit checks on VMX instructions using flat segments · 5ffb710b
      Committed by Sean Christopherson
      commit 34333cc6c2cb021662fd32e24e618d1b86de95bf upstream.
      
      Regarding segments with a limit==0xffffffff, the SDM officially states:
      
          When the effective limit is FFFFFFFFH (4 GBytes), these accesses may
          or may not cause the indicated exceptions.  Behavior is
          implementation-specific and may vary from one execution to another.
      
      In practice, all CPUs that support VMX ignore limit checks for "flat
      segments", i.e. an expand-up data or code segment with base=0 and
      limit=0xffffffff.  This is subtly different than wrapping the effective
      address calculation based on the address size, as the flat segment
      behavior also applies to accesses that would wrap the 4g boundary, e.g.
      a 4-byte access starting at 0xffffffff will access linear addresses
      0xffffffff, 0x0, 0x1 and 0x2.
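
      An illustrative fragment of the limit check with the flat-segment
      exemption; the variable names are assumptions, not the patch's exact
      code:

          /* Skip the limit check for a flat segment: base 0, limit
           * 0xffffffff, and either a code segment (type bit 3 set) or an
           * expand-up data segment (type bit 2 clear). */
          bool flat_seg = s.base == 0 && s.limit == 0xffffffff &&
                          ((s.type & 8) || !(s.type & 4));

          if (!flat_seg)
                  exn = exn || (off + size - 1 > s.limit);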
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ffb710b
    • KVM: nVMX: Apply addr size mask to effective address for VMX instructions · 29b515c2
      Committed by Sean Christopherson
      commit 8570f9e881e3fde98801bb3a47eef84dd934d405 upstream.
      
      The address size of an instruction affects the effective address, not
      the virtual/linear address.  The final address may still be truncated,
      e.g. to 32-bits outside of long mode, but that happens irrespective of
      the address size, e.g. a 32-bit address size can yield a 64-bit virtual
      address when using FS/GS with a non-zero base.
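
      A short sketch of the intended order of operations; the variable names
      are illustrative:

          /* Truncate the *effective* address to the instruction's address
           * size first, then apply segmentation to form the linear address. */
          off &= addr_size_mask;        /* e.g. 0xffffffff for a 32-bit size */
          gva = s.base + off;           /* may still be 64-bit with a
                                           non-zero FS/GS base */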
      
      Fixes: 064aea77 ("KVM: nVMX: Decoding memory operands of VMX instructions")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      29b515c2
    • KVM: nVMX: Sign extend displacements of VMX instr's mem operands · 9ce0ffeb
      Committed by Sean Christopherson
      commit 946c522b603f281195af1df91837a1d4d1eb3bc9 upstream.
      
      The VMCS.EXIT_QUALIFICATION field reports the displacements of memory
      operands for various instructions, including VMX instructions, as a
      naturally sized unsigned value, but masks the value by the addr size,
      e.g. given a ModRM encoded as -0x28(%ebp), the -0x28 displacement is
      reported as 0xffffffd8 for a 32-bit address size.  Despite some weird
      wording regarding sign extension, the SDM explicitly states that bits
      beyond the instruction's address size are undefined:
      
          In all cases, bits of this field beyond the instruction’s address
          size are undefined.
      
      Failure to sign extend the displacement results in KVM incorrectly
      treating a negative displacement as a large positive displacement when
      the address size of the VMX instruction is smaller than KVM's native
      size, e.g. a 32-bit address size on a 64-bit KVM.
      
      The very original decoding, added by commit 064aea77 ("KVM: nVMX:
      Decoding memory operands of VMX instructions"), sort of modeled sign
      extension by truncating the final virtual/linear address for a 32-bit
      address size.  I.e. it messed up the effective address but made it work
      by adjusting the final address.
      
      When segmentation checks were added, the truncation logic was kept
      as-is and no sign extension logic was introduced.  In other words, it
      kept calculating the wrong effective address while mostly generating
      the correct virtual/linear address.  As the effective address is what's
      used in the segment limit checks, this results in KVM incorrectly
      injecting #GP/#SS faults due to non-existent segment violations when
      a nested VMM uses negative displacements with an address size smaller
      than KVM's native address size.
      
      Using the -0x28(%ebp) example, an EBP value of 0x1000 will result in
      KVM using 0x100000fd8 as the effective address when checking for a
      segment limit violation.  This causes a 100% failure rate when running
      a 32-bit KVM build as L1 on top of a 64-bit KVM L0.
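
      A sketch of the sign-extension step, which turns the reported
      0xffffffd8 back into -0x28 before it is added to EBP; the variable
      names are illustrative, sign_extend64() is an existing kernel helper:

          if (addr_size == 1)             /* 32-bit address size */
                  off = (gva_t)sign_extend64(exit_qualification, 31);
          else if (addr_size == 0)        /* 16-bit address size */
                  off = (gva_t)sign_extend64(exit_qualification, 15);
          else                            /* 64-bit address size */
                  off = exit_qualification;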
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9ce0ffeb
    • KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux · c235af5a
      Committed by Sean Christopherson
      commit ddfd1730fd829743e41213e32ccc8b4aa6dc8325 upstream.
      
      When installing new memslots, KVM sets bit 0 of the generation number to
      indicate that an update is in-progress.  Until the update is complete,
      there are no guarantees as to whether a vCPU will see the old or the new
      memslots.  Explicitly prevent caching MMIO accesses so as to avoid using
      an access cached from the old memslots after the new memslots have been
      installed.
      
      Note that it is unclear whether or not disabling caching during the
      update window is strictly necessary as there is no definitive
      documentation as to what ordering guarantees KVM provides with respect
      to updating memslots.  That being said, the MMIO spte code does not
      allow reusing sptes created while an update is in-progress, and the
      associated documentation explicitly states:
      
          We do not want to use an MMIO spte created with an odd generation
          number, ...  If KVM is unlucky and creates an MMIO spte while the
          low bit is 1, the next access to the spte will always be a cache miss.
      
      At the very least, disabling the per-vCPU MMIO cache during updates will
      make its behavior consistent with the MMIO spte behavior and
      documentation.
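
      A minimal sketch of the guard in the per-vCPU cache path; the field
      names are assumptions, the "bit 0" convention is the one described
      above:

          /* Refuse to (re)populate the MMIO cache while bit 0 of the
           * memslots generation marks an in-progress update. */
          u64 gen = kvm_memslots(vcpu->kvm)->generation;

          if (gen & 1)
                  return;
          vcpu->arch.mmio_gen = gen;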
      
      Fixes: 56f17dd3 ("kvm: x86: fix stale mmio cache bug")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c235af5a
    • KVM: x86/mmu: Detect MMIO generation wrap in any address space · 656e9e5d
      Committed by Sean Christopherson
      commit e1359e2beb8b0a1188abc997273acbaedc8ee791 upstream.
      
      The check to detect a wrap of the MMIO generation explicitly looks for a
      generation number of zero.  Now that unique memslots generation numbers
      are assigned to each address space, only address space 0 will get a
      generation number of exactly zero when wrapping.  E.g. when address
      space 1 goes from 0x7fffe to 0x80002, the MMIO generation number will
      wrap to 0x2.  Adjust the MMIO generation to strip the address space
      modifier prior to checking for a wrap.
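
      A sketch of the adjusted check; the mask name is an assumption:

          /* Drop the address-space bits before the wrap check, so e.g. a
           * wrap to 0x80002 in address space 1 is still caught. */
          bool wrapped = ((gen & ~MMIO_GEN_AS_ID_MASK) == 0);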
      
      Fixes: 4bd518f1 ("KVM: use separate generations for each address space")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      656e9e5d
    • KVM: Call kvm_arch_memslots_updated() before updating memslots · 23ad135a
      Committed by Sean Christopherson
      commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.
      
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
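
      A sketch of the corrected ordering in the memslot-install path, based
      only on the description above:

          /* Let arch code act (x86 zaps stale MMIO sptes here) before the
           * new generation number becomes visible to vCPUs. */
          kvm_arch_memslots_updated(kvm, gen);
          slots->generation = gen;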
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      23ad135a
  4. 06 Mar 2019 (2 commits)
  5. 27 Feb 2019 (1 commit)
  6. 20 Feb 2019 (3 commits)
  7. 13 Feb 2019 (4 commits)
    • cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM · 97a7fa90
      Committed by Josh Poimboeuf
      commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream.
      
      With the following commit:
      
        73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      
      ... the hotplug code attempted to detect when SMT was disabled by BIOS,
      in which case it reported SMT as permanently disabled.  However, that
      code broke a virt hotplug scenario, where the guest is booted with only
      primary CPU threads, and a sibling is brought online later.
      
      The problem is that there doesn't seem to be a way to reliably
      distinguish between the HW "SMT disabled by BIOS" case and the virt
      "sibling not yet brought online" case.  So the above-mentioned commit
      was a bit misguided, as it permanently disabled SMT for both cases,
      preventing future virt sibling hotplugs.
      
      Going back and reviewing the original problems that commit attempted
      to solve, when SMT was disabled in BIOS:
      
        1) /sys/devices/system/cpu/smt/control showed "on" instead of
           "notsupported"; and
      
        2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
      
      I'd propose that we instead consider #1 above to not actually be a
      problem.  Because, at least in the virt case, it's possible that SMT
      wasn't disabled by BIOS and a sibling thread could be brought online
      later.  So it makes sense to just always default the smt control to "on"
      to allow for that possibility (assuming cpuid indicates that the CPU
      supports SMT).
      
      The real problem is #2, which has a simple fix: change vmx_vm_init() to
      query the actual current SMT state -- i.e., whether any siblings are
      currently online -- instead of looking at the SMT "control" sysfs value.
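
      A minimal sketch of that check, assuming the existing L1TF_MSG_SMT
      warning string in vmx.c and the sched_smt_active() helper:

          /* In vmx_vm_init(): consult the live SMT state, not the sysfs
           * control value. */
          if (sched_smt_active())
                  pr_warn_once(L1TF_MSG_SMT);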
      
      So fix it by:
      
        a) reverting the original "fix" and its followup fix:
      
           73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
           bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation")
      
           and
      
        b) changing vmx_vm_init() to query the actual current SMT state --
           instead of the sysfs control value -- to determine whether the L1TF
           warning is needed.  This also requires the 'sched_smt_present'
           variable to be exported, instead of 'cpu_smt_control'.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      97a7fa90
    • KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221) · 236fd677
      Committed by Peter Shier
      commit ecec76885bcfe3294685dc363fd1273df0d5d65f upstream.
      
      Bugzilla: 1671904
      
      There are multiple code paths where an hrtimer may have been started to
      emulate an L1 VMX preemption timer that can result in a call to free_nested
      without an intervening L2 exit where the hrtimer is normally
      cancelled. Unconditionally cancel in free_nested to cover all cases.
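
      A one-line sketch of the fix; the field path follows the existing
      nested VMX state, and its exact placement within free_nested() is an
      assumption:

          /* Cancel the emulated preemption timer regardless of how we got
           * here, covering every path that may have started it. */
          hrtimer_cancel(&vmx->nested.preemption_timer);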
      
      Embargoed until Feb 7th 2019.
      Signed-off-by: Peter Shier <pshier@google.com>
      Reported-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reported-by: Felix Wilhelm <fwilhelm@google.com>
      Cc: stable@kernel.org
      Message-Id: <20181011184646.154065-1-pshier@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      236fd677
    • KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222) · 5a45d372
      Committed by Paolo Bonzini
      commit 353c0956a618a07ba4bbe7ad00ff29fe70e8412a upstream.
      
      Bugzilla: 1671930
      
      Emulation of certain instructions (VMXON, VMCLEAR, VMPTRLD, VMWRITE with
      memory operand, INVEPT, INVVPID) can incorrectly inject a page fault
      when passed an operand that points to an MMIO address.  The page fault
      will use uninitialized kernel stack memory as the CR2 and error code.
      
      The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR
      exit to userspace; however, it is not an easy fix, so for now just
      ensure that the error code and CR2 are zero.
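
      One way to realize the stop-gap described above, as a hedged sketch:
      zero the exception payload before the emulated access, so a bogus #PF
      carries CR2 = 0 and error code = 0 rather than stale stack bytes.

          struct x86_exception e;

          memset(&e, 0, sizeof(e));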
      
      Embargoed until Feb 7th 2019.
      Reported-by: Felix Wilhelm <fwilhelm@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a45d372
    • KVM: x86: svm: report MSR_IA32_MCG_EXT_CTL as unsupported · 84a4572b
      Committed by Vitaly Kuznetsov
      [ Upstream commit e87555e550cef4941579cd879759a7c0dee24e68 ]
      
      AMD doesn't seem to implement MSR_IA32_MCG_EXT_CTL, and the svm code in
      kvm knows nothing about it; however, this MSR is among emulated_msrs and
      is thus returned by KVM_GET_MSR_INDEX_LIST.  The consequent KVM_GET_MSRS,
      of course, fails.
      
      Report the MSR as unsupported to not confuse userspace.
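
      A sketch of the shape of the fix, filtering the MSR out of the
      emulated list via svm's ->has_emulated_msr() callback (treat the
      details as illustrative):

          static bool svm_has_emulated_msr(int index)
          {
                  switch (index) {
                  case MSR_IA32_MCG_EXT_CTL:
                          return false;
                  default:
                          break;
                  }

                  return true;
          }
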
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      84a4572b
  8. 31 Jan 2019 (4 commits)
  9. 10 Jan 2019 (1 commit)
  10. 29 Dec 2018 (3 commits)
    • KVM: Fix UAF in nested posted interrupt processing · 1972ca04
      Committed by Cfir Cohen
      commit c2dd5146e9fe1f22c77c1b011adf84eea0245806 upstream.
      
      nested_get_vmcs12_pages() processes the posted_intr address in vmcs12. It
      caches the kmap()ed page object and pointer, however, it doesn't handle
      errors correctly: it's possible to cache a valid pointer, then release
      the page and later dereference the dangling pointer.
      
      I was able to reproduce with the following steps:
      
      1. Call vmlaunch with valid posted_intr_desc_addr but an invalid
      MSR_EFER. This causes nested_get_vmcs12_pages() to cache the kmap()ed
      pi_desc_page and pi_desc. Later the invalid EFER value fails
      check_vmentry_postreqs() which fails the first vmlaunch.
      
      2. Call vmlaunch with a valid EFER but an invalid posted_intr_desc_addr
      (I set it to 2G - 0x80). The second time we call nested_get_vmcs12_pages,
      pi_desc_page is unmapped and released and pi_desc_page is set to NULL
      (the "shouldn't happen" clause). Due to the invalid
      posted_intr_desc_addr, kvm_vcpu_gpa_to_page() fails and
      nested_get_vmcs12_pages() returns. It doesn't return an error value so
      vmlaunch proceeds. Note that at this time we have a dangling pointer in
      vmx->nested.pi_desc and POSTED_INTR_DESC_ADDR in L0's vmcs.
      
      3. Issue an IPI in L2 guest code. This triggers a call to
      vmx_complete_nested_posted_interrupt() and pi_test_and_clear_on() which
      dereferences the dangling pointer.
      
      Vulnerable code requires nested and enable_apicv variables to be set to
      true. The host CPU must also support posted interrupts.
      
      Fixes: 5e2f30b7 ("KVM: nVMX: get rid of nested_get_page()")
      Cc: stable@vger.kernel.org
      Reviewed-by: Andy Honig <ahonig@google.com>
      Signed-off-by: Cfir Cohen <cfir@google.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1972ca04
    • kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs · 229468c6
      Committed by Eduardo Habkost
      commit 0e1b869fff60c81b510c2d00602d778f8f59dd9a upstream.
      
      Some guest OSes (including Windows 10) write to MSR 0xc001102c
      in some cases (possibly while trying to work around a CPU erratum).
      Make KVM ignore reads and writes to that MSR, so the guest won't
      crash.
      
      The MSR is documented as "Execution Unit Configuration (EX_CFG)",
      at AMD's "BIOS and Kernel Developer's Guide (BKDG) for AMD Family
      15h Models 00h-0Fh Processors".
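
      A sketch of how an ignored MSR typically looks in the common MSR
      handlers; the constant name for MSR 0xc001102c is an assumption:

          /* kvm_set_msr_common(): silently accept and discard writes. */
          case MSR_F15H_EX_CFG:
                  break;

          /* kvm_get_msr_common(): read back as zero. */
          case MSR_F15H_EX_CFG:
                  msr_info->data = 0;
                  break;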
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      229468c6
    • KVM: X86: Fix NULL deref in vcpu_scan_ioapic · 76281d12
      Committed by Wanpeng Li
      commit dcbd3e49c2f0b2c2d8a321507ff8f3de4af76d7c upstream.
      
      Reported by syzkaller:
      
          CPU: 1 PID: 5962 Comm: syz-executor118 Not tainted 4.20.0-rc6+ #374
          Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
          RIP: 0010:kvm_apic_hw_enabled arch/x86/kvm/lapic.h:169 [inline]
          RIP: 0010:vcpu_scan_ioapic arch/x86/kvm/x86.c:7449 [inline]
          RIP: 0010:vcpu_enter_guest arch/x86/kvm/x86.c:7602 [inline]
          RIP: 0010:vcpu_run arch/x86/kvm/x86.c:7874 [inline]
          RIP: 0010:kvm_arch_vcpu_ioctl_run+0x5296/0x7320 arch/x86/kvm/x86.c:8074
          Call Trace:
      	 kvm_vcpu_ioctl+0x5c8/0x1150 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2596
      	 vfs_ioctl fs/ioctl.c:46 [inline]
      	 file_ioctl fs/ioctl.c:509 [inline]
      	 do_vfs_ioctl+0x1de/0x1790 fs/ioctl.c:696
      	 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:713
      	 __do_sys_ioctl fs/ioctl.c:720 [inline]
      	 __se_sys_ioctl fs/ioctl.c:718 [inline]
      	 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
      	 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that the testcase writes hyperv synic HV_X64_MSR_SINT14 msr
      and triggers scan ioapic logic to load synic vectors into EOI exit bitmap.
      However, the irqchip is not initialized by this simple testcase, so the
      ioapic/apic objects should not be accessed.
      
      This patch fixes it by also considering whether or not apic is present.
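
      A minimal sketch of that guard, using the existing kvm_apic_present()
      helper (exact placement in vcpu_scan_ioapic() is assumed):

          if (!kvm_apic_present(vcpu))
                  return;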
      
      Reported-by: syzbot+39810e6c400efadfef71@syzkaller.appspotmail.com
      Cc: stable@vger.kernel.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      76281d12
  11. 17 Dec 2018 (3 commits)
    • x86/kvm/vmx: fix old-style function declaration · bf1b47f3
      Committed by Yi Wang
      [ Upstream commit 1e4329ee2c52692ea42cc677fb2133519718b34a ]
      
      An inline keyword that is not at the beginning of a function
      declaration triggers the following build warnings, so let's fix it:
      
      arch/x86/kvm/vmx.c:1309:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      arch/x86/kvm/vmx.c:5947:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      arch/x86/kvm/vmx.c:5985:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
      arch/x86/kvm/vmx.c:6023:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
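
      A minimal illustration of the warning and the fix; vmx_example() is a
      made-up name, not code from the patch:

          /* Triggers -Wold-style-declaration ('inline' after the type): */
          static void inline vmx_example(void) { }

          /* Fixed: 'inline' precedes the return type. */
          static inline void vmx_example(void) { }
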
      Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      bf1b47f3
    • KVM: x86: fix empty-body warnings · d6b1692d
      Committed by Yi Wang
      [ Upstream commit 354cb410d87314e2eda344feea84809e4261570a ]
      
      We get the following warnings about empty statements when building
      with 'W=1':
      
      arch/x86/kvm/lapic.c:632:53: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1907:42: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1936:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1975:44: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      
      Rework the debug helper macro to get rid of these warnings.
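
      A typical shape of such a fix, shown as an illustrative sketch: make
      the disabled debug macro expand to a complete statement so that
      "if (x) apic_debug(...);" no longer has an empty body.

          #define apic_debug(fmt, ...) do { } while (0)
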
      Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d6b1692d
    • KVM: VMX: Update shared MSRs to be saved/restored on MSR_EFER.LMA changes · 3c7670d5
      Committed by Liran Alon
      [ Upstream commit f48b4711dd6e1cf282f9dfd159c14a305909c97c ]
      
      When guest transitions from/to long-mode by modifying MSR_EFER.LMA,
      the list of shared MSRs to be saved/restored on guest<->host
      transitions is updated (See vmx_set_efer() call to setup_msrs()).
      
      On every entry to guest, vcpu_enter_guest() calls
      vmx_prepare_switch_to_guest(). This function should also take care
      of setting the shared MSRs to be saved/restored. However, the
      function does nothing in case we are already running with loaded
      guest state (vmx->loaded_cpu_state != NULL).
      
      This means that even when guest modifies MSR_EFER.LMA which results
      in updating the list of shared MSRs, it isn't being taken into account
      by vmx_prepare_switch_to_guest() because it happens while we are
      running with loaded guest state.
      
      To fix the above-mentioned issue, add a flag to mark that the list of
      shared MSRs has been updated and modify vmx_prepare_switch_to_guest()
      to set shared MSRs when running with host state *OR* list of shared
      MSRs has been updated.
      
      Note that this issue was mistakenly introduced by commit
      678e315e ("KVM: vmx: add dedicated utility to access guest's
      kernel_gs_base") because previously vmx_set_efer() always called
      vmx_load_host_state(), which caused vmx_prepare_switch_to_guest() to
      set the shared MSRs.
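
      A sketch of the flag-based approach; the flag name is an assumption:

          /* In setup_msrs(): remember that the shared-MSR list changed. */
          vmx->guest_msrs_dirty = true;

          /* In vmx_prepare_switch_to_guest(): refresh the MSRs when running
           * with host state OR when the list was updated in the meantime. */
          if (vmx->loaded_cpu_state && !vmx->guest_msrs_dirty)
                  return;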
      
      Fixes: 678e315e ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base")
      Reported-by: Eyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3c7670d5
  12. 08 Dec 2018 (1 commit)
  13. 06 Dec 2018 (6 commits)
    • KVM: VMX: re-add ple_gap module parameter · bbe23c4b
      Committed by Luiz Capitulino
      commit a87c99e61236ba8ca962ce97a19fab5ebd588d35 upstream.
      
      Apparently, the ple_gap parameter was accidentally removed by
      commit c8e88717.  Add it back.
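
      A one-line sketch of re-exposing the knob; the permission bits are an
      assumption:

          module_param(ple_gap, uint, 0444);
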
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: c8e88717
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bbe23c4b
    • KVM: X86: Fix scan ioapic use-before-initialization · 61c42d65
      Committed by Wanpeng Li
      commit e97f852fd4561e77721bb9a4e0ea9d98305b1e93 upstream.
      
      Reported by syzkaller:
      
       BUG: unable to handle kernel NULL pointer dereference at 00000000000001c8
       PGD 80000003ec4da067 P4D 80000003ec4da067 PUD 3f7bfa067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 7 PID: 5059 Comm: debug Tainted: G           OE     4.19.0-rc5 #16
       RIP: 0010:__lock_acquire+0x1a6/0x1990
       Call Trace:
        lock_acquire+0xdb/0x210
        _raw_spin_lock+0x38/0x70
        kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
        vcpu_enter_guest+0x167e/0x1910 [kvm]
        kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
        kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
        do_vfs_ioctl+0xa5/0x690
        ksys_ioctl+0x6d/0x80
        __x64_sys_ioctl+0x1a/0x20
        do_syscall_64+0x83/0x6e0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that the testcase writes hyperv synic HV_X64_MSR_SINT6 msr
      and triggers scan ioapic logic to load synic vectors into EOI exit bitmap.
      However, the irqchip is not initialized by this simple testcase, so the
      ioapic/apic objects should not be accessed.
      This can be triggered by the following program:
      
          #define _GNU_SOURCE
      
          #include <endian.h>
          #include <stdint.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <sys/types.h>
          #include <unistd.h>
      
          uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};
      
          int main(void)
          {
          	syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
          	long res = 0;
          	memcpy((void*)0x20000040, "/dev/kvm", 9);
          	res = syscall(__NR_openat, 0xffffffffffffff9c, 0x20000040, 0, 0);
          	if (res != -1)
          		r[0] = res;
          	res = syscall(__NR_ioctl, r[0], 0xae01, 0);
          	if (res != -1)
          		r[1] = res;
          	res = syscall(__NR_ioctl, r[1], 0xae41, 0);
          	if (res != -1)
          		r[2] = res;
          	memcpy(
          			(void*)0x20000080,
          			"\x01\x00\x00\x00\x00\x5b\x61\xbb\x96\x00\x00\x40\x00\x00\x00\x00\x01\x00"
          			"\x08\x00\x00\x00\x00\x00\x0b\x77\xd1\x78\x4d\xd8\x3a\xed\xb1\x5c\x2e\x43"
          			"\xaa\x43\x39\xd6\xff\xf5\xf0\xa8\x98\xf2\x3e\x37\x29\x89\xde\x88\xc6\x33"
          			"\xfc\x2a\xdb\xb7\xe1\x4c\xac\x28\x61\x7b\x9c\xa9\xbc\x0d\xa0\x63\xfe\xfe"
          			"\xe8\x75\xde\xdd\x19\x38\xdc\x34\xf5\xec\x05\xfd\xeb\x5d\xed\x2e\xaf\x22"
          			"\xfa\xab\xb7\xe4\x42\x67\xd0\xaf\x06\x1c\x6a\x35\x67\x10\x55\xcb",
          			106);
          	syscall(__NR_ioctl, r[2], 0x4008ae89, 0x20000080);
          	syscall(__NR_ioctl, r[2], 0xae80, 0);
          	return 0;
          }
      
      This patch fixes it by bailing out of the ioapic scan if the ioapic is
      not initialized in the kernel.
      Reported-by: Wei Wu <ww9210@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wei Wu <ww9210@gmail.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      61c42d65
    • KVM: LAPIC: Fix pv ipis use-before-initialization · ffb01e73
      Committed by Wanpeng Li
      commit 38ab012f109caf10f471db1adf284e620dd8d701 upstream.
      
      Reported by syzkaller:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
       PGD 800000040410c067 P4D 800000040410c067 PUD 40410d067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 3 PID: 2567 Comm: poc Tainted: G           OE     4.19.0-rc5 #16
       RIP: 0010:kvm_pv_send_ipi+0x94/0x350 [kvm]
       Call Trace:
        kvm_emulate_hypercall+0x3cc/0x700 [kvm]
        handle_vmcall+0xe/0x10 [kvm_intel]
        vmx_handle_exit+0xc1/0x11b0 [kvm_intel]
        vcpu_enter_guest+0x9fb/0x1910 [kvm]
        kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
        kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
        do_vfs_ioctl+0xa5/0x690
        ksys_ioctl+0x6d/0x80
        __x64_sys_ioctl+0x1a/0x20
        do_syscall_64+0x83/0x6e0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that the apic map has not yet been initialized; the
      testcase triggers the pv_send_ipi interface via vmcall, which results
      in kvm->arch.apic_map being dereferenced.  This patch fixes it by
      checking whether or not the apic map is NULL and bailing out
      immediately if that is the case.
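
      A sketch of the guard in kvm_pv_send_ipi(); the error value is
      illustrative:

          rcu_read_lock();
          map = rcu_dereference(kvm->arch.apic_map);

          if (unlikely(!map)) {
                  rcu_read_unlock();
                  return -EOPNOTSUPP;
          }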
      
      Fixes: 4180bf1b (KVM: X86: Implement "send IPI" hypercall)
      Reported-by: Wei Wu <ww9210@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wei Wu <ww9210@gmail.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ffb01e73
    • KVM: x86: Fix kernel info-leak in KVM_HC_CLOCK_PAIRING hypercall · 6d772df4
      Committed by Liran Alon
      commit bcbfbd8ec21096027f1ee13ce6c185e8175166f6 upstream.
      
      kvm_pv_clock_pairing() allocates a local variable
      "struct kvm_clock_pairing clock_pairing" on the stack and initializes
      all of its fields except the padding (clock_pairing.pad[]).

      Because the clock_pairing variable is written in its entirety
      (including padding) to guest memory, failing to initialize the struct
      padding results in a kernel info-leak.

      Fix the issue by making sure to also initialize the padding with zeroes.
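
      One way to make the copy deterministic, shown as a sketch: zero the
      whole struct, padding included, before filling in the real fields.

          struct kvm_clock_pairing clock_pairing;

          memset(&clock_pairing, 0, sizeof(clock_pairing));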
      
      Fixes: 55dd00a7 ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall")
      Reported-by: syzbot+a8ef68d71211ba264f56@syzkaller.appspotmail.com
      Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6d772df4
    • KVM: nVMX/nSVM: Fix bug which sets vcpu->arch.tsc_offset to L1 tsc_offset · 76c8476c
      Committed by Leonid Shatz
      commit 326e742533bf0a23f0127d8ea62fb558ba665f08 upstream.
      
      Since commit e79f245d ("X86/KVM: Properly update 'tsc_offset' to
      represent the running guest"), the meaning of vcpu->arch.tsc_offset was
      changed to always reflect the tsc_offset value set on the active VMCS,
      regardless of whether the vCPU is currently running L1 or L2.
      
      However, the above-mentioned commit failed to also change
      kvm_vcpu_write_tsc_offset() to set vcpu->arch.tsc_offset correctly.
      This is because vmx_write_tsc_offset() may set the tsc_offset value in
      the active VMCS to the given offset parameter *plus vmcs12->tsc_offset*,
      whereas kvm_vcpu_write_tsc_offset() just sets vcpu->arch.tsc_offset to
      the given offset parameter, without taking into account the possible
      addition of vmcs12->tsc_offset.  (The same is true for the SVM case.)
      
      Fix this issue by changing kvm_x86_ops->write_tsc_offset() to return
      the tsc_offset actually set in the active VMCS, and modify
      kvm_vcpu_write_tsc_offset() to store the returned value in
      vcpu->arch.tsc_offset.
      In addition, rename write_tsc_offset() callback to write_l1_tsc_offset()
      to make it clear that it is meant to set L1 TSC offset.
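
      A sketch of the reworked caller, based on the description above: cache
      what the vendor callback actually programmed, not the raw argument.

          static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
          {
                  vcpu->arch.tsc_offset =
                          kvm_x86_ops->write_l1_tsc_offset(vcpu, offset);
          }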
      
      Fixes: e79f245d ("X86/KVM: Properly update 'tsc_offset' to represent the running guest")
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Leonid Shatz <leonid.shatz@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      76c8476c
    • kvm: svm: Ensure an IBPB on all affected CPUs when freeing a vmcb · b8b0c871
      Committed by Jim Mattson
      commit fd65d3142f734bc4376053c8d75670041903134d upstream.
      
      Previously, we only called indirect_branch_prediction_barrier on the
      logical CPU that freed a vmcb. This function should be called on all
      logical CPUs that last loaded the vmcb in question.
      
      Fixes: 15d45071 ("KVM/x86: Add IBPB support")
      Reported-by: Neel Natu <neelnatu@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b8b0c871