1. 05 Aug 2019 (1 commit)
    • KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Committed by Wanpeng Li
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five-year-old bug was exposed. Running the ebizzy benchmark in three 80-vCPU
      VMs on one 80-pCPU Skylake server produced a lot of rcu_sched stall warnings
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true before finish_swait() is called in
      kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
      by the kvm_vcpu_on_spin() loop. This greatly increases the probability
      that kvm_arch_vcpu_runnable(vcpu) is checked and found true; when APICv
      is enabled, the yield-candidate vCPU's VMCS RVI field then leaks (via
      vmx_sync_pir_to_irr()) into the current VMCS of the vCPU that is
      spinning on a taken lock.
      
      This patch fixes it by conservatively checking only a subset of events.
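
      A minimal sketch of the conservative check, in the spirit of the fix
      (kvm_arch_dy_runnable() is the per-arch hook the patch introduces;
      treat the details as illustrative rather than the exact upstream diff):

        /*
         * Unlike the full kvm_arch_vcpu_runnable(), which on VMX may read
         * RVI from the loaded VMCS via vmx_sync_pir_to_irr(), check only
         * events that can be evaluated without touching any VMCS state.
         */
        static bool vcpu_dy_runnable(struct kvm_vcpu *vcpu)
        {
                if (kvm_arch_dy_runnable(vcpu))
                        return true;

        #ifdef CONFIG_KVM_ASYNC_PF
                if (!list_empty_careful(&vcpu->async_pf.done))
                        return true;
        #endif

                return false;
        }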
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 22 Jul 2019 (2 commits)
    • KVM: X86: Dynamically allocate user_fpu · d9a710e5
      Committed by Wanpeng Li
      After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
      for user), struct kvm_vcpu is 19456 bytes on my server, while
      PAGE_ALLOC_COSTLY_ORDER (3) is the order at which allocations are deemed
      costly to service. In a serverless scenario, one host can service
      hundreds or thousands of firecracker/kata-container instances; however,
      new instances fail to launch once memory becomes too fragmented to
      allocate the kvm_vcpu struct on the host. This was observed in some
      cloud providers' production environments.
      
      This patch dynamically allocates user_fpu; kvm_vcpu is now 15168 bytes
      on my Skylake server.
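
      A sketch of the change, assuming the upstream x86_fpu_cache slab
      (the error label is hypothetical):

        /* struct kvm_vcpu_arch: the inline FPU state becomes a pointer,
         * shrinking the kvm_vcpu slab object below the costly order. */
        struct fpu *user_fpu;                /* was: struct fpu user_fpu; */

        /* At vCPU creation time: */
        vcpu->arch.user_fpu = kmem_cache_zalloc(x86_fpu_cache,
                                                GFP_KERNEL_ACCOUNT);
        if (!vcpu->arch.user_fpu)
                goto free_vcpu;              /* hypothetical unwind label */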
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "kvm: x86: Use task structs fpu field for user" · ec269475
      Committed by Paolo Bonzini
      This reverts commit 240c35a3
      ("kvm: x86: Use task structs fpu field for user", 2018-11-06).
      The commit is broken and causes QEMU's FPU state to be destroyed
      when KVM_RUN is preempted.
      
      Fixes: 240c35a3 ("kvm: x86: Use task structs fpu field for user")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 11 Jul 2019 (1 commit)
    • KVM: x86: PMU Event Filter · 66bb8a06
      Committed by Eric Hankland
      Some events can provide a guest with information about other guests or the
      host (e.g. L3 cache stats); providing the capability to restrict access
      to a "safe" set of events would limit the potential for the PMU to be used
      in side-channel attacks. This change introduces a new VM ioctl that
      sets an event filter. If the guest attempts to program a counter for
      any blacklisted or non-whitelisted event, the kernel counter won't be
      created, so any RDPMC/RDMSR will show 0 instances of that event.
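
      A sketch of the uapi and its use, based on this change (the structure
      was extended in later kernels, so check the current header for the
      exact layout; the two event encodings below are ordinary Intel
      eventsel/umask values chosen for illustration):

        struct kvm_pmu_event_filter {
                __u32 action;                /* KVM_PMU_EVENT_ALLOW or _DENY */
                __u32 nevents;               /* entries in events[] */
                __u64 events[];              /* event select + unit mask */
        };

        /* Userspace: allow only two "safe" events on this VM. */
        struct kvm_pmu_event_filter *f =
                malloc(sizeof(*f) + 2 * sizeof(__u64));
        f->action    = KVM_PMU_EVENT_ALLOW;
        f->nevents   = 2;
        f->events[0] = 0x003c;               /* unhalted core cycles */
        f->events[1] = 0x00c0;               /* instructions retired */
        ioctl(vm_fd, KVM_SET_PMU_EVENT_FILTER, f);
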
      Signed-off-by: Eric Hankland <ehankland@google.com>
      [Lots of changes. All remaining bugs are probably mine. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 19 Jun 2019 (1 commit)
  5. 18 Jun 2019 (3 commits)
  6. 05 Jun 2019 (3 commits)
  7. 01 May 2019 (4 commits)
  8. 16 Apr 2019 (2 commits)
    • KVM: x86: Load SMRAM in a single shot when leaving SMM · ed19321f
      Committed by Sean Christopherson
      RSM emulation is currently broken on VMX when the interrupted guest has
      CR4.VMXE=1.  Rather than dance around the issue of HF_SMM_MASK being set
      when loading SMSTATE into architectural state, ideally RSM emulation
      itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM
      architectural state.
      
      Ostensibly, the only motivation for having HF_SMM_MASK set throughout
      the loading of state from the SMRAM save state area is so that the
      memory accesses from GET_SMSTATE() are tagged with role.smm.  Load
      all of the SMRAM save state area from guest memory at the beginning of
      RSM emulation, and load state from the buffer instead of reading guest
      memory one-by-one.
      
      This paves the way for clearing HF_SMM_MASK prior to loading state,
      and also aligns RSM with the enter_smm() behavior, which fills a
      buffer and writes SMRAM save state in a single go.
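
      A sketch of the single-shot load at the start of RSM emulation
      (read_smram_bytes() and rsm_load_state() are hypothetical stand-ins
      for the emulator's guest-read op and the state-loading helpers):

        static int em_rsm(struct x86_emulate_ctxt *ctxt)
        {
                char buf[512];               /* SMRAM save state area */
                u64 smbase = ctxt->ops->get_smbase(ctxt);

                /* One read of the whole save state area, still tagged
                 * with role.smm... */
                if (read_smram_bytes(ctxt, smbase + 0xfe00, buf, sizeof(buf)))
                        return X86EMUL_UNHANDLEABLE;

                /* ...after which GET_SMSTATE() parses buf instead of
                 * issuing one guest-memory access per field. */
                return rsm_load_state(ctxt, buf);
        }
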
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: mmu: Fix overflow on kvm mmu page limit calculation · bc8a3d89
      Committed by Ben Gardon
      KVM bases its memory usage limits on the total number of guest pages
      across all memslots. However, those limits, and the calculations to
      produce them, use 32-bit unsigned integers. This can result in overflow
      if a VM has more guest pages than can be represented by a u32. As a
      result of this overflow, KVM can use a low limit on the number of MMU
      pages it will allocate. This makes KVM unable to map all of guest memory
      at once, prompting spurious faults.
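
      A sketch of the fix: widen the page-count arithmetic from u32 to
      unsigned long so sums over memslots cannot wrap (the constants are
      real; the code is condensed for illustration):

        unsigned long nr_mmu_pages = 0;
        struct kvm_memory_slot *memslot;

        kvm_for_each_memslot(memslot, slots)
                nr_mmu_pages += memslot->npages;     /* 64-bit sum now */

        nr_mmu_pages = nr_mmu_pages * KVM_PERMILLE_MMU_PAGES / 1000;
        nr_mmu_pages = max(nr_mmu_pages,
                           (unsigned long)KVM_MIN_ALLOC_MMU_PAGES);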
      
      Tested: Ran all kvm-unit-tests on an Intel Haswell machine. This patch
      	introduced no new failures.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 29 Mar 2019 (5 commits)
    • KVM: x86: update %rip after emulating IO · 45def77e
      Committed by Sean Christopherson
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corrupting %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling, to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks because re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to exiting to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single-step exits to userspace are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison has negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
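
      A sketch of the mechanism (pio.linear_rip is the field added by the
      fix; kvm_get_linear_rip() and kvm_is_linear_rip() are existing
      helpers):

        /* On the fast-PIO exit path, snapshot the linear RIP instead of
         * skipping the instruction immediately: */
        vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);

        /* On re-entry, skip the instruction only if RIP still matches;
         * a reset in between changes RIP and cancels the completion: */
        static int complete_fast_pio_out(struct kvm_vcpu *vcpu)
        {
                vcpu->arch.pio.count = 0;

                if (unlikely(!kvm_is_linear_rip(vcpu,
                                                vcpu->arch.pio.linear_rip)))
                        return 1;

                return kvm_skip_emulated_instruction(vcpu);
        }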
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcement suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 0cf9135b
      Committed by Sean Christopherson
      The CPUID flag ARCH_CAPABILITIES is unconditionally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support, under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
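
      A sketch of the read side in common code (a matching case exists on
      the write side; details condensed):

        /* In kvm_get_msr_common(), shared by VMX and SVM: */
        switch (msr_info->index) {
        case MSR_IA32_ARCH_CAPABILITIES:
                if (!msr_info->host_initiated &&
                    !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
                        return 1;
                msr_info->data = vcpu->arch.arch_capabilities;
                break;
        }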
      
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region() · 4d66623c
      Committed by Wei Yang
      * nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is
        non-zero.

      * nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages()
        never returns zero.

      Based on these two observations, we can merge the two *if* clauses and use
      the return value from kvm_mmu_calculate_mmu_pages() directly. This
      simplifies the code and also eliminates the possibility of a reader
      believing nr_mmu_pages could be zero.
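
      A before/after sketch of the change in kvm_arch_commit_memory_region():

        /* Before: two branches, with a separate nr_mmu_pages variable. */
        if (!kvm->arch.n_requested_mmu_pages)
                nr_mmu_pages = kvm_mmu_calculate_mmu_pages(kvm);
        if (nr_mmu_pages)
                kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);

        /* After: the second check is redundant, so fold the two together. */
        if (!kvm->arch.n_requested_mmu_pages)
                kvm_mmu_change_mmu_pages(kvm,
                                         kvm_mmu_calculate_mmu_pages(kvm));
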
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation) · 05d5a486
      Committed by Brijesh Singh
      Errata #1096:

      On a nested data page fault when CR.SMAP=1 and the guest data read
      generates a SMAP violation, the GuestInstrBytes field of the VMCB on a
      VMEXIT will incorrectly return 0h instead of the correct guest
      instruction bytes.
      
      Recommended workaround:

      To determine what instruction the guest was executing, the hypervisor
      will have to decode the instruction at the instruction pointer.
      
      The recommended workaround cannot be implemented for SEV guests because
      guest memory is encrypted with a guest-specific key, so the instruction
      decoder is unable to decode the instruction bytes. If we hit this
      erratum in an SEV guest, log a message and request a guest shutdown.
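
      A heavily simplified sketch of the workaround hook (the real check
      also considers the CPL and the SMAP/SMEP configuration; names follow
      the upstream change but treat this as illustrative):

        static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
        {
                if (!sev_guest(vcpu->kvm))
                        return true;   /* decode the insn at the guest RIP */

                /* SEV memory cannot be decoded: log and shut down. */
                pr_err_ratelimited("KVM: SEV guest hit errata 1096\n");
                kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
                return false;
        }
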
      Reported-by: Venkatesh Srinivas <venkateshs@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size' · 47c42e6b
      Committed by Sean Christopherson
      The cr4_pae flag is a bit of a misnomer; its purpose is really to track
      whether the guest PTE that is being shadowed is a 4-byte entry or an
      8-byte entry.  Prior to supporting nested EPT, the size of the gpte was
      reflected purely by CR4.PAE.  KVM fudged things a bit for direct sptes,
      but it was mostly harmless since the size of the gpte never mattered.
      Now that a spte may be tracking an indirect EPT entry, relying on
      CR4.PAE is wrong, and the flag is ill-named.
      
      For direct shadow pages, force the gpte_size to '1' as they are always
      8-byte entries; EPT entries can only be 8 bytes, and KVM always uses
      8-byte entries for NPT and its identity map (when running with EPT but
      not unrestricted guest).

      Likewise, nested EPT entries are always 8 bytes.  Nested EPT presents a
      unique scenario, as the size of the entries is not dictated by CR4.PAE,
      but neither is the shadow page a direct map.  To handle this scenario,
      set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination,
      to denote a nested EPT shadow page.  Use the information to avoid
      incorrectly zapping an unsync'd indirect page in __kvm_sync_page().
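
      A sketch of the role encoding described above (bit names follow the
      commit title; treat the exact field name and width as illustrative):

        /* Direct shadow pages: gptes are always 8 bytes. */
        role.gpte_size = 1;                  /* '1' means 8-byte gptes here */

        /* Nested EPT shadow pages: mark them with an otherwise impossible
         * combination so __kvm_sync_page() can tell them apart from pages
         * whose gpte size is derived from CR4.PAE. */
        role.cr0_wp = true;
        role.smap_andnot_wp = true;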
      
      Providing a consistent and accurate gpte_size fixes a bug reported by
      Vitaly where fast_cr3_switch() always fails when switching from L2 to
      L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages,
      whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE.
      
      Fixes: 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. 23 Feb 2019 (2 commits)
    • KVM: MMU: record maximum physical address width in kvm_mmu_extended_role · de3ccd26
      Committed by Yu Zhang
      Previously, commit 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow
      MMU reconfiguration is needed") offered an optimization to avoid
      unnecessary reconfiguration. Yet one scenario is broken: when CPUID
      changes the VM's maximum physical address width, reconfiguration
      is needed to reset the reserved bits.  Also, the TDP may need to
      reset its shadow_root_level when this value is changed.
      
      To fix this, a new field, maxphyaddr, is introduced in the extended
      role structure to keep track of the configured guest physical address
      width.
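
      A sketch of the addition (field width per the upstream change;
      surrounding bits elided):

        union kvm_mmu_extended_role {
                u32 word;
                struct {
                        /* ... existing cr0/cr4/efer tracking bits ... */
                        unsigned int maxphyaddr:6;   /* new: guest MAXPHYADDR */
                };
        };

        /* Filled in when the role is computed, so a CPUID update that
         * changes the address width no longer matches the cached role and
         * forces MMU reconfiguration: */
        ext.maxphyaddr = cpuid_maxphyaddr(vcpu);
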
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/mmu: fix switch between root and guest MMUs · ad7dc69a
      Committed by Vitaly Kuznetsov
      Commit 14c07ad8 ("x86/kvm/mmu: introduce guest_mmu") brought one subtle
      change: previously, when switching back from L2 to L1, we reset
      MMU hooks (like mmu->get_cr3()) in kvm_init_mmu(), called from
      nested_vmx_load_cr3(); now we do that in nested_ept_uninit_mmu_context()
      when we re-target the vcpu->arch.mmu pointer.
      The change itself looks logical: if nested_ept_init_mmu_context() changes
      something, then nested_ept_uninit_mmu_context() restores it. There is,
      however, one catch: the following call chain:
      
       nested_vmx_load_cr3()
        kvm_mmu_new_cr3()
          __kvm_mmu_new_cr3()
            fast_cr3_switch()
              cached_root_available()
      
      now happens with MMU hooks pointing to the new MMU (the root MMU in our
      case), while previously it happened with the old one.
      cached_root_available() tries to stash the current root, but it is
      incorrect to read the current CR3 with mmu->get_cr3(); we need to use
      old_mmu->get_cr3(), which, when switching from L2 to L1, is guest_mmu.
      (In the shadow page tables case this is a non-issue because we don't
      switch MMUs.)
      
      While we could have tried to guess that we're switching between MMUs and
      call the right ->get_cr3() from cached_root_available(), this seems
      overly complicated. Instead, just stash the corresponding CR3 when
      setting root_hpa and make cached_root_available() use the stashed value.
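
      A sketch of the stash (root_cr3 is the field the fix adds to struct
      kvm_mmu; code condensed):

        /* When a root is committed for new_cr3: */
        vcpu->arch.mmu->root_cr3 = new_cr3;

        /* In cached_root_available(), compare against the stashed value
         * instead of calling the possibly re-targeted hook: */
        struct kvm_mmu_root_info root = {
                .cr3 = mmu->root_cr3,        /* not mmu->get_cr3(vcpu) */
                .hpa = mmu->root_hpa,
        };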
      
      Fixes: 14c07ad8 ("x86/kvm/mmu: introduce guest_mmu")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. 21 Feb 2019 (6 commits)
  12. 21 Dec 2018 (4 commits)
    • KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup · e8143499
      Committed by Sean Christopherson
      ____kvm_handle_fault_on_reboot() provides a generic exception fixup
      handler that is used to cleanly handle faults on VMX/SVM instructions
      during reboot (or at least try to).  If there isn't a reboot in
      progress, ____kvm_handle_fault_on_reboot() treats any exception as
      fatal to KVM and invokes kvm_spurious_fault(), which in turn generates
      a BUG() to get a stack trace and die.
      
      When it was originally added by commit 4ecac3fd ("KVM: Handle
      virtualization instruction #UD faults during reboot"), the "call" to
      kvm_spurious_fault() was handcoded as PUSH+JMP, where the PUSH'd value
      is the RIP of the faulting instruction.
      
      The PUSH+JMP trickery is necessary because the exception fixup handler
      code lies outside of its associated function, e.g. right after the
      function.  An actual CALL from the .fixup code would show a slightly
      bogus stack trace, e.g. an extra "random" function would be inserted
      into the trace, as the return RIP on the stack would point to no known
      function (and the unwinder will likely try to guess who owns the RIP).
      
      Unfortunately, the JMP was replaced with a CALL when the macro was
      reworked to not spin indefinitely during reboot (commit b7c4145b
      "KVM: Don't spin on virt instruction faults during reboot").  This
      causes the aforementioned behavior where a bogus function is inserted
      into the stack trace, e.g. my builds like to blame free_kvm_area().
      
      Revert the CALL back to a JMP.  The changelog for commit b7c4145b
      ("KVM: Don't spin on virt instruction faults during reboot") contains
      nothing that indicates the switch to CALL was deliberate.  This is
      backed up by the fact that the PUSH <insn RIP> was left intact.
      
      Note that an alternative to the PUSH+JMP magic would be to JMP back
      to the "real" code and CALL from there, but that would require adding
      a JMP in the non-faulting path to avoid calling kvm_spurious_fault()
      and would add no value, i.e. the stack trace would be the same.
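
      For reference, the fixup tail after the revert looks roughly like this
      (simplified from the upstream macro): PUSH the faulting instruction's
      RIP, then JMP rather than CALL, so the unwinder sees
      kvm_spurious_fault() invoked straight from the faulting instruction:

        #define ____kvm_handle_fault_on_reboot(insn, cleanup_insn)      \
                "666: " insn "\n\t"                                     \
                "668: \n\t"                                             \
                ".pushsection .fixup, \"ax\" \n"                        \
                "667: \n\t"                                             \
                cleanup_insn "\n\t"                                     \
                "cmpb $0, kvm_rebooting \n\t"                           \
                "jne 668b \n\t"                                         \
                __ASM_SIZE(push) " $666b \n\t"                          \
                "jmp kvm_spurious_fault \n\t"                           \
                ".popsection \n\t"                                      \
                _ASM_EXTABLE(666b, 667b)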
      
      Using CALL:
      
      ------------[ cut here ]------------
      kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
      invalid opcode: 0000 [#1] SMP
      CPU: 4 PID: 1057 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
      Code: <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
      RSP: 0018:ffffc900004bbcc8 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888273fd8000 R08: 00000000000003e8 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000371fb0
      R13: 0000000000000000 R14: 000000026d763cf4 R15: ffff888273fd8000
      FS:  00007f3d69691700(0000) GS:ffff888277800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055f89bc56fe0 CR3: 0000000271a5a001 CR4: 0000000000362ee0
      Call Trace:
       free_kvm_area+0x1044/0x43ea [kvm_intel]
       ? vmx_vcpu_run+0x156/0x630 [kvm_intel]
       ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? __set_task_blocked+0x38/0x90
       ? __set_current_blocked+0x50/0x60
       ? __fpu__restore_sig+0x97/0x490
       ? do_vfs_ioctl+0xa1/0x620
       ? __x64_sys_futex+0x89/0x180
       ? ksys_ioctl+0x66/0x70
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x4f/0x100
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
      ---[ end trace 9775b14b123b1713 ]---
      
      Using JMP:
      
      ------------[ cut here ]------------
      kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
      invalid opcode: 0000 [#1] SMP
      CPU: 6 PID: 1067 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
      Code: <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
      RSP: 0018:ffffc90000497cd0 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff88827058bd40 R08: 00000000000003e8 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000369fb0
      R13: 0000000000000000 R14: 00000003c8fc6642 R15: ffff88827058bd40
      FS:  00007f3d7219e700(0000) GS:ffff888277900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3d64001000 CR3: 0000000271c6b004 CR4: 0000000000362ee0
      Call Trace:
       vmx_vcpu_run+0x156/0x630 [kvm_intel]
       ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? __set_task_blocked+0x38/0x90
       ? __set_current_blocked+0x50/0x60
       ? __fpu__restore_sig+0x97/0x490
       ? do_vfs_ioctl+0xa1/0x620
       ? __x64_sys_futex+0x89/0x180
       ? ksys_ioctl+0x66/0x70
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x4f/0x100
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
      ---[ end trace f9daedb85ab3ddba ]---
      
      Fixes: b7c4145b ("KVM: Don't spin on virt instruction faults during reboot")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Make kvm_set_spte_hva() return int · 748c0e31
      Committed by Lan Tianyu
      Make kvm_set_spte_hva() return int so that the caller can check the
      return value to determine whether or not to flush the TLB.
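
      A sketch of the caller side after the change (taking the MMU-notifier
      path as the example; condensed):

        /* kvm_set_spte_hva() now reports whether a flush is needed. */
        if (kvm_set_spte_hva(kvm, address, pte))
                kvm_flush_remote_tlbs(kvm);
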
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops · a49b9635
      Committed by Lan Tianyu
      Add a flush-range callback to kvm_x86_ops so that a platform can
      register its associated function. The "kvm_tlb_range" parameter accepts
      a single range; a flush list, which contains a list of ranges, is also
      supported.
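
      A sketch of the additions, matching the definitions described above:

        struct kvm_tlb_range {
                u64 start_gfn;
                u64 pages;                   /* number of pages to flush */
        };

        /* New optional hook in kvm_x86_ops; a NULL hook means "fall back
         * to a full remote TLB flush": */
        int (*tlb_remote_flush_with_range)(struct kvm *kvm,
                                           struct kvm_tlb_range *range);
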
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add Intel Processor Trace cpuid emulation · 86f5201d
      Committed by Chao Peng
      Expose Intel Processor Trace to the guest only when PT works in
      Host-Guest mode.
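
      A sketch of the gating in the CPUID emulation path (names follow the
      upstream change; the pt_supported() hook is assumed to return true
      only when PT runs in Host-Guest mode):

        /* Feature gate, computed once per CPUID table build: */
        unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0;

        /* In the CPUID leaf switch: */
        case 0x14:                           /* Intel PT capability leaves */
                if (!f_intel_pt)
                        break;               /* leaf stays zeroed */
                /* otherwise report the host's PT sub-leaves to the guest */
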
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. 15 Dec 2018 (4 commits)
  14. 27 Nov 2018 (1 commit)
    • KVM: nVMX/nSVM: Fix bug which sets vcpu->arch.tsc_offset to L1 tsc_offset · 326e7425
      Committed by Leonid Shatz
      Since commit e79f245d ("X86/KVM: Properly update 'tsc_offset' to
      represent the running guest"), the meaning of vcpu->arch.tsc_offset
      changed to always reflect the tsc_offset value set on the active VMCS,
      regardless of whether the vCPU is currently running L1 or L2.
      
      However, the above-mentioned commit failed to also change
      kvm_vcpu_write_tsc_offset() to set vcpu->arch.tsc_offset correctly.
      This is because vmx_write_tsc_offset() may set the tsc_offset value
      in the active VMCS to the given offset parameter *plus vmcs12->tsc_offset*,
      whereas kvm_vcpu_write_tsc_offset() just sets vcpu->arch.tsc_offset
      to the given offset parameter, without taking into account the possible
      addition of vmcs12->tsc_offset. (The same is true for the SVM case.)
      
      Fix this issue by changing kvm_x86_ops->write_tsc_offset() to return
      the tsc_offset actually set in the active VMCS, and modify
      kvm_vcpu_write_tsc_offset() to store the returned value in
      vcpu->arch.tsc_offset.
      In addition, rename the write_tsc_offset() callback to
      write_l1_tsc_offset() to make it clear that it is meant to set the L1
      TSC offset.
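
      A sketch of the reworked interface, per the description above:

        /* kvm_x86_ops: returns the offset actually written to the active
         * VMCS/VMCB, which while running L2 includes vmcs12->tsc_offset. */
        u64 (*write_l1_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);

        static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu,
                                              u64 offset)
        {
                vcpu->arch.tsc_offset =
                        kvm_x86_ops->write_l1_tsc_offset(vcpu, offset);
        }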
      
      Fixes: e79f245d ("X86/KVM: Properly update 'tsc_offset' to represent the running guest")
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Leonid Shatz <leonid.shatz@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  15. 18 Oct 2018 (1 commit)
    • kvm: x86: Add exception payload fields to kvm_vcpu_events · 59073aaf
      Committed by Jim Mattson
      The per-VM capability KVM_CAP_EXCEPTION_PAYLOAD (to be introduced in a
      later commit) adds the following fields to struct kvm_vcpu_events:
      exception_has_payload, exception_payload, and exception.pending.
      
      With this capability set, all of the details of vcpu->arch.exception,
      including the payload for a pending exception, are reported to
      userspace in response to KVM_GET_VCPU_EVENTS.
      
      With this capability clear, the original ABI is preserved, and the
      exception.injected field is set for either pending or injected
      exceptions.
      
      When userspace calls KVM_SET_VCPU_EVENTS with
      KVM_CAP_EXCEPTION_PAYLOAD clear, exception.injected is no longer
      translated to exception.pending. KVM_SET_VCPU_EVENTS can now only
      establish a pending exception when KVM_CAP_EXCEPTION_PAYLOAD is set.
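
      A sketch of the new fields named in this commit (reserved bytes and
      exact offsets omitted; see the uapi header for the real layout):

        struct kvm_vcpu_events {
                struct {
                        __u8  injected;
                        __u8  nr;
                        __u8  has_error_code;
                        __u8  pending;       /* previously padding */
                        __u32 error_code;
                } exception;
                /* ... interrupt, nmi, sipi_vector, flags ... */
                __u8  exception_has_payload;
                __u64 exception_payload;     /* e.g. CR2 for #PF, DR6 for #DB */
        };
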
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>