1. 24 Jun 2021 (4 commits)
  2. 23 Jun 2021 (1 commit)
  3. 11 Jun 2021 (1 commit)
    • KVM: X86: Fix x86_emulator slab cache leak · dfdc0a71
      Wanpeng Li committed
      Commit c9b8b07c (KVM: x86: Dynamically allocate per-vCPU emulation context)
      moved the per-vCPU emulation context to dynamic allocation; however, the
      x86_emulator slab cache still exists after the VM is destroyed and the
      kvm module is unloaded, as shown below.
      
      grep x86_emulator /proc/slabinfo
      x86_emulator          36     36   2672   12    8 : tunables    0    0    0 : slabdata      3      3      0
      
      This patch fixes this slab cache leak by destroying the x86_emulator slab cache
      when the kvm module is unloaded.
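
      A minimal sketch of the shape of the fix (kvm_arch_exit() and the
      x86_emulator_cache pointer live in arch/x86/kvm/x86.c; treat this as an
      illustration rather than the exact diff):

        static struct kmem_cache *x86_emulator_cache;

        void kvm_arch_exit(void)
        {
                /* ... existing teardown ... */

                /*
                 * Without this, the "x86_emulator" slab survives module
                 * unload and lingers in /proc/slabinfo.
                 */
                kmem_cache_destroy(x86_emulator_cache);
        }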
      
      Fixes: c9b8b07c (KVM: x86: Dynamically allocate per-vCPU emulation context)
      Cc: stable@vger.kernel.org
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1623387573-5969-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 10 Jun 2021 (1 commit)
    • KVM: x86: Immediately reset the MMU context when the SMM flag is cleared · 78fcb2c9
      Sean Christopherson committed
      Immediately reset the MMU context when the vCPU's SMM flag is cleared so
      that the SMM flag in the MMU role is always synchronized with the vCPU's
      flag.  If RSM fails (which isn't correctly emulated), KVM will bail
      without calling post_leave_smm() and leave the MMU in a bad state.
      
      The bad MMU role can lead to a NULL pointer dereference when grabbing a
      shadow page's rmap for a page fault as the initial lookups for the gfn
      will happen with the vCPU's SMM flag (=0), whereas the rmap lookup will
      use the shadow page's SMM flag, which comes from the MMU (=1).  SMM has
      an entirely different set of memslots, and so the initial lookup can find
      a memslot (SMM=0) and then explode on the rmap memslot lookup (SMM=1).
      
        general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
        KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
        CPU: 1 PID: 8410 Comm: syz-executor382 Not tainted 5.13.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:__gfn_to_rmap arch/x86/kvm/mmu/mmu.c:935 [inline]
        RIP: 0010:gfn_to_rmap+0x2b0/0x4d0 arch/x86/kvm/mmu/mmu.c:947
        Code: <42> 80 3c 20 00 74 08 4c 89 ff e8 f1 79 a9 00 4c 89 fb 4d 8b 37 44
        RSP: 0018:ffffc90000ffef98 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff888015b9f414 RCX: ffff888019669c40
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
        RBP: 0000000000000001 R08: ffffffff811d9cdb R09: ffffed10065a6002
        R10: ffffed10065a6002 R11: 0000000000000000 R12: dffffc0000000000
        R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000000
        FS:  000000000124b300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000028e31000 CR4: 00000000001526e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         rmap_add arch/x86/kvm/mmu/mmu.c:965 [inline]
         mmu_set_spte+0x862/0xe60 arch/x86/kvm/mmu/mmu.c:2604
         __direct_map arch/x86/kvm/mmu/mmu.c:2862 [inline]
         direct_page_fault+0x1f74/0x2b70 arch/x86/kvm/mmu/mmu.c:3769
         kvm_mmu_do_page_fault arch/x86/kvm/mmu.h:124 [inline]
         kvm_mmu_page_fault+0x199/0x1440 arch/x86/kvm/mmu/mmu.c:5065
         vmx_handle_exit+0x26/0x160 arch/x86/kvm/vmx/vmx.c:6122
         vcpu_enter_guest+0x3bdd/0x9630 arch/x86/kvm/x86.c:9428
         vcpu_run+0x416/0xc20 arch/x86/kvm/x86.c:9494
         kvm_arch_vcpu_ioctl_run+0x4e8/0xa40 arch/x86/kvm/x86.c:9722
         kvm_vcpu_ioctl+0x70f/0xbb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3460
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:1069 [inline]
         __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:1055
         do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x440ce9
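
      A sketch of the fix's shape, assuming the then-current kvm_smm_changed()
      helper in arch/x86/kvm/x86.c (illustrative, not the exact patch):

        static void kvm_smm_changed(struct kvm_vcpu *vcpu)
        {
                if (!(vcpu->arch.hflags & HF_SMM_MASK)) {
                        /* Process a latched INIT or SMI, if any. */
                        kvm_make_request(KVM_REQ_EVENT, vcpu);

                        /*
                         * Reset the MMU context right away so the MMU
                         * role's SMM bit can never go stale relative to
                         * the vCPU, even if a later step of RSM fails.
                         */
                        kvm_mmu_reset_context(vcpu);
                }
        }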
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+fb0b6a7e8713aeb0319c@syzkaller.appspotmail.com
      Fixes: 9ec19493 ("KVM: x86: clear SMM flags before loading state while leaving SMM")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 09 Jun 2021 (2 commits)
    • KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync · b53e84ee
      Lai Jiangshan committed
      When using shadow paging, unload the guest MMU when emulating a guest TLB
      flush to ensure all roots are synchronized.  From the guest's perspective,
      flushing the TLB ensures any and all modifications to its PTEs will be
      recognized by the CPU.
      
      Note, unloading the MMU is overkill, but is done to mirror KVM's existing
      handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
      can be done to more precisely synchronize roots when servicing a guest
      TLB flush.
      
      If TDP is enabled, synchronizing the MMU is unnecessary even if nested
      TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
      TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
      guest-physical mappings; for NPT, guest mappings are always tagged with
      an ASID and thus can only be invalidated via the VMCB's ASID control.
      
      This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB.
      It was only recently exposed after Linux guests stopped flushing the
      local CPU's TLB prior to flushing remote TLBs (see commit 4ce94eab,
      "x86/mm/tlb: Flush remote and local TLBs concurrently"), but is also
      visible in Windows 10 guests.
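
      The shape of the change in kvm_vcpu_flush_tlb_guest() (a sketch; the
      comments paraphrase the rationale above):

        static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
        {
                ++vcpu->stat.tlb_flush;

                if (!tdp_enabled) {
                        /*
                         * With shadow paging, a guest TLB flush must also
                         * sync KVM's shadow roots; unloading the MMU is a
                         * big hammer that guarantees the sync.
                         */
                        kvm_mmu_unload(vcpu);
                        return;
                }

                static_call(kvm_x86_tlb_flush_guest)(vcpu);
        }
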
      Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Fixes: f38a7b75 ("KVM: X86: support paravirtualized help for TLB shootdowns")
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      [sean: massaged comment and changelog]
      Message-Id: <20210531172256.2908-1-jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior · af3511ff
      Lai Jiangshan committed
      In record_steal_time(), st->preempted is read twice, so
      trace_kvm_pv_tlb_flush() can report a result inconsistent with what
      kvm_vcpu_flush_tlb_guest() acts on if the latter sees a different
      st->preempted value later.

      It is a trivial problem with little practical harm, and it can be
      avoided by resetting and reading st->preempted atomically via xchg().
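
      A sketch of the relevant lines after the fix (st and vcpu as in
      record_steal_time(); illustrative only):

        /* One atomic read-and-clear; trace and flush see the same value. */
        u8 st_preempted = xchg(&st->preempted, 0);

        trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
                               st_preempted & KVM_VCPU_FLUSH_TLB);
        if (st_preempted & KVM_VCPU_FLUSH_TLB)
                kvm_vcpu_flush_tlb_guest(vcpu);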
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      
      Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 29 May 2021 (3 commits)
    • KVM: X86: Kill off ctxt->ud · b35491e6
      Wanpeng Li committed
      ctxt->ud is consumed only by x86_decode_insn(), so we can kill it off
      by passing emulation_type to x86_decode_insn() and dropping ctxt->ud
      altogether. Tracking that info in ctxt for literally one call is silly.
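
      A sketch of the resulting interface (EMULTYPE_TRAP_UD is a real flag;
      the elided decode logic is summarized in comments):

        int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn,
                            int insn_len, int emulation_type)
        {
                /* Derived locally for this one call; ctxt->ud is gone. */
                bool ud = emulation_type & EMULTYPE_TRAP_UD;

                /*
                 * ... decode the instruction, testing the local "ud"
                 * flag wherever ctxt->ud used to be consulted ...
                 */
        }
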
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <1622160097-37633-2-git-send-email-wanpengli@tencent.com>
    • KVM: X86: Fix warning caused by stale emulation context · da6393cd
      Wanpeng Li committed
      Reported by syzkaller:
      
        WARNING: CPU: 7 PID: 10526 at linux/arch/x86/kvm//x86.c:7621 x86_emulate_instruction+0x41b/0x510 [kvm]
        RIP: 0010:x86_emulate_instruction+0x41b/0x510 [kvm]
        Call Trace:
         kvm_mmu_page_fault+0x126/0x8f0 [kvm]
         vmx_handle_exit+0x11e/0x680 [kvm_intel]
         vcpu_enter_guest+0xd95/0x1b40 [kvm]
         kvm_arch_vcpu_ioctl_run+0x377/0x6a0 [kvm]
         kvm_vcpu_ioctl+0x389/0x630 [kvm]
         __x64_sys_ioctl+0x8e/0xd0
         do_syscall_64+0x3c/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Commit 4a1e10d5 ("KVM: x86: handle hardware breakpoints during emulation")
      added a hardware-breakpoint check before instruction emulation, together
      with part of the emulation context initialization; EMULTYPE_NO_DECODE is
      not set on this path, so the emulation context will not be reused.
      Commit c8848cee ("KVM: x86: set ctxt->have_exception in x86_decode_insn()")
      triggers the warning because it catches a stale #UD left in the emulation
      context, even though the #UD did not occur during instruction decoding,
      which is what should result in EMULATION_FAILED. This patch fixes it by
      moving the second part of the emulation context initialization into
      init_emulate_ctxt(), before the hardware-breakpoint check. ctxt->ud will
      be dropped by a follow-up patch.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=134683fdd00000
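
      A sketch of the consolidated initialization, assuming the field names
      from arch/x86/kvm/x86.c (illustrative, not the exact diff):

        static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
        {
                struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;

                /* ... mode/eflags setup as before ... */

                /*
                 * These used to be (re)set only after the hardware-
                 * breakpoint check, so a #UD left over from a previous
                 * emulation could leak into this one.
                 */
                ctxt->interruptibility = 0;
                ctxt->have_exception = false;
                ctxt->exception.vector = -1;
                ctxt->perm_ok = false;
        }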
      
      Reported-by: syzbot+71271244f206d17f6441@syzkaller.appspotmail.com
      Fixes: 4a1e10d5 ("KVM: x86: handle hardware breakpoints during emulation")
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <1622160097-37633-1-git-send-email-wanpengli@tencent.com>
    • KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception · e87e46d5
      Yuan Yao committed
      kvm_get_linear_rip() handles the x86/long-mode cases well and is more
      readable, and __kvm_set_rflags() already uses the paired function
      kvm_is_linear_rip() to check the vcpu->arch.singlestep_rip value set in
      kvm_arch_vcpu_ioctl_set_guest_debug(), so switch the open-coded
      "CS.BASE + RIP" calculations in kvm_arch_vcpu_ioctl_set_guest_debug()
      and handle_exception_nmi() over to kvm_get_linear_rip().
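
      A before/after sketch of the kvm_arch_vcpu_ioctl_set_guest_debug()
      call site (illustrative):

        /* Before: open-coded linear RIP. */
        vcpu->arch.singlestep_rip = kvm_rip_read(vcpu) +
                                    get_segment_base(vcpu, VCPU_SREG_CS);

        /* After: one helper, paired with kvm_is_linear_rip(). */
        vcpu->arch.singlestep_rip = kvm_get_linear_rip(vcpu);
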
      Signed-off-by: Yuan Yao <yuan.yao@intel.com>
      Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 27 May 2021 (4 commits)
  8. 10 May 2021 (1 commit)
  9. 07 May 2021 (10 commits)
  10. 06 May 2021 (1 commit)
  11. 26 Apr 2021 (5 commits)
  12. 24 Apr 2021 (1 commit)
  13. 22 Apr 2021 (2 commits)
    • KVM: Boost vCPU candidate in user mode which is delivering interrupt · 52acd22f
      Wanpeng Li committed
      Both the lock-holder vCPU and a halted IPI receiver are candidates for
      boosting. However, the PLE handler was originally designed to deal with
      the lock-holder preemption problem, and Intel PLE occurs when the
      spinlock waiter is in kernel mode. This assumption doesn't hold for IPI
      receivers: they can be in either kernel or user mode, and a vCPU
      candidate in user mode will not be boosted even if it should respond to
      IPIs. Some benchmarks like pbzip2 and swaptions do the TLB shootdown in
      kernel mode but run in user mode most of the time. This can lead to a
      long run of PLE events, because the IPI sender keeps causing PLE events
      until the receiver is scheduled, while the receiver is never a candidate
      for a boost.
      
      This patch boosts vCPU candidates in user mode that are having an
      interrupt delivered. We observed pbzip2 speed up by 10% in a 96-vCPU VM
      in an over-subscribed scenario (the host is a 2-socket, 48-core, 96-HT
      Intel CLX box). There is no performance regression for other benchmarks
      such as Unixbench spawn (which mostly contends read/write locks in
      kernel mode) and ebizzy (which mostly contends a read/write semaphore
      and does TLB shootdowns in kernel mode).
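
      A sketch of the new condition in kvm_vcpu_on_spin() (the pending-
      interrupt helper is the arch hook added by this change; illustrative):

        /*
         * A preempted user-mode vCPU stays a boost candidate if an
         * interrupt is pending for it.
         */
        if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
            !kvm_arch_dy_has_pending_interrupt(vcpu) &&
            !kvm_arch_vcpu_in_kernel(vcpu))
                continue;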
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1618542490-14756-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Support KVM VMs sharing SEV context · 54526d1f
      Nathan Tempelman committed
      Add a capability for userspace to mirror SEV encryption context from
      one VM to another. On our side, this is intended to support a
      Migration Helper vCPU, but it can also be used generically to support
      other in-guest workloads scheduled by the host. The intention is for
      the primary guest and the mirror to have nearly identical memslots.

      The primary benefits of this are that:
      1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
      can't accidentally clobber each other.
      2) The VMs can have different memory views, which is necessary for
      post-copy migration (the migration vCPUs on the target need to read
      and write to pages when the primary guest would VMEXIT).
      
      This does not change the threat model for AMD SEV. Any memory involved
      is still owned by the primary guest and its initial state is still
      attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
      to circumvent SEV, they could achieve the same effect by simply attaching
      a vCPU to the primary VM.
      This patch deliberately leaves userspace in charge of the memslots for the
      mirror, as it already has the power to mess with them in the primary guest.
      
      This patch does not support SEV-ES (much less SNP), as it does not
      handle handing off attested VMSAs to the mirror.
      
      For additional context, we need a Migration Helper because SEV PSP
      migration is far too slow for our live migration on its own. Using
      an in-guest migrator lets us speed this up significantly.
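
      A userspace sketch of consuming the capability this change adds
      (KVM_CAP_VM_COPY_ENC_CONTEXT_FROM; the fd variable names are
      assumptions):

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Make mirror_vm_fd share the SEV context of primary_vm_fd. */
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_VM_COPY_ENC_CONTEXT_FROM,
                .args[0] = primary_vm_fd,  /* fd of the SEV-enabled source VM */
        };

        if (ioctl(mirror_vm_fd, KVM_ENABLE_CAP, &cap))
                /* handle error */;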
      Signed-off-by: Nathan Tempelman <natet@google.com>
      Message-Id: <20210408223214.2582277-1-natet@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  14. 20 Apr 2021 (4 commits)