1. 18 6月, 2021 25 次提交
  2. 11 6月, 2021 1 次提交
    • W
      KVM: X86: Fix x86_emulator slab cache leak · dfdc0a71
      Wanpeng Li 提交于
      Commit c9b8b07c (KVM: x86: Dynamically allocate per-vCPU emulation context)
      tries to allocate per-vCPU emulation context dynamically, however, the
      x86_emulator slab cache is still exiting after the kvm module is unload
      as below after destroying the VM and unloading the kvm module.
      
      grep x86_emulator /proc/slabinfo
      x86_emulator          36     36   2672   12    8 : tunables    0    0    0 : slabdata      3      3      0
      
      This patch fixes this slab cache leak by destroying the x86_emulator slab cache
      when the kvm module is unloaded.
      
      Fixes: c9b8b07c (KVM: x86: Dynamically allocate per-vCPU emulation context)
      Cc: stable@vger.kernel.org
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Message-Id: <1623387573-5969-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      dfdc0a71
  3. 10 6月, 2021 1 次提交
    • S
      KVM: x86: Immediately reset the MMU context when the SMM flag is cleared · 78fcb2c9
      Sean Christopherson 提交于
      Immediately reset the MMU context when the vCPU's SMM flag is cleared so
      that the SMM flag in the MMU role is always synchronized with the vCPU's
      flag.  If RSM fails (which isn't correctly emulated), KVM will bail
      without calling post_leave_smm() and leave the MMU in a bad state.
      
      The bad MMU role can lead to a NULL pointer dereference when grabbing a
      shadow page's rmap for a page fault as the initial lookups for the gfn
      will happen with the vCPU's SMM flag (=0), whereas the rmap lookup will
      use the shadow page's SMM flag, which comes from the MMU (=1).  SMM has
      an entirely different set of memslots, and so the initial lookup can find
      a memslot (SMM=0) and then explode on the rmap memslot lookup (SMM=1).
      
        general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
        KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
        CPU: 1 PID: 8410 Comm: syz-executor382 Not tainted 5.13.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:__gfn_to_rmap arch/x86/kvm/mmu/mmu.c:935 [inline]
        RIP: 0010:gfn_to_rmap+0x2b0/0x4d0 arch/x86/kvm/mmu/mmu.c:947
        Code: <42> 80 3c 20 00 74 08 4c 89 ff e8 f1 79 a9 00 4c 89 fb 4d 8b 37 44
        RSP: 0018:ffffc90000ffef98 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff888015b9f414 RCX: ffff888019669c40
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
        RBP: 0000000000000001 R08: ffffffff811d9cdb R09: ffffed10065a6002
        R10: ffffed10065a6002 R11: 0000000000000000 R12: dffffc0000000000
        R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000000
        FS:  000000000124b300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000028e31000 CR4: 00000000001526e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         rmap_add arch/x86/kvm/mmu/mmu.c:965 [inline]
         mmu_set_spte+0x862/0xe60 arch/x86/kvm/mmu/mmu.c:2604
         __direct_map arch/x86/kvm/mmu/mmu.c:2862 [inline]
         direct_page_fault+0x1f74/0x2b70 arch/x86/kvm/mmu/mmu.c:3769
         kvm_mmu_do_page_fault arch/x86/kvm/mmu.h:124 [inline]
         kvm_mmu_page_fault+0x199/0x1440 arch/x86/kvm/mmu/mmu.c:5065
         vmx_handle_exit+0x26/0x160 arch/x86/kvm/vmx/vmx.c:6122
         vcpu_enter_guest+0x3bdd/0x9630 arch/x86/kvm/x86.c:9428
         vcpu_run+0x416/0xc20 arch/x86/kvm/x86.c:9494
         kvm_arch_vcpu_ioctl_run+0x4e8/0xa40 arch/x86/kvm/x86.c:9722
         kvm_vcpu_ioctl+0x70f/0xbb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3460
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:1069 [inline]
         __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:1055
         do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x440ce9
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+fb0b6a7e8713aeb0319c@syzkaller.appspotmail.com
      Fixes: 9ec19493 ("KVM: x86: clear SMM flags before loading state while leaving SMM")
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-2-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      78fcb2c9
  4. 09 6月, 2021 2 次提交
    • L
      KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync · b53e84ee
      Lai Jiangshan 提交于
      When using shadow paging, unload the guest MMU when emulating a guest TLB
      flush to ensure all roots are synchronized.  From the guest's perspective,
      flushing the TLB ensures any and all modifications to its PTEs will be
      recognized by the CPU.
      
      Note, unloading the MMU is overkill, but is done to mirror KVM's existing
      handling of INVPCID(all) and ensure the bug is squashed.  Future cleanup
      can be done to more precisely synchronize roots when servicing a guest
      TLB flush.
      
      If TDP is enabled, synchronizing the MMU is unnecessary even if nested
      TDP is in play, as a "legacy" TLB flush from L1 does not invalidate L1's
      TDP mappings.  For EPT, an explicit INVEPT is required to invalidate
      guest-physical mappings; for NPT, guest mappings are always tagged with
      an ASID and thus can only be invalidated via the VMCB's ASID control.
      
      This bug has existed since the introduction of KVM_VCPU_FLUSH_TLB.
      It was only recently exposed after Linux guests stopped flushing the
      local CPU's TLB prior to flushing remote TLBs (see commit 4ce94eab,
      "x86/mm/tlb: Flush remote and local TLBs concurrently"), but is also
      visible in Windows 10 guests.
      Tested-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Fixes: f38a7b75 ("KVM: X86: support paravirtualized help for TLB shootdowns")
      Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
      [sean: massaged comment and changelog]
      Message-Id: <20210531172256.2908-1-jiangshanlai@gmail.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b53e84ee
    • L
      KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior · af3511ff
      Lai Jiangshan 提交于
      In record_steal_time(), st->preempted is read twice, and
      trace_kvm_pv_tlb_flush() might output result inconsistent if
      kvm_vcpu_flush_tlb_guest() see a different st->preempted later.
      
      It is a very trivial problem and hardly has actual harm and can be
      avoided by reseting and reading st->preempted in atomic way via xchg().
      Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
      
      Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      af3511ff
  5. 29 5月, 2021 3 次提交
    • W
      KVM: X86: Kill off ctxt->ud · b35491e6
      Wanpeng Li 提交于
      ctxt->ud is consumed only by x86_decode_insn(), we can kill it off by
      passing emulation_type to x86_decode_insn() and dropping ctxt->ud
      altogether. Tracking that info in ctxt for literally one call is silly.
      Suggested-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <1622160097-37633-2-git-send-email-wanpengli@tencent.com>
      b35491e6
    • W
      KVM: X86: Fix warning caused by stale emulation context · da6393cd
      Wanpeng Li 提交于
      Reported by syzkaller:
      
        WARNING: CPU: 7 PID: 10526 at linux/arch/x86/kvm//x86.c:7621 x86_emulate_instruction+0x41b/0x510 [kvm]
        RIP: 0010:x86_emulate_instruction+0x41b/0x510 [kvm]
        Call Trace:
         kvm_mmu_page_fault+0x126/0x8f0 [kvm]
         vmx_handle_exit+0x11e/0x680 [kvm_intel]
         vcpu_enter_guest+0xd95/0x1b40 [kvm]
         kvm_arch_vcpu_ioctl_run+0x377/0x6a0 [kvm]
         kvm_vcpu_ioctl+0x389/0x630 [kvm]
         __x64_sys_ioctl+0x8e/0xd0
         do_syscall_64+0x3c/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Commit 4a1e10d5 ("KVM: x86: handle hardware breakpoints during emulation())
      adds hardware breakpoints check before emulation the instruction and parts of
      emulation context initialization, actually we don't have the EMULTYPE_NO_DECODE flag
      here and the emulation context will not be reused. Commit c8848cee ("KVM: x86:
      set ctxt->have_exception in x86_decode_insn()) triggers the warning because it
      catches the stale emulation context has #UD, however, it is not during instruction
      decoding which should result in EMULATION_FAILED. This patch fixes it by moving
      the second part emulation context initialization into init_emulate_ctxt() and
      before hardware breakpoints check. The ctxt->ud will be dropped by a follow-up
      patch.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=134683fdd00000
      
      Reported-by: syzbot+71271244f206d17f6441@syzkaller.appspotmail.com
      Fixes: 4a1e10d5 (KVM: x86: handle hardware breakpoints during emulation)
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <1622160097-37633-1-git-send-email-wanpengli@tencent.com>
      da6393cd
    • Y
      KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception · e87e46d5
      Yuan Yao 提交于
      The kvm_get_linear_rip() handles x86/long mode cases well and has
      better readability, __kvm_set_rflags() also use the paired
      function kvm_is_linear_rip() to check the vcpu->arch.singlestep_rip
      set in kvm_arch_vcpu_ioctl_set_guest_debug(), so change the
      "CS.BASE + RIP" code in kvm_arch_vcpu_ioctl_set_guest_debug() and
      handle_exception_nmi() to this one.
      Signed-off-by: NYuan Yao <yuan.yao@intel.com>
      Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e87e46d5
  6. 27 5月, 2021 4 次提交
  7. 07 5月, 2021 4 次提交
    • T
      KVM: x86: Prevent deadlock against tk_core.seq · 3f804f6d
      Thomas Gleixner 提交于
      syzbot reported a possible deadlock in pvclock_gtod_notify():
      
      CPU 0  		  	   	    	    CPU 1
      write_seqcount_begin(&tk_core.seq);
        pvclock_gtod_notify()			    spin_lock(&pool->lock);
          queue_work(..., &pvclock_gtod_work)	    ktime_get()
           spin_lock(&pool->lock);		      do {
           						seq = read_seqcount_begin(tk_core.seq)
      						...
      				              } while (read_seqcount_retry(&tk_core.seq, seq);
      
      While this is unlikely to happen, it's possible.
      
      Delegate queue_work() to irq_work() which postpones it until the
      tk_core.seq write held region is left and interrupts are reenabled.
      
      Fixes: 16e8d74d ("KVM: x86: notifier for clocksource changes")
      Reported-by: syzbot+6beae4000559d41d80f8@syzkaller.appspotmail.com
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Message-Id: <87h7jgm1zy.ffs@nanos.tec.linutronix.de>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3f804f6d
    • T
      KVM: x86: Cancel pvclock_gtod_work on module removal · 594b27e6
      Thomas Gleixner 提交于
      Nothing prevents the following:
      
        pvclock_gtod_notify()
          queue_work(system_long_wq, &pvclock_gtod_work);
        ...
        remove_module(kvm);
        ...
        work_queue_run()
          pvclock_gtod_work()	<- UAF
      
      Ditto for any other operation on that workqueue list head which touches
      pvclock_gtod_work after module removal.
      
      Cancel the work in kvm_arch_exit() to prevent that.
      
      Fixes: 16e8d74d ("KVM: x86: notifier for clocksource changes")
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Message-Id: <87czu4onry.ffs@nanos.tec.linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      594b27e6
    • C
      KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit · e8ea85fb
      Chenyi Qiang 提交于
      Bus lock debug exception introduces a new bit DR6_BUS_LOCK (bit 11 of
      DR6) to indicate that bus lock #DB exception is generated. The set/clear
      of DR6_BUS_LOCK is similar to the DR6_RTM. The processor clears
      DR6_BUS_LOCK when the exception is generated. For all other #DB, the
      processor sets this bit to 1. Software #DB handler should set this bit
      before returning to the interrupted task.
      
      In VMM, to avoid breaking the CPUs without bus lock #DB exception
      support, activate the DR6_BUS_LOCK conditionally in DR6_FIXED_1 bits.
      When intercepting the #DB exception caused by bus locks, bit 11 of the
      exit qualification is set to identify it. The VMM should emulate the
      exception by clearing the bit 11 of the guest DR6.
      Co-developed-by: NXiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: NXiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: NChenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20210202090433.13441-3-chenyi.qiang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e8ea85fb
    • S
      KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model · 61a05d44
      Sean Christopherson 提交于
      Squish the Intel and AMD emulation of MSR_TSC_AUX together and tie it to
      the guest CPU model instead of the host CPU behavior.  While not strictly
      necessary to avoid guest breakage, emulating cross-vendor "architecture"
      will provide consistent behavior for the guest, e.g. WRMSR fault behavior
      won't change if the vCPU is migrated to a host with divergent behavior.
      
      Note, the "new" kvm_is_supported_user_return_msr() checks do not add new
      functionality on either SVM or VMX.  On SVM, the equivalent was
      "tsc_aux_uret_slot < 0", and on VMX the check was buried in the
      vmx_find_uret_msr() call at the find_uret_msr label.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20210504171734.1434054-15-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      61a05d44