1. 11 Feb 2022 (8 commits)
    • KVM: x86/mmu: Rename TDP MMU functions that handle shadow pages · c298a30c
      Authored by David Matlack
      Rename 3 functions in tdp_mmu.c that handle shadow pages:
      
        alloc_tdp_mmu_page()  -> tdp_mmu_alloc_sp()
        tdp_mmu_link_page()   -> tdp_mmu_link_sp()
        tdp_mmu_unlink_page() -> tdp_mmu_unlink_sp()
      
      These changes make tdp_mmu a consistent prefix before the verb in each
      function name, and make it clearer that these functions deal with
      kvm_mmu_page structs rather than struct pages.
      
      One could argue that "shadow page" is the wrong term for a page table in
      the TDP MMU since it never actually shadows a guest page table.
      However, "shadow page" (or "sp" for short) has evolved to become the
      standard term in KVM when referring to a kvm_mmu_page struct, and its
      associated page table and other metadata, regardless of whether the page
      table shadows a guest page table. So this commit just makes the TDP MMU
      more consistent with the rest of KVM.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-6-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Change tdp_mmu_{set,zap}_spte_atomic() to return 0/-EBUSY · 3e72c791
      Authored by David Matlack
      tdp_mmu_set_spte_atomic() and tdp_mmu_zap_spte_atomic() return a bool
      with true indicating the SPTE modification was successful and false
      indicating failure. Change these functions to return an int instead
      since that is the common practice.
      
      Opportunistically fix up the kernel-doc style for the Return section
      above tdp_mmu_set_spte_atomic().
      
      No functional change intended.
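
      As a minimal sketch of the convention change (illustrative only; the
      real functions take a kvm and TDP iterator and carry more logic, and the
      helper names below are hypothetical):

        /* Before: true on success, false if another thread changed the SPTE. */
        static bool set_spte_atomic_bool(u64 *sptep, u64 old_spte, u64 new_spte)
        {
                return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
        }

        /* After: 0 on success, -EBUSY if the cmpxchg lost the race. */
        static int set_spte_atomic(u64 *sptep, u64 old_spte, u64 new_spte)
        {
                if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
                        return -EBUSY;

                return 0;
        }
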
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-5-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Automatically update iter->old_spte if cmpxchg fails · 3255530a
      Authored by David Matlack
      Consolidate a bunch of code that was manually re-reading the spte if the
      cmpxchg failed. There is no extra cost of doing this because we already
      have the spte value as a result of the cmpxchg (and in fact this
      eliminates re-reading the spte), and none of the call sites depend on
      iter->old_spte retaining the stale spte value.
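
      A rough sketch of the consolidated helper (simplified; the real code
      also handles REMOVED_SPTE and dereferences the SPTE pointer under RCU):

        static inline bool tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
        {
                u64 old_spte = cmpxchg64(iter->sptep, iter->old_spte, new_spte);

                if (old_spte != iter->old_spte) {
                        /*
                         * The failed cmpxchg handed back the current value;
                         * cache it instead of re-reading the SPTE so the
                         * caller can simply retry.
                         */
                        iter->old_spte = old_spte;
                        return false;
                }

                return true;
        }
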
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-4-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Check SPTE writable invariants when setting leaf SPTEs · 115111ef
      Authored by David Matlack
      Check SPTE writable invariants when setting SPTEs rather than in
      spte_can_locklessly_be_made_writable(). By the time KVM checks
      spte_can_locklessly_be_made_writable(), the SPTE has long since been
      corrupted.
      
      Note that these invariants only apply to shadow-present leaf SPTEs (i.e.
      not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the
      restriction and only instrument the code paths that set shadow-present
      leaf SPTEs.
      
      To account for access tracking, also check the SPTE writable invariants
      when marking an SPTE as an access track SPTE. This also lets us remove
      a redundant WARN from mark_spte_for_access_track().
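
      One of the invariants, roughly sketched (mask and helper names as used
      elsewhere in KVM's SPTE code; simplified):

        static inline void check_spte_writable_invariants(u64 spte)
        {
                /* A hardware-writable SPTE must also be MMU-writable. */
                WARN_ONCE((spte & PT_WRITABLE_MASK) &&
                          !(spte & shadow_mmu_writable_mask),
                          "kvm: writable SPTE is not MMU-writable: %llx", spte);
        }
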
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220125230518.1697048-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/tdp_mmu: Remove unused "kvm" of kvm_tdp_mmu_get_root() · ad6d6b94
      Authored by Jinrong Liang
      The "struct kvm *kvm" parameter of kvm_tdp_mmu_get_root() is not used,
      so remove it. No functional change intended.
      Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
      Message-Id: <20220125095909.38122-5-cloudliang@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap _all_ roots when unmapping gfn range in TDP MMU · d62007ed
      Authored by Sean Christopherson
      Zap both valid and invalid roots when zapping/unmapping a gfn range, as
      KVM must ensure it holds no references to the freed page after returning
      from the unmap operation.  Most notably, the TDP MMU doesn't zap invalid
      roots in mmu_notifier callbacks.  This leads to use-after-free and other
      issues if the mmu_notifier runs to completion while an invalid root
      zapper yields as KVM fails to honor the requirement that there must be
      _no_ references to the page after the mmu_notifier returns.
      
      The bug is most easily reproduced by hacking KVM to cause a collision
      between set_nx_huge_pages() and kvm_mmu_notifier_release(), but the bug
      exists between kvm_mmu_notifier_invalidate_range_start() and memslot
      updates as well.  Invalidating a root ensures pages aren't accessible by
      the guest, and KVM won't read or write page data itself, but KVM will
      trigger e.g. kvm_set_pfn_dirty() when zapping SPTEs, and thus completing
      a zap of an invalid root _after_ the mmu_notifier returns is fatal.
      
        WARNING: CPU: 24 PID: 1496 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:173 [kvm]
        RIP: 0010:kvm_is_zone_device_pfn+0x96/0xa0 [kvm]
        Call Trace:
         <TASK>
         kvm_set_pfn_dirty+0xa8/0xe0 [kvm]
         __handle_changed_spte+0x2ab/0x5e0 [kvm]
         __handle_changed_spte+0x2ab/0x5e0 [kvm]
         __handle_changed_spte+0x2ab/0x5e0 [kvm]
         zap_gfn_range+0x1f3/0x310 [kvm]
         kvm_tdp_mmu_zap_invalidated_roots+0x50/0x90 [kvm]
         kvm_mmu_zap_all_fast+0x177/0x1a0 [kvm]
         set_nx_huge_pages+0xb4/0x190 [kvm]
         param_attr_store+0x70/0x100
         module_attr_store+0x19/0x30
         kernfs_fop_write_iter+0x119/0x1b0
         new_sync_write+0x11c/0x1b0
         vfs_write+0x1cc/0x270
         ksys_write+0x5f/0xe0
         do_syscall_64+0x38/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
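
      Roughly, the unmap path must walk every root, as sketched below
      (simplified; the real iteration macro and zap helper take additional
      arguments):

        bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm,
                                         struct kvm_gfn_range *range, bool flush)
        {
                struct kvm_mmu_page *root;

                /*
                 * Zap invalid roots too: an in-progress zap of an invalid root
                 * may still hold references that the mmu_notifier caller
                 * expects to be gone once this returns.
                 */
                for_each_tdp_mmu_root(kvm, root, range->slot->as_id)
                        flush = zap_gfn_range(kvm, root, range->start,
                                              range->end, flush);

                return flush;
        }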
      
      Fixes: b7cccd39 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211215011557.399940-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Move "invalid" check out of kvm_tdp_mmu_get_root() · 04dc4e6c
      Authored by Sean Christopherson
      Move the check for an invalid root out of kvm_tdp_mmu_get_root() and into
      the one place it actually matters, tdp_mmu_next_root(), as the other user
      already has an implicit validity check.  A future bug fix will need to
      get references to invalid roots to honor mmu_notifier requests; there's
      no point in forcing what will be a common path to open code getting a
      reference to a root.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211215011557.399940-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use common TDP MMU zap helper for MMU notifier unmap hook · 83b83a02
      Authored by Sean Christopherson
      Use the common TDP MMU zap helper when handling an MMU notifier unmap
      event, the two flows are semantically identical.  Consolidate the code in
      preparation for a future bug fix, as both kvm_tdp_mmu_unmap_gfn_range()
      and __kvm_tdp_mmu_zap_gfn_range() are guilty of not zapping SPTEs in
      invalid roots.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211215011557.399940-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 20 Jan 2022 (1 commit)
    • KVM: x86/mmu: Fix write-protection of PTs mapped by the TDP MMU · 7c8a4742
      Authored by David Matlack
      When the TDP MMU is write-protecting GFNs for page table protection (as
      opposed to for dirty logging, or due to the HVA not being writable), it
      checks if the SPTE is already write-protected and if so skips modifying
      the SPTE and the TLB flush.
      
      This behavior is incorrect because it fails to check if the SPTE
      is write-protected for page table protection, i.e. fails to check
      that MMU-writable is '0'.  If the SPTE was write-protected for dirty
      logging but not page table protection, the SPTE could locklessly be made
      writable, and vCPUs could still be running with writable mappings cached
      in their TLB.
      
      Fix this by only skipping setting the SPTE if the SPTE is already
      write-protected *and* MMU-writable is already clear.  Technically,
      checking only MMU-writable would suffice; a SPTE cannot be writable
      without MMU-writable being set.  But check both to be paranoid and
      because it arguably yields more readable code.
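
      The shape of the fix, roughly (a simplified excerpt; the real loop also
      guards on shadow-present and non-leaf SPTEs):

                new_spte = iter.old_spte &
                           ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);

                /*
                 * Skip only if nothing would change, i.e. both the hardware
                 * writable bit and MMU-writable are already clear.
                 */
                if (new_spte == iter.old_spte)
                        break;

                tdp_mmu_set_spte(kvm, &iter, new_spte);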
      
      Fixes: 46044f72 ("kvm: x86/mmu: Support write protection for nesting in tdp MMU")
      Cc: stable@vger.kernel.org
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220113233020.3986005-2-dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 20 Dec 2021 (1 commit)
    • KVM: x86/mmu: Don't advance iterator after restart due to yielding · 3a0f64de
      Authored by Sean Christopherson
      After dropping mmu_lock in the TDP MMU, restart the iterator during
      tdp_iter_next() and do not advance the iterator.  Advancing the iterator
      results in skipping the top-level SPTE and all its children, which is
      fatal if any of the skipped SPTEs were not visited before yielding.
      
      When zapping all SPTEs, i.e. when min_level == root_level, restarting the
      iter and then invoking tdp_iter_next() is always fatal if the current gfn
      has a valid SPTE, as advancing the iterator results in try_step_side()
      skipping the current gfn, which wasn't visited before yielding.
      
      Sprinkle WARNs on iter->yielded being true in various helpers that are
      often used in conjunction with yielding, and tag the helper with
      __must_check to reduce the probability of improper usage.
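
      A sketch of the resulting iterator behavior (simplified):

        void tdp_iter_next(struct tdp_iter *iter)
        {
                if (iter->yielded) {
                        /*
                         * The walk dropped mmu_lock; restart from the root
                         * instead of advancing so the current entry and its
                         * children are revisited rather than skipped.
                         */
                        tdp_iter_restart(iter);
                        return;
                }

                /* ... otherwise, step down/side/up as usual ... */
        }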
      
      Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
      SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
      the shadow page will be leaked and KVM will WARN accordingly.
      
        WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2a0 [kvm]
         kvm_vcpu_release+0x34/0x60 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5c/0x90
         do_exit+0x364/0xa10
         ? futex_unqueue+0x38/0x60
         do_group_exit+0x33/0xa0
         get_signal+0x155/0x850
         arch_do_signal_or_restart+0xed/0x750
         exit_to_user_mode_prepare+0xc5/0x120
         syscall_exit_to_user_mode+0x1d/0x40
         do_syscall_64+0x48/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
      kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
      marking a struct page as dirty/accessed after it has been put back on the
      free list.  This directly triggers a WARN due to encountering a page with
      page_count() == 0, but it can also lead to data corruption and additional
      errors in the kernel.
      
        WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
        RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
        Call Trace:
         <TASK>
         kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
         __handle_changed_spte+0x92e/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         __handle_changed_spte+0x63c/0xca0 [kvm]
         zap_gfn_range+0x549/0x620 [kvm]
         kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
         mmu_free_root_page+0x219/0x2c0 [kvm]
         kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
         kvm_mmu_unload+0x1c/0xa0 [kvm]
         kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
         kvm_put_kvm+0x3b1/0x8b0 [kvm]
         kvm_vcpu_release+0x4e/0x70 [kvm]
         __fput+0x1f7/0x8c0
         task_work_run+0xf8/0x1a0
         do_exit+0x97b/0x2230
         do_group_exit+0xda/0x2a0
         get_signal+0x3be/0x1e50
         arch_do_signal_or_restart+0x244/0x17f0
         exit_to_user_mode_prepare+0xcb/0x120
         syscall_exit_to_user_mode+0x1d/0x40
         do_syscall_64+0x4d/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Note, the underlying bug existed even before commit 1af4a960 ("KVM:
      x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
      tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
      incorrectly advance past a top-level entry when yielding on a lower-level
      entry.  But with respect to leaking shadow pages, the bug was introduced
      by yielding before processing the current gfn.
      
      Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
      callers could jump to their "retry" label.  The downside of that approach
      is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
      in the loop, and there's no easy way to enforce that requirement.
      
      Ideally, KVM would handle the cond_resched() fully within the iterator
      macro (the code is actually quite clean) and avoid this entire class of
      bugs, but that is extremely difficult to do while also supporting yielding
      after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
      SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
      bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
      may block operations on the SPTE for a significant amount of time.
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 1af4a960 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
      Reported-by: Ignat Korchagin <ignat@cloudflare.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211214033528.123268-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 08 Dec 2021 (1 commit)
  5. 30 Nov 2021 (2 commits)
  6. 18 Nov 2021 (2 commits)
  7. 11 Nov 2021 (1 commit)
  8. 22 Oct 2021 (1 commit)
  9. 01 Oct 2021 (12 commits)
  10. 30 Sep 2021 (1 commit)
    • KVM: X86: Don't flush current tlb on shadow page modification · bd047e54
      Authored by Lai Jiangshan
      After any shadow page modification, flushing the TLB only on the current
      vCPU is odd, since other vCPUs' TLBs might still be stale.
      
      In other words, if any mandatory TLB flush is needed after a shadow page
      modification, SET_SPTE_NEED_REMOTE_TLB_FLUSH or remote_flush should be
      set and the TLBs of all vCPUs should be flushed.  There is no point in
      flushing only the current TLB except when the request stems from the
      vCPU's or pCPU's own activities.
      
      If there is a bug where a mandatory TLB flush is required but
      SET_SPTE_NEED_REMOTE_TLB_FLUSH/remote_flush fails to be set, this patch
      will expose the bug in a more destructive way.  The related code paths
      have been checked and no missing SET_SPTE_NEED_REMOTE_TLB_FLUSH has been
      found so far.
      
      Currently, no optional TLB flushing remains now that the sync-page code
      has been changed to flush the TLB in a timely manner, so the local
      flushing code can simply be removed.
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210918005636.3675-5-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. 21 Aug 2021 (4 commits)
  12. 13 Aug 2021 (2 commits)
    • KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs · 0103098f
      Authored by Sean Christopherson
      Set the min_level for the TDP iterator at the root level when zapping all
      SPTEs to optimize the iterator's try_step_down().  Zapping a non-leaf
      SPTE will recursively zap all its children, thus there is no need for the
      iterator to attempt to step down.  This avoids rereading the top-level
      SPTEs after they are zapped by causing try_step_down() to short-circuit.
      
      In most cases, optimizing try_step_down() will be in the noise as the cost
      of zapping SPTEs completely dominates the overall time.  The optimization
      is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
      is zapping in response to multiple memslot updates when userspace is adding
      and removing read-only memslots for option ROMs.  In that case, the task
      doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
      for read and thus can be a noisy neighbor of sorts.
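
      A hedged sketch of the optimization (argument list simplified, and the
      zap helper below is hypothetical):

                /*
                 * When zapping everything, children of a zapped SPTE are torn
                 * down recursively anyway, so never step below the root level.
                 */
                int min_level = zap_all ? root->role.level : PG_LEVEL_4K;

                for_each_tdp_pte_min_level(iter, root, min_level, start, end)
                        tdp_mmu_zap_spte(kvm, &iter);
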
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs · 524a1e4e
      Authored by Sean Christopherson
      Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
      really zap all SPTEs in this case.  As is, zap_gfn_range() skips non-leaf
      SPTEs whose range exceeds the range to be zapped.  If shadow_phys_bits is
      not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
      paging, the "zap all" flows will skip top-level SPTEs whose range extends
      beyond shadow_phys_bits and leak their SPs when the VM is destroyed.
      
      Use the current upper bound (based on host.MAXPHYADDR) to detect that the
      caller wants to zap all SPTEs, e.g. instead of using the max theoretical
      gfn, 1 << (52 - 12).  The more precise upper bound allows the TDP iterator
      to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.
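
      A sketch of the bound computation (illustrative; variable names are
      approximate):

                /*
                 * "Zap all" passes end == -1ull; clamp the walk to the host's
                 * addressable GFN range rather than the theoretical 2^52
                 * maximum so the iterator can terminate early.
                 */
                gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
                bool zap_all = (start == 0 && end >= max_gfn_host);
                gfn_t last_gfn = min(end, max_gfn_host);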
      
      Add a WARN on kvm->arch.tdp_mmu_pages when the TDP MMU is destroyed to
      help future debuggers should KVM decide to leak SPTEs again.
      
      The bug is most easily reproduced by running (and unloading!) KVM in a
      VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.
      
        =============================================================================
        BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
        Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
        CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack_lvl+0x45/0x59
         slab_err+0x95/0xc9
         __kmem_cache_shutdown.cold+0x3c/0x158
         kmem_cache_destroy+0x3d/0xf0
         kvm_mmu_module_exit+0xa/0x30 [kvm]
         kvm_arch_exit+0x5d/0x90 [kvm]
         kvm_exit+0x78/0x90 [kvm]
         vmx_exit+0x1a/0x50 [kvm_intel]
         __x64_sys_delete_module+0x13f/0x220
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. 06 Aug 2021 (1 commit)
    • KVM: x86/mmu: Leverage vcpu->last_used_slot in tdp_mmu_map_handle_target_level · 081de470
      Authored by David Matlack
      The existing TDP MMU methods to handle dirty logging are vcpu-agnostic
      since they can be driven by MMU notifiers and other non-vcpu-specific
      events in addition to page faults. However, this means that the TDP MMU
      is not benefiting from the new vcpu->last_used_slot. Fix that by
      introducing a tdp_mmu_map_set_spte_atomic() which is only called during
      a TDP page fault and has access to the kvm_vcpu for fast slot lookups.
      
      This improves "Populate memory time" in dirty_log_perf_test by 5%:
      
      Command                         | Before           | After
      ------------------------------- | ---------------- | -------------
      ./dirty_log_perf_test -v64 -x64 | 5.472321072s     | 5.169832886s
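
      A hedged sketch of what the new fault-path helper plausibly looks like
      (simplified; not the verbatim patch):

        /* Page-fault path: a vCPU is in hand, so the memslot lookup for dirty
         * logging can hit vcpu->last_used_slot instead of a full search. */
        static inline bool tdp_mmu_map_set_spte_atomic(struct kvm_vcpu *vcpu,
                                                       struct tdp_iter *iter,
                                                       u64 new_spte)
        {
                struct kvm_memory_slot *slot;

                if (!tdp_mmu_set_spte_atomic_no_dirty_log(vcpu->kvm, iter, new_spte))
                        return false;

                if (is_writable_pte(new_spte)) {
                        /* Checks vcpu->last_used_slot before searching memslots. */
                        slot = kvm_vcpu_gfn_to_memslot(vcpu, iter->gfn);
                        if (slot && kvm_slot_dirty_track_enabled(slot))
                                mark_page_dirty_in_slot(vcpu->kvm, slot, iter->gfn);
                }

                return true;
        }
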
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210804222844.1419481-5-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  14. 03 Aug 2021 (1 commit)
  15. 02 Aug 2021 (2 commits)