1. 15 Oct 2021, 1 commit
  2. 13 Oct 2021, 1 commit
  3. 03 Jul 2021, 1 commit
    • KVM: X86: MMU: Use the correct inherited permissions to get shadow page · 26a4c1e7
      Authored by Lai Jiangshan
      stable inclusion
      from stable-5.10.44
      commit 6b6ff4d1f349cb35a7c7d2057819af1b14f80437
      bugzilla: 109295
      CVE: NA
      
      --------------------------------
      
      commit b1bd5cba upstream.
      
      When computing the access permissions of a shadow page, use the effective
      permissions of the walk up to that point, i.e. the logical AND of its parents'
      permissions.  Two guest PxE entries that point at the same table gfn need to
      be shadowed with different shadow pages if their parents' permissions are
      different.  KVM currently uses the effective permissions of the last
      non-leaf entry for all non-leaf entries.  Because all non-leaf SPTEs have
      full ("uwx") permissions, and the effective permissions are recorded only
      in role.access and merged into the leaves, this can lead to incorrect
      reuse of a shadow page and eventually to a missing guest protection page
      fault.
      
      For example, here is a shared pagetable:
      
         pgd[]   pud[]        pmd[]            virtual address pointers
                           /->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
              /->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
         pgd-|           (shared pmd[] as above)
              \->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
                           \->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)
      
        pud1 and pud2 point to the same pmd table, so:
        - ptr1 and ptr3 point to the same page.
        - ptr2 and ptr4 point to the same page.
      
      (pud1 and pud2 here are pud entries, while pmd1 and pmd2 here are pmd entries)
      
      - First, the guest reads from ptr1 and KVM prepares a shadow
        page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1.
        "u--" comes from the effective permissions of pgd, pud1 and
        pmd1, which are stored in pt->access.  "u--" is also used to get
        the shadow page table that pud1 points to, instead of the correct "uw-".
      
      - Then the guest writes to ptr2 and KVM reuses pud1, which is present.
        The hypervisor sets up a shadow page for ptr2 with pt->access "uw-",
        even though the shadow pmd reached via pud1 (because of the incorrect
        argument to kvm_mmu_get_page in the previous step) has role.access="u--".
      
      - Then the guest reads from ptr3.  The hypervisor reuses pud1's
        shadow pmd for pud2, because both use "u--" for their permissions.
        Thus, the shadow pmd already includes entries for both pmd1 and pmd2.
      
      - Finally, the guest writes to ptr4.  This causes no vmexit or page
        fault, because pud1's shadow page structures already included an
        "uw-" page even though its role.access was "u--".
      
      Any kind of shared pagetable can hit a similar problem in a virtual
      machine without TDP enabled, whenever the permissions inherited from
      different ancestors differ.
      
      To fix the problem, change pt->access into an array: each entry
      records only the permissions inherited from that level's ancestors,
      with nothing ANDed in from child ptes.
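
      As a rough, self-contained C model of that per-level accumulation (not
      the kernel code: the walker layout, helper names, and the pgd entry's
      "uwx" permissions are assumptions for illustration), each shadow table's
      access is the AND of the guest entries above it, so the shared pmd
      table ends up with two distinct shadow pages:

        #include <stdint.h>
        #include <stdio.h>

        #define ACC_USER  0x1
        #define ACC_WRITE 0x2
        #define ACC_EXEC  0x4

        /* pte_access[i] holds the u/w/x bits of the guest entry consumed at
         * level i of the walk (root first).  access[i] is what the shadow
         * page for the table (or page) that entry points to should use: the
         * AND of entry i and all of its ancestors, with nothing from deeper
         * levels mixed in. */
        static void walk(const uint8_t *pte_access, int levels, uint8_t *access)
        {
                uint8_t acc = ACC_USER | ACC_WRITE | ACC_EXEC;
                int i;

                for (i = 0; i < levels; i++) {
                        acc &= pte_access[i];
                        access[i] = acc;
                }
        }

        int main(void)
        {
                /* ptr1: pgd(uwx, assumed) -> pud1(uw-) -> pmd1(u--) -> pte1(uw-) */
                const uint8_t via_pud1[] = { ACC_USER | ACC_WRITE | ACC_EXEC,
                                             ACC_USER | ACC_WRITE,
                                             ACC_USER,
                                             ACC_USER | ACC_WRITE };
                /* ptr3: pgd(uwx, assumed) -> pud2(u--) -> pmd1(u--) -> pte1(uw-) */
                const uint8_t via_pud2[] = { ACC_USER | ACC_WRITE | ACC_EXEC,
                                             ACC_USER,
                                             ACC_USER,
                                             ACC_USER | ACC_WRITE };
                uint8_t a1[4], a2[4];

                walk(via_pud1, 4, a1);
                walk(via_pud2, 4, a2);

                /* The shared guest pmd table now gets two shadow pages:
                 * 0x3 (uw-) when reached through pud1, 0x1 (u--) through pud2. */
                printf("shadow pmd access via pud1 = %#x, via pud2 = %#x\n",
                       (unsigned)a1[1], (unsigned)a2[1]);
                return 0;
        }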
      
      The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/
      Remember to test it with TDP disabled.
      
      The problem had existed long before commit 41074d07 ("KVM: MMU:
      Fix inherited permissions for emulated guest pte updates"), and it
      is hard to tell which commit is the culprit, so there is no fixes
      tag here.
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210603052455.21023-1-jiangshanlai@gmail.com>
      Cc: stable@vger.kernel.org
      Fixes: cea0f0e7 ("[PATCH] KVM: MMU: Shadow page table caching")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  4. 22 Oct 2020, 1 commit
  5. 28 Sep 2020, 9 commits
  6. 17 Jul 2020, 1 commit
  7. 10 Jul 2020, 2 commits
  8. 09 Jul 2020, 4 commits
  9. 23 Jun 2020, 2 commits
  10. 10 Jun 2020, 1 commit
  11. 16 May 2020, 2 commits
  12. 21 Apr 2020, 1 commit
  13. 17 Mar 2020, 2 commits
  14. 16 Feb 2020, 1 commit
  15. 13 Feb 2020, 1 commit
    • KVM: x86/mmu: Fix struct guest_walker arrays for 5-level paging · f6ab0107
      Authored by Sean Christopherson
      Define PT_MAX_FULL_LEVELS as PT64_ROOT_MAX_LEVEL, i.e. 5, to fix shadow
      paging for 5-level guest page tables.  PT_MAX_FULL_LEVELS is used to
      size the arrays that track guest page table information, i.e. using a
      "max levels" of 4 causes KVM to access garbage beyond the end of an
      array when querying state for level 5 entries.  E.g. FNAME(gpte_changed)
      will read garbage and most likely return %true for a level 5 entry,
      soft-hanging the guest because FNAME(fetch) will restart the guest
      instead of creating SPTEs because it thinks the guest PTE has changed.
      
      Note, KVM doesn't yet support 5-level nested EPT, so PT_MAX_FULL_LEVELS
      gets to stay "4" for the PTTYPE_EPT case.
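
      A hedged sketch of what the sizing change amounts to (the real
      definitions live in KVM's paging_tmpl.h; the PTTYPE value for EPT and
      the guest_walker fields shown here are simplified assumptions):

        #define PT64_ROOT_MAX_LEVEL 5      /* deepest supported walk (LA57) */
        #define PTTYPE_EPT 18              /* value assumed for illustration */
        #define PTTYPE 64                  /* compiling the 64-bit template */

        #if PTTYPE == 64
        /* Was hard-coded to 4; must now cover 5-level guest page tables. */
        #define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL
        #elif PTTYPE == PTTYPE_EPT
        /* Nested EPT is still a 4-level walk, so 4 remains correct here. */
        #define PT_MAX_FULL_LEVELS 4
        #endif

        typedef unsigned long long gfn_t;
        typedef unsigned long long pt_element_t;

        /* With the old value of 4, a level-5 walk indexed one slot past the
         * end of these arrays and read garbage. */
        struct guest_walker_sketch {
                int level;
                gfn_t table_gfn[PT_MAX_FULL_LEVELS];
                pt_element_t ptes[PT_MAX_FULL_LEVELS];
                gfn_t gfn;
        };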
      
      Fixes: 855feb67 ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  16. 28 Jan 2020, 4 commits
  17. 21 Jan 2020, 2 commits
    • KVM: x86/mmu: Micro-optimize nEPT's bad memptype/XWR checks · b5c3c1b3
      Authored by Sean Christopherson
      Rework the handling of nEPT's bad memtype/XWR checks to micro-optimize
      the checks as much as possible.  Move the check to a separate helper,
      __is_bad_mt_xwr(), which allows the guest_rsvd_check usage in
      paging_tmpl.h to omit the check entirely for paging32/64 (bad_mt_xwr is
      always zero for non-nEPT) while retaining the bitwise-OR of the current
      code for the shadow_zero_check in walk_shadow_page_get_mmio_spte().
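
      A hedged, self-contained sketch of the resulting split (the struct
      layout and helper names approximate KVM's rsvd_bits_validate machinery
      rather than copying it):

        #include <stdbool.h>
        #include <stdint.h>

        #define BIT_ULL(n) (1ULL << (n))

        struct rsvd_bits_validate {
                uint64_t rsvd_bits_mask[2][5];
                uint64_t bad_mt_xwr;       /* always 0 for non-nEPT MMUs */
        };

        /* Memtype/XWR check in its own helper: bits 5:0 of an EPT entry
         * (memtype plus X/W/R) index a 64-entry bitmap of illegal combos,
         * so paging32/64 can skip the call entirely. */
        static inline bool __is_bad_mt_xwr(const struct rsvd_bits_validate *rsvd_check,
                                           uint64_t pte)
        {
                return rsvd_check->bad_mt_xwr & BIT_ULL(pte & 0x3f);
        }

        static inline bool __is_rsvd_bits_set(const struct rsvd_bits_validate *rsvd_check,
                                              uint64_t pte, int level)
        {
                int bit7 = (pte >> 7) & 1;

                return rsvd_check->rsvd_bits_mask[bit7][level - 1] & pte;
        }

        /* The MMIO SPTE walk keeps the deliberate bitwise OR: evaluating
         * both checks unconditionally is cheaper than a second branch. */
        static inline bool is_rsvd_spte(const struct rsvd_bits_validate *rsvd_check,
                                        uint64_t spte, int level)
        {
                return __is_rsvd_bits_set(rsvd_check, spte, level) |
                       __is_bad_mt_xwr(rsvd_check, spte);
        }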
      
      Add a comment for the bitwise-OR usage in the mmio spte walk to avoid
      future attempts to "fix" the code, which is what prompted this
      optimization in the first place[*].
      
      Opportunistically remove the superfluous '!= 0' and parentheses, and
      use BIT_ULL() instead of open coding its equivalent.
      
      The net effect is that code generation is largely unchanged for
      walk_shadow_page_get_mmio_spte(), marginally better for
      ept_prefetch_invalid_gpte(), and significantly improved for
      paging32/64_prefetch_invalid_gpte().
      
      Note, walk_shadow_page_get_mmio_spte() can't use a templated version of
      the memtype/XWR check as it works on the host's shadow PTEs, e.g. checks
      that KVM hasn't borked its EPT tables.  Even if it could be templated,
      the benefits of having a single implementation far outweigh the few uops
      that would be saved for NPT or non-TDP paging, e.g. most compilers
      inline it all the way up to kvm_mmu_page_fault().
      
      [*] https://lkml.kernel.org/r/20200108001859.25254-1-sean.j.christopherson@intel.com
      
      Cc: Jim Mattson <jmattson@google.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Arvind Sankar <nivedita@alum.mit.edu>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Reorder the reserved bit check in prefetch_invalid_gpte() · f8052a05
      Authored by Sean Christopherson
      Move the !PRESENT and !ACCESSED checks in FNAME(prefetch_invalid_gpte)
      above the call to is_rsvd_bits_set().  For a well behaved guest, the
      !PRESENT and !ACCESSED checks are far more likely to evaluate true than
      the reserved bit checks, and they do not require additional memory accesses.
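
      A hedged, standalone C sketch of the resulting order (mask values and
      the reserved-bit helper are stand-ins, not the template code); it
      corresponds to the "After" assembly dump below:

        #include <stdbool.h>
        #include <stdint.h>

        #define PT_PRESENT_MASK  (1ULL << 0)
        #define PT_ACCESSED_MASK (1ULL << 5)

        /* Stand-in for the real reserved-bit check, which has to load the
         * per-MMU rsvd_bits_validate tables from memory. */
        static bool is_rsvd_bits_set(uint64_t gpte)
        {
                return false;
        }

        /* Returns true if the gpte must not be prefetched (the caller then
         * drops the spte). */
        bool prefetch_invalid_gpte(uint64_t gpte)
        {
                if (!(gpte & PT_PRESENT_MASK))
                        return true;            /* cheap, no extra loads */

                if (!(gpte & PT_ACCESSED_MASK))
                        return true;            /* cheap, no extra loads */

                return is_rsvd_bits_set(gpte);  /* costliest check last */
        }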
      
      Before:
       Dump of assembler code for function paging32_prefetch_invalid_gpte:
         0x0000000000044240 <+0>:     callq  0x44245 <paging32_prefetch_invalid_gpte+5>
         0x0000000000044245 <+5>:     mov    %rcx,%rax
         0x0000000000044248 <+8>:     shr    $0x7,%rax
         0x000000000004424c <+12>:    and    $0x1,%eax
         0x000000000004424f <+15>:    lea    0x0(,%rax,4),%r8
         0x0000000000044257 <+23>:    add    %r8,%rax
         0x000000000004425a <+26>:    mov    %rcx,%r8
         0x000000000004425d <+29>:    and    0x120(%rsi,%rax,8),%r8
         0x0000000000044265 <+37>:    mov    0x170(%rsi),%rax
         0x000000000004426c <+44>:    shr    %cl,%rax
         0x000000000004426f <+47>:    and    $0x1,%eax
         0x0000000000044272 <+50>:    or     %rax,%r8
         0x0000000000044275 <+53>:    jne    0x4427c <paging32_prefetch_invalid_gpte+60>
         0x0000000000044277 <+55>:    test   $0x1,%cl
         0x000000000004427a <+58>:    jne    0x4428a <paging32_prefetch_invalid_gpte+74>
         0x000000000004427c <+60>:    mov    %rdx,%rsi
         0x000000000004427f <+63>:    callq  0x44080 <drop_spte>
         0x0000000000044284 <+68>:    mov    $0x1,%eax
         0x0000000000044289 <+73>:    retq
         0x000000000004428a <+74>:    xor    %eax,%eax
         0x000000000004428c <+76>:    and    $0x20,%ecx
         0x000000000004428f <+79>:    jne    0x44289 <paging32_prefetch_invalid_gpte+73>
         0x0000000000044291 <+81>:    mov    %rdx,%rsi
         0x0000000000044294 <+84>:    callq  0x44080 <drop_spte>
         0x0000000000044299 <+89>:    mov    $0x1,%eax
         0x000000000004429e <+94>:    jmp    0x44289 <paging32_prefetch_invalid_gpte+73>
       End of assembler dump.
      
      After:
       Dump of assembler code for function paging32_prefetch_invalid_gpte:
         0x0000000000044240 <+0>:     callq  0x44245 <paging32_prefetch_invalid_gpte+5>
         0x0000000000044245 <+5>:     test   $0x1,%cl
         0x0000000000044248 <+8>:     je     0x4424f <paging32_prefetch_invalid_gpte+15>
         0x000000000004424a <+10>:    test   $0x20,%cl
         0x000000000004424d <+13>:    jne    0x4425d <paging32_prefetch_invalid_gpte+29>
         0x000000000004424f <+15>:    mov    %rdx,%rsi
         0x0000000000044252 <+18>:    callq  0x44080 <drop_spte>
         0x0000000000044257 <+23>:    mov    $0x1,%eax
         0x000000000004425c <+28>:    retq
         0x000000000004425d <+29>:    mov    %rcx,%rax
         0x0000000000044260 <+32>:    mov    (%rsi),%rsi
         0x0000000000044263 <+35>:    shr    $0x7,%rax
         0x0000000000044267 <+39>:    and    $0x1,%eax
         0x000000000004426a <+42>:    lea    0x0(,%rax,4),%r8
         0x0000000000044272 <+50>:    add    %r8,%rax
         0x0000000000044275 <+53>:    mov    %rcx,%r8
         0x0000000000044278 <+56>:    and    0x120(%rsi,%rax,8),%r8
         0x0000000000044280 <+64>:    mov    0x170(%rsi),%rax
         0x0000000000044287 <+71>:    shr    %cl,%rax
         0x000000000004428a <+74>:    and    $0x1,%eax
         0x000000000004428d <+77>:    mov    %rax,%rcx
         0x0000000000044290 <+80>:    xor    %eax,%eax
         0x0000000000044292 <+82>:    or     %rcx,%r8
         0x0000000000044295 <+85>:    je     0x4425c <paging32_prefetch_invalid_gpte+28>
         0x0000000000044297 <+87>:    mov    %rdx,%rsi
         0x000000000004429a <+90>:    callq  0x44080 <drop_spte>
         0x000000000004429f <+95>:    mov    $0x1,%eax
         0x00000000000442a4 <+100>:   jmp    0x4425c <paging32_prefetch_invalid_gpte+28>
       End of assembler dump.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  18. 09 Jan 2020, 4 commits
    • KVM: x86/mmu: WARN on an invalid root_hpa · 0c7a98e3
      Authored by Sean Christopherson
      WARN on the existing invalid root_hpa checks in __direct_map() and
      FNAME(fetch).  The "legitimate" path that invalidated root_hpa in the
      middle of a page fault is long since gone, i.e. it should no longer be
      possible for root_hpa to be invalidated in the middle of a page fault[*].
      
      The root_hpa checks were added by two related commits
      
        989c6b34 ("KVM: MMU: handle invalid root_hpa at __direct_map")
        37f6a4e2 ("KVM: x86: handle invalid root_hpa everywhere")
      
      to fix a bug where nested_vmx_vmexit() could be called *in the middle*
      of a page fault.  At the time, vmx_interrupt_allowed(), which was and
      still is used by kvm_can_do_async_pf() via ->interrupt_allowed(),
      directly invoked nested_vmx_vmexit() to switch from L2 to L1 to emulate
      a VM-Exit on a pending interrupt.  Emulating the nested VM-Exit resulted
      in root_hpa being invalidated by kvm_mmu_reset_context() without
      explicitly terminating the page fault.
      
      Now that root_hpa is checked for validity by kvm_mmu_page_fault(), WARN
      on an invalid root_hpa to detect any flows that reset the MMU while
      handling a page fault.  The broken vmx_interrupt_allowed() behavior has
      long since been fixed and resetting the MMU during a page fault should
      not be considered legal behavior.
      
      [*] It's actually technically possible in FNAME(page_fault)() because it
          calls inject_page_fault() when the guest translation is invalid, but
          in that case the page fault handling is immediately terminated.
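
      A self-contained C sketch of the hardened check described above
      (VALID_PAGE()/WARN_ON_ONCE() and the fault-handler context are modeled
      here, not copied from __direct_map() or FNAME(fetch)):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define INVALID_PAGE  (~0ULL)
        #define VALID_PAGE(x) ((x) != INVALID_PAGE)

        enum { RET_PF_RETRY = 1 };

        /* Model of WARN_ON_ONCE(): report the first violation, return cond. */
        static bool warn_on_once(bool cond, const char *what)
        {
                static bool warned;

                if (cond && !warned) {
                        warned = true;
                        fprintf(stderr, "WARNING: %s\n", what);
                }
                return cond;
        }

        int direct_map_sketch(uint64_t root_hpa)
        {
                /* The silent bail-out on an invalid root becomes a loud one:
                 * resetting the MMU mid page fault is treated as a bug. */
                if (warn_on_once(!VALID_PAGE(root_hpa), "invalid root_hpa"))
                        return RET_PF_RETRY;

                /* ... install SPTEs for the fault ... */
                return 0;
        }
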
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Move calls to thp_adjust() down a level · 4cd071d1
      Authored by Sean Christopherson
      Move the calls to thp_adjust() down a level from the page fault handlers
      to the map/fetch helpers and remove the page count shuffling done in
      thp_adjust().
      
      Despite holding a reference to the underlying page while processing a
      page fault, the page fault flows don't actually rely on holding a
      reference to the page when thp_adjust() is called.  At that point, the
      fault handlers hold mmu_lock, which prevents mmu_notifier from completing
      any invalidations, and have verified no invalidations from mmu_notifier
      have occurred since the page reference was acquired (which is done prior
      to taking mmu_lock).
      
      The kvm_release_pfn_clean()/kvm_get_pfn() dance in thp_adjust() is a
      quirk that is necessitated because thp_adjust() modifies the pfn that is
      consumed by its caller.  Because the page fault handlers call
      kvm_release_pfn_clean() on said pfn, thp_adjust() needs to transfer the
      reference to the correct pfn purely for correctness when the pfn is
      released.
      
      Calling thp_adjust() from __direct_map() and FNAME(fetch) means the pfn
      adjustment doesn't change the pfn as seen by the page fault handlers,
      i.e. the pfn released by the page fault handlers is the same pfn that
      was returned by gfn_to_pfn().
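
      For reference, a minimal C model of the pfn adjustment thp_adjust()
      performs (level names, sizes, and the guard condition are illustrative
      simplifications, not the kernel helper): the mapping is promoted to
      2MiB and the pfn is aligned down to the head of the huge page, which is
      exactly the caller-visible change discussed above:

        #include <stdint.h>

        typedef uint64_t kvm_pfn_t;
        typedef uint64_t gfn_t;

        #define PT_PAGE_TABLE_LEVEL 1
        #define PT_DIRECTORY_LEVEL  2              /* 2MiB mappings */
        #define PAGES_PER_2M        (1ULL << 9)    /* 512 4KiB pages */

        /* If the host backs the gfn with a huge page, map at 2MiB and point
         * the SPTE at the first pfn of that huge page; gfn and pfn must stay
         * congruent modulo the huge-page size. */
        void thp_adjust_sketch(gfn_t gfn, int *levelp, kvm_pfn_t *pfnp)
        {
                uint64_t mask = PAGES_PER_2M - 1;

                if (*levelp == PT_PAGE_TABLE_LEVEL &&
                    (gfn & mask) == (*pfnp & mask)) {
                        *levelp = PT_DIRECTORY_LEVEL;
                        *pfnp &= ~mask;    /* the pfn the caller sees changes */
                }
        }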
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Incorporate guest's page level into max level for shadow MMU · cbe1e6f0
      Authored by Sean Christopherson
      Restrict the max level for a shadow page based on the guest's level
      instead of capping the level after the fact for host-mapped huge pages,
      e.g. hugetlbfs pages.  Explicitly capping the max level using the guest
      mapping level also eliminates FNAME(page_fault)'s subtle dependency on
      THP only supporting 2mb pages.
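
      A hedged one-line sketch of the idea (illustrative name, not the
      template code): the cap is derived from the guest walker's own level up
      front, instead of being applied after the host mapping level is known:

        /* A guest 4KiB mapping can never be shadowed by a huge SPTE, no
         * matter how the host backs the memory (hugetlbfs, THP, ...). */
        static inline int shadow_max_level(int guest_level, int host_max_level)
        {
                return guest_level < host_max_level ? guest_level : host_max_level;
        }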
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Refactor handling of forced 4k pages in page faults · 39ca1ecb
      Authored by Sean Christopherson
      Refactor the page fault handlers and mapping_level() to track the max
      allowed page level instead of only tracking if a 4k page is mandatory
      due to one restriction or another.  This paves the way for cleanly
      consolidating tdp_page_fault() and nonpaging_page_fault(), and for
      eliminating a redundant check on mmu_gfn_lpage_is_disallowed().
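
      A hedged sketch of the shape of the refactor (standalone C with
      illustrative names; the real signatures differ): a "force 4k" boolean
      collapses every restriction into one bit, while a max_level cap lets
      independent restrictions compose:

        #include <stdbool.h>

        enum { PT_PAGE_TABLE_LEVEL = 1, PT_DIRECTORY_LEVEL = 2, PT_PDPE_LEVEL = 3 };

        /* Before (roughly): every restriction had to collapse into one bool. */
        int pick_level_old(bool force_pt_level, int host_level)
        {
                return force_pt_level ? PT_PAGE_TABLE_LEVEL : host_level;
        }

        /* After (roughly): each restriction just lowers the cap, so guest
         * level, lpage-disallowed slots, and host backing compose cleanly. */
        int pick_level_new(int max_level, int host_level)
        {
                return host_level < max_level ? host_level : max_level;
        }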
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>