1. 21 October 2021, 1 commit
  2. 23 September 2021, 1 commit
  3. 06 September 2021, 3 commits
    • KVM: MMU: mark role_regs and role accessors as maybe unused · 4ac21457
      Paolo Bonzini authored
      It is reasonable for these functions to be used only in some configurations,
      for example only if the host is 64-bit (and therefore supports 64-bit
      guests).  It is also reasonable to keep the role_regs and role accessors
      in sync even though some of the accessors may be used only for one of the
      two sets (as is the case currently for CR4.LA57).
      
      Because clang reports warnings for unused inlines declared in a .c file,
      mark both sets of accessors as __maybe_unused.
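
      A minimal sketch of the pattern (the accessor name and struct layout below
      are illustrative; the real accessors are generated by macros in mmu.c):

          /*
           * Defined in a .c file, so clang warns if a configuration never calls
           * it; __maybe_unused suppresses the warning without deleting the helper.
           */
          static inline bool __maybe_unused is_cr4_la57(const struct kvm_mmu_role_regs *regs)
          {
                  return !!(regs->cr4 & X86_CR4_LA57);
          }
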
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()" · e7177339
      Sean Christopherson authored
      Revert a misguided illegal GPA check when "translating" a non-nested GPA.
      The check is woefully incomplete as it does not fill in @exception as
      expected by all callers, which leads to KVM attempting to inject a bogus
      exception, potentially exposing kernel stack information in the process.
      
       WARNING: CPU: 0 PID: 8469 at arch/x86/kvm/x86.c:525 exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
       CPU: 1 PID: 8469 Comm: syz-executor531 Not tainted 5.14.0-rc7-syzkaller #0
       RIP: 0010:exception_type+0x98/0xb0 arch/x86/kvm/x86.c:525
       Call Trace:
        x86_emulate_instruction+0xef6/0x1460 arch/x86/kvm/x86.c:7853
        kvm_mmu_page_fault+0x2f0/0x1810 arch/x86/kvm/mmu/mmu.c:5199
        handle_ept_misconfig+0xdf/0x3e0 arch/x86/kvm/vmx/vmx.c:5336
        __vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6021 [inline]
        vmx_handle_exit+0x336/0x1800 arch/x86/kvm/vmx/vmx.c:6038
        vcpu_enter_guest+0x2a1c/0x4430 arch/x86/kvm/x86.c:9712
        vcpu_run arch/x86/kvm/x86.c:9779 [inline]
        kvm_arch_vcpu_ioctl_run+0x47d/0x1b20 arch/x86/kvm/x86.c:10010
        kvm_vcpu_ioctl+0x49e/0xe50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3652
      
      The bug has escaped notice because practically speaking the GPA check is
      useless.  The GPA check in question only comes into play when KVM is
      walking guest page tables (or "translating" CR3), and KVM already handles
      illegal GPA checks by setting reserved bits in rsvd_bits_mask for each
      PxE, or in the case of CR3 for loading the PDPTRs, manually checks for an
      illegal CR3.  This particular failure doesn't hit the existing reserved
      bits checks because syzbot sets guest.MAXPHYADDR=1, and the IA32 architecture
      simply doesn't allow for such an absurd MAXPHYADDR, e.g. 32-bit paging
      doesn't define any reserved PA bits checks, which KVM emulates by only
      incorporating the reserved PA bits into the "high" bits, i.e. bits 63:32.
      
      Simply remove the bogus check.  There is zero meaningful value and no
      architectural justification for supporting guest.MAXPHYADDR < 32, and
      properly filling the exception would introduce non-trivial complexity.
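
      For reference, the non-nested helper after the revert reduces to an identity
      mapping, roughly along these lines (simplified sketch, not the verbatim
      kernel code):

          /*
           * Non-nested "translation" is a pass-through: no range check, and
           * @exception is never filled because the identity mapping cannot fault.
           */
          static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
                                     struct x86_exception *exception)
          {
                  return gpa;
          }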
      
      This reverts commit ec7771ab.
      
      Fixes: ec7771ab ("KVM: x86: mmu: Add guest physical address check in translate_gpa()")
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+200c08e88ae818f849ce@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210831164224.1119728-2-seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host · a717a780
      Sean Christopherson authored
      Include pml5_root in the set of special roots if and only if the host,
      and thus NPT, is using 5-level paging.  mmu_alloc_special_roots() expects
      special roots to be allocated as a bundle, i.e. they're either all valid
      or all NULL.  But for pml5_root, that expectation only holds true if the
      host uses 5-level paging, which causes KVM to WARN about pml5_root being
      NULL when the other special roots are valid.
      
      The silver lining of 4-level vs. 5-level NPT being tied to the host
      kernel's paging level is that KVM's shadow root level is constant; unlike
      VMX's EPT, KVM can't choose 4-level NPT based on guest.MAXPHYADDR.  That
      means KVM can still expect pml5_root to be bundled with the other special
      roots, it just needs to be conditioned on the shadow root level.
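
      In sketch form, the bundle check becomes something like this (field and
      constant names approximate; treat them as an assumption rather than the
      exact kernel code):

          /* pml5_root only belongs to the bundle when shadowing 5-level NPT. */
          bool need_pml5 = mmu->shadow_root_level > PT64_ROOT_4LEVEL;

          if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root))
                  return 0;       /* all required special roots are already allocated */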
      
      Fixes: cb0f722a ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host")
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210824005824.205536-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 21 August 2021, 13 commits
  5. 13 August 2021, 1 commit
    • KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock · ce25681d
      Sean Christopherson authored
      Add yet another spinlock for the TDP MMU and take it when marking indirect
      shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
      nested TDP, KVM may encounter shadow pages for the TDP entries managed by
      L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
      is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
      misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
      which runs with mmu_lock held for read, not write.
      
      Lack of a critical section manifests most visibly as an underflow of
      unsync_children in clear_unsync_child_bit() due to unsync_children being
      corrupted when multiple CPUs write it without a critical section and
      without atomic operations.  But underflow is the best case scenario.  The
      worst case scenario is that unsync_children prematurely hits '0' and
      leads to guest memory corruption due to KVM neglecting to properly sync
      shadow pages.
      
      Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
      would functionally be ok.  Usurping the lock could degrade performance when
      building upper level page tables on different vCPUs, especially since the
      unsync flow could hold the lock for a comparatively long time depending on
      the number of indirect shadow pages and the depth of the paging tree.
      
      For simplicity, take the lock for all MMUs, even though KVM could fairly
      easily know that mmu_lock is held for write.  If mmu_lock is held for
      write, there cannot be contention for the inner spinlock, and marking
      shadow pages unsync across multiple vCPUs will be slow enough that
      bouncing the kvm_arch cacheline should be in the noise.
      
      Note, even though L2 could theoretically be given access to its own EPT
      entries, a nested MMU must hold mmu_lock for write and thus cannot race
      against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
      be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
      that is running with the TDP MMU enabled.  Holding mmu_lock for read also
      prevents the indirect shadow page from being freed.  But as above, keep
      it simple and always take the lock.
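
      Conceptually the rule looks like this (the lock and helper names are
      illustrative, not the exact kernel identifiers):

          /*
           * Writers that only hold mmu_lock for read (TDP MMU page faults) must
           * serialize sp->unsync / unsync_children updates with a dedicated
           * spinlock in kvm->arch.
           */
          spin_lock(&kvm->arch.mmu_unsync_pages_lock);
          if (!sp->unsync)
                  mark_page_unsync(kvm, sp);  /* flips sp->unsync, bumps parents' unsync_children */
          spin_unlock(&kvm->arch.mmu_unsync_pages_lock);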
      
      Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
      effectively disable unsync behavior for nested TDP.  Write protecting leaf
      shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
      VMMs typically don't modify TDP entries, but the same may not hold true for
      non-standard use cases and/or VMMs that are migrating physical pages (from
      L1's perspective).
      
      Alternative #2, the unsync logic could be made thread safe.  In theory,
      simply converting all relevant kvm_mmu_page fields to atomics and using
      atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
      would be required, (b) the code churn would be substantial, and (c) legacy
      shadow paging would incur additional atomic operations in performance
      sensitive paths for no benefit (to legacy shadow paging).
      
      Fixes: a2855afc ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181815.3378104-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 06 August 2021, 2 commits
  7. 05 August 2021, 1 commit
    • KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds · d5aaad6f
      Sean Christopherson authored
      Take a signed 'long' instead of an 'unsigned long' for the number of
      pages to add/subtract to the total number of pages used by the MMU.  This
      fixes a zero-extension bug on 32-bit kernels that effectively corrupts
      the per-cpu counter used by the shrinker.
      
      Per-cpu counters take a signed 64-bit value on both 32-bit and 64-bit
      kernels, whereas kvm_mod_used_mmu_pages() takes an unsigned long and thus
      an unsigned 32-bit value on 32-bit kernels.  As a result, the value used
      to adjust the per-cpu counter is zero-extended (unsigned -> signed), not
      sign-extended (signed -> signed), and so KVM's intended -1 gets morphed to
      4294967295 and effectively corrupts the counter.
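
      A self-contained demonstration of the conversion bug (plain C, not kernel
      code; adjust_bad mirrors the old unsigned-long signature and adjust_good
      the fixed signed one):

          #include <stdio.h>
          #include <stdint.h>

          /* Old shape: 'unsigned long' is 32 bits on a 32-bit build, so -1 becomes
           * 0xffffffff and is zero-extended when added to the 64-bit counter. */
          static void adjust_bad(unsigned long nr, int64_t *counter)
          {
                  *counter += nr;
          }

          /* Fixed shape: a signed 'long' sign-extends, so -1 stays -1. */
          static void adjust_good(long nr, int64_t *counter)
          {
                  *counter += nr;
          }

          int main(void)
          {
                  int64_t used = 10;

                  adjust_bad(-1, &used);
                  printf("bad:  %lld\n", (long long)used);  /* 4294967305 on 32-bit, 9 on 64-bit */

                  used = 10;
                  adjust_good(-1, &used);
                  printf("good: %lld\n", (long long)used);  /* 9 everywhere */
                  return 0;
          }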
      
      This was found by a staggering amount of sheer dumb luck when running
      kvm-unit-tests on a 32-bit KVM build.  The shrinker just happened to kick
      in while running tests and do_shrink_slab() logged an error about trying
      to free a negative number of objects.  The truly lucky part is that the
      kernel just happened to be a slightly stale build, as the shrinker no
      longer yells about negative objects as of commit 18bb473e ("mm:
      vmscan: shrink deferred objects proportional to priority").
      
       vmscan: shrink_slab: mmu_shrink_scan+0x0/0x210 [kvm] negative objects to delete nr=-858993460
      
      Fixes: bc8a3d89 ("kvm: mmu: Fix overflow on kvm mmu page limit calculation")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210804214609.1096003-1-seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 04 August 2021, 3 commits
    • KVM: X86: Optimize zapping rmap · a75b5404
      Peter Xu authored
      Using rmap_get_first() and rmap_remove() to zap a huge rmap list can be
      slow.  The easier way is to traverse the rmap list once, collecting the A/D
      bits and freeing the slots along the way.
      
      Provide a pte_list_destroy() and do exactly that.
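
      In sketch form the walk looks roughly like this (helper names are
      placeholders, not the exact kernel code):

          /*
           * Walk the descriptor chain once: drain each array, let the zap hook
           * harvest accessed/dirty bits, and free the descriptor on the way out,
           * instead of repeated rmap_get_first()/rmap_remove() round trips.
           */
          for (desc = head; desc; desc = next) {
                  next = desc->more;
                  for (i = 0; i < desc->spte_count; i++)
                          zap_one_spte(desc->sptes[i]);   /* placeholder zap hook */
                  free_desc(desc);                        /* placeholder free helper */
          }
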
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220605.26377-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Optimize pte_list_desc with per-array counter · 13236e25
      Peter Xu authored
      Add a counter field to pte_list_desc to simplify the add/remove/loop logic;
      e.g., in most cases we no longer need to loop over the array at all.
      
      This will make more sense once the array size is switched to a larger value;
      with the current size the counter would mostly be wasted space.
      
      Initially I wanted to store a tail pointer at the head of the list so we
      would not need to traverse the list just to push new entries (without the
      counter we traverse both the list and the array).  However, that would
      require slightly more change for little benefit; e.g., once the number of
      entries per array grows, traversing the list is no longer that expensive.
      
      So keep it simple, but still get as much benefit as possible from just these
      few extra lines of change (not to mention the code reads more easily without
      the loops over the arrays).
      
      Using the same test case that forks 500 children and recycles them
      ("./rmap_fork 500" [1]), this patch further speeds up the total fork time by
      about 4%, for a total improvement of about 33% over the vanilla kernel:
      
              Vanilla:      473.90 (+-5.93%)
              3->15 slots:  366.10 (+-4.94%)
              Add counter:  351.00 (+-3.70%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220602.26327-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger · dc1cff96
      Peter Xu authored
      Currently the rmap array element only contains 3 entries.  However, for
      EPT=N there can be many guest pages with tens or even hundreds of rmap
      entries.
      
      A typical distribution for a 6G guest (even when idle) shows this in the
      rmap count statistics:
      
      Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
      Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
      Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
      Level=1G:       32      0       0       0       0       0       0       0       0       0       0
      
      With some more forking, some pages will grow even larger rmap counts.
      
      This patch makes PTE_LIST_EXT bigger so it'll be more efficient for the
      general EPT=N use case: we chase list references less often and the loops
      over PTE_LIST_EXT become slightly more efficient, while the array stays
      small enough that little space is wasted when it is not full.
      
      It should not affect EPT=Y, since EPT normally has only zero or one rmap
      entry per page, so no array is even allocated.
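
      The change itself is essentially a one-liner; the new value of 15 is taken
      from the "3->15 slots" row in the follow-up benchmark (treat the exact
      number as read from that data, not from the diff itself):

          -#define PTE_LIST_EXT 3
          +#define PTE_LIST_EXT 15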
      
      With a test case that forks 500 children and recycles them
      ("./rmap_fork 500" [1]), this patch speeds up fork time by about 29%.
      
          Before: 473.90 (+-5.93%)
          After:  366.10 (+-4.94%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220455.26054-6-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 03 August 2021, 1 commit
  10. 02 August 2021, 7 commits
  11. 15 July 2021, 1 commit
    • KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs · fc9bf2e0
      Sean Christopherson authored
      Ignore "dynamic" host adjustments to the physical address mask when
      generating the masks for guest PTEs, i.e. the guest PA masks.  The host
      physical address space and guest physical address space are two different
      beasts, e.g. even though SEV's C-bit is the same bit location for both
      host and guest, disabling SME in the host (which clears shadow_me_mask)
      does not affect the guest PTE->GPA "translation".
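
      The upshot for guest PTE parsing is roughly this (the macro below reflects
      my reading of the fix; treat the exact name and placement as an assumption):

          /*
           * GFN extraction from *guest* PTEs masks with the architectural 52-bit
           * physical-address limit and is never adjusted by the host's
           * shadow_me_mask.
           */
          #define GUEST_PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE - 1))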
      
      For non-SEV guests, not dropping bits is the correct behavior.  Assuming
      KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits
      that are lost as collateral damage from memory encryption are treated as
      reserved bits, i.e. KVM will never get to the point where it attempts to
      generate a gfn using the affected bits.  And if userspace wants to create
      a bogus vCPU, then userspace gets to deal with the fallout of hardware
      doing odd things with bad GPAs.
      
      For SEV guests, not dropping the C-bit is technically wrong, but it's a
      moot point because KVM can't read SEV guest's page tables in any case
      since they're always encrypted.  Not to mention that the current KVM code
      is also broken since sme_me_mask does not have to be non-zero for SEV to
      be supported by KVM.  The proper fix would be to teach all of KVM to
      correctly handle guest private memory, but that's a task for the future.
      
      Fixes: d0ec49d4 ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
      Cc: stable@vger.kernel.org
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210623230552.4027702-5-seanjc@google.com>
      [Use a new header instead of adding header guards to paging_tmpl.h. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  12. 25 June 2021, 6 commits