1. 16 Aug 2021, 1 commit
  2. 13 Aug 2021, 7 commits
    • KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock · ce25681d
      Sean Christopherson committed
      Add yet another spinlock for the TDP MMU and take it when marking indirect
      shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
      nested TDP, KVM may encounter shadow pages for the TDP entries managed by
      L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
      is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
      misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
      which runs with mmu_lock held for read, not write.
      
      Lack of a critical section manifests most visibly as an underflow of
      unsync_children in clear_unsync_child_bit() due to unsync_children being
      corrupted when multiple CPUs write it without a critical section and
      without atomic operations.  But underflow is the best case scenario.  The
      worst case scenario is that unsync_children prematurely hits '0' and
      leads to guest memory corruption due to KVM neglecting to properly sync
      shadow pages.
      
      Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
      would functionally be ok.  Usurping the lock could degrade performance when
      building upper level page tables on different vCPUs, especially since the
      unsync flow could hold the lock for a comparatively long time depending on
      the number of indirect shadow pages and the depth of the paging tree.
      
      For simplicity, take the lock for all MMUs, even though KVM could fairly
      easily know that mmu_lock is held for write.  If mmu_lock is held for
      write, there cannot be contention for the inner spinlock, and marking
      shadow pages unsync across multiple vCPUs will be slow enough that
      bouncing the kvm_arch cacheline should be in the noise.
      
      Note, even though L2 could theoretically be given access to its own EPT
      entries, a nested MMU must hold mmu_lock for write and thus cannot race
      against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
      be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
      that is running with the TDP MMU enabled.  Holding mmu_lock for read also
      prevents the indirect shadow page from being freed.  But as above, keep
      it simple and always take the lock.
      
      Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
      effectively disable unsync behavior for nested TDP.  Write protecting leaf
      shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
      VMMs typically don't modify TDP entries, but the same may not hold true for
      non-standard use cases and/or VMMs that are migrating physical pages (from
      L1's perspective).
      
      Alternative #2, the unsync logic could be made thread safe.  In theory,
      simply converting all relevant kvm_mmu_page fields to atomics and using
      atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
      would be required, (b) the code churn would be substantial, and (c) legacy
      shadow paging would incur additional atomic operations in performance
      sensitive paths for no benefit (to legacy shadow paging).
      
      Fixes: a2855afc ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181815.3378104-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
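
      The race being closed here is the classic lost update on a plain,
      non-atomic field. Below is a minimal userspace sketch of that hazard
      and of how a spinlock removes it; it uses pthreads and made-up names
      and counts rather than KVM's mmu_lock machinery, so treat it as an
      illustration of the principle, not the actual fix.

        /* Hypothetical sketch, not KVM code: two threads decrement a plain
         * counter, mimicking unsync bookkeeping racing without a critical
         * section.  Build with: gcc -O2 -pthread unsync_race.c
         */
        #include <pthread.h>
        #include <stdio.h>

        static unsigned int unsync_children;    /* plain, non-atomic field */
        static pthread_spinlock_t lock;
        static int use_lock;

        static void *worker(void *arg)
        {
                (void)arg;
                for (int i = 0; i < 500000; i++) {
                        if (use_lock)
                                pthread_spin_lock(&lock);
                        unsync_children--;      /* non-atomic read-modify-write */
                        if (use_lock)
                                pthread_spin_unlock(&lock);
                }
                return NULL;
        }

        int main(void)
        {
                pthread_t a, b;

                pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
                for (use_lock = 0; use_lock <= 1; use_lock++) {
                        unsync_children = 1000000;
                        pthread_create(&a, NULL, worker, NULL);
                        pthread_create(&b, NULL, worker, NULL);
                        pthread_join(a, NULL);
                        pthread_join(b, NULL);
                        /* without the lock, lost updates typically leave a
                         * non-zero remainder; with it, the count hits exactly 0 */
                        printf("%s lock: unsync_children = %u\n",
                               use_lock ? "with" : "without", unsync_children);
                }
                return 0;
        }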
    • KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs · 0103098f
      Sean Christopherson committed
      Set the min_level for the TDP iterator at the root level when zapping all
      SPTEs to optimize the iterator's try_step_down().  Zapping a non-leaf
      SPTE will recursively zap all its children, thus there is no need for the
      iterator to attempt to step down.  This avoids rereading the top-level
      SPTEs after they are zapped by causing try_step_down() to short-circuit.
      
      In most cases, optimizing try_step_down() will be in the noise as the cost
      of zapping SPTEs completely dominates the overall time.  The optimization
      is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
      is zapping in response to multiple memslot updates when userspace is adding
      and removing read-only memslots for option ROMs.  In that case, the task
      doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
      for read and thus can be a noisy neighbor of sorts.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
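
      A toy illustration of the short-circuit, with invented structures and
      a 4-entry "page table" standing in for the real 512-entry tables:
      clamping min_level to the root level means the walk only visits root
      entries and lets the recursive zap handle everything below, instead of
      stepping down into entries the zap would free anyway.

        /* Hypothetical sketch, not the real TDP iterator. */
        #include <stdio.h>
        #include <stdlib.h>

        struct toy_pt {
                struct toy_pt *children[4];     /* non-NULL => present entry */
        };

        static void zap_subtree(struct toy_pt *pt)      /* recursive zap */
        {
                for (int i = 0; i < 4; i++) {
                        if (!pt->children[i])
                                continue;
                        zap_subtree(pt->children[i]);
                        free(pt->children[i]);
                        pt->children[i] = NULL;
                }
        }

        /* level counts down toward leaves; with min_level == root level the
         * walk never steps down, it just zaps each root entry in one go.   */
        static void zap_all(struct toy_pt *pt, int level, int min_level)
        {
                for (int i = 0; i < 4; i++) {
                        struct toy_pt *child = pt->children[i];

                        if (!child)
                                continue;
                        /* try_step_down() analogue: walking into the child first
                         * just re-reads entries the recursive zap handles anyway */
                        if (level - 1 >= min_level)
                                zap_all(child, level - 1, min_level);
                        zap_subtree(child);
                        free(child);
                        pt->children[i] = NULL;
                }
        }

        int main(void)
        {
                struct toy_pt *root = calloc(1, sizeof(*root));

                root->children[0] = calloc(1, sizeof(*root));
                root->children[0]->children[1] = calloc(1, sizeof(*root));
                zap_all(root, 4, 4);            /* min_level == root level */
                printf("root entry 0 after zap: %p\n", (void *)root->children[0]);
                free(root);
                return 0;
        }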
    • KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs · 524a1e4e
      Sean Christopherson committed
      Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
      really zap all SPTEs in this case.  As is, zap_gfn_range() skips non-leaf
      SPTEs whose range exceeds the range to be zapped.  If shadow_phys_bits is
      not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
      paging, the "zap all" flows will skip top-level SPTEs whose range extends
      beyond shadow_phys_bits and leak their SPs when the VM is destroyed.
      
      Use the current upper bound (based on host.MAXPHYADDR) to detect that the
      caller wants to zap all SPTEs, e.g. instead of using the max theoretical
      gfn, 1 << (52 - 12).  The more precise upper bound allows the TDP iterator
      to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.
      
      Add a WARN on kvm->arch.tdp_mmu_pages when the TDP MMU is destroyed to
      help future debuggers should KVM decide to leak SPTEs again.
      
      The bug is most easily reproduced by running (and unloading!) KVM in a
      VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.
      
        =============================================================================
        BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
        Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
        CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack_lvl+0x45/0x59
         slab_err+0x95/0xc9
         __kmem_cache_shutdown.cold+0x3c/0x158
         kmem_cache_destroy+0x3d/0xf0
         kvm_mmu_module_exit+0xa/0x30 [kvm]
         kvm_arch_exit+0x5d/0x90 [kvm]
         kvm_exit+0x78/0x90 [kvm]
         vmx_exit+0x1a/0x50 [kvm_intel]
         __x64_sys_delete_module+0x13f/0x220
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
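
      The boundary condition behind the leak can be shown with a few lines of
      arithmetic. The sketch below is purely illustrative (it uses byte
      addresses rather than GFNs, and the helper is invented): a 4-level
      top-level entry spans 512GiB, so an end bound derived from a small
      host.MAXPHYADDR lands inside the first top-level entry's range, and the
      old "skip entries that reach past the end" logic never zaps it.

        /* Hypothetical sketch of the boundary condition, not KVM code. */
        #include <stdint.h>
        #include <stdio.h>

        #define GiB             (1ull << 30)
        #define TOP_LEVEL_SPAN  (512 * GiB)     /* one root entry, 4-level paging */

        /* mimics the old check: only zap a non-leaf entry if its whole range
         * fits below the requested end, otherwise skip (and leak) it.        */
        static int old_logic_zaps(uint64_t entry_start, uint64_t zap_end)
        {
                return entry_start + TOP_LEVEL_SPAN <= zap_end;
        }

        int main(void)
        {
                uint64_t phys_limit = 128 * GiB;        /* e.g. host.MAXPHYADDR = 37 */

                /* "zap all" bounded by the physical-address limit: entry 0 covers
                 * [0, 512GiB), which reaches past 128GiB, so it gets skipped.    */
                printf("end = phys limit: entry 0 zapped? %d\n",
                       old_logic_zaps(0, phys_limit));

                /* the fix: callers signal "really zap all" with an all-ones end */
                printf("end = ~0ull     : entry 0 zapped? %d\n",
                       old_logic_zaps(0, UINT64_MAX));
                return 0;
        }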
    • KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF · 18712c13
      Sean Christopherson committed
      Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
      in L2 or if the VM-Exit should be forwarded to L1.  The current logic fails
      to account for the case where #PF is intercepted to handle
      guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
      L1.  At best, L1 will complain and inject the #PF back into L2.  At
      worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
      page faults.
      
      Note, while the bug was technically introduced by the commit that added
      support for the MAXPHYADDR madness, the shame is all on commit
      a0c13434 ("KVM: VMX: introduce vmx_need_pf_intercept").
      
      Fixes: 1dbf5d68 ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
      Cc: stable@vger.kernel.org
      Cc: Peter Shier <pshier@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812045615.3167686-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
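
      A rough sketch of the routing decision follows. The types and fields
      are invented stand-ins for the exception bitmap and the MAXPHYADDR
      check; it only illustrates asking "does L0 want this #PF?" before
      "does L1 want it?", not KVM's real exit-reflection code.

        /* Hypothetical sketch, not KVM's nested exit handling. */
        #include <stdbool.h>
        #include <stdio.h>

        struct toy_vcpu {
                bool ept_enabled;
                bool guest_maxphyaddr_lt_host;  /* #PF intercepted for emulation */
                bool l1_intercepts_pf;          /* from L1's exception bitmap */
        };

        /* analogue of vmx_need_pf_intercept(): shadow paging always wants #PF,
         * and with EPT the MAXPHYADDR emulation may still want it.            */
        static bool l0_wants_pf(const struct toy_vcpu *v)
        {
                return !v->ept_enabled || v->guest_maxphyaddr_lt_host;
        }

        static const char *route_pf(const struct toy_vcpu *v)
        {
                if (l0_wants_pf(v))
                        return "handled in L0";
                return v->l1_intercepts_pf ? "reflected to L1" : "handled in L0";
        }

        int main(void)
        {
                struct toy_vcpu v = {
                        .ept_enabled = true,
                        .guest_maxphyaddr_lt_host = true,
                        .l1_intercepts_pf = true,
                };

                /* the pre-fix logic ignored the MAXPHYADDR case and would have
                 * reflected this fault straight to L1                          */
                printf("#PF while running L2: %s\n", route_pf(&v));
                return 0;
        }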
    • kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault · 85aa8889
      Junaid Shahid committed
      When a nested EPT violation/misconfig is injected into the guest,
      the shadow EPT PTEs associated with that address need to be synced.
      This is done by kvm_inject_emulated_page_fault() before it calls
      nested_ept_inject_page_fault(). However, that will only sync the
      shadow EPT PTE associated with the current L1 EPTP. Since the ASID
      is based on the EP4TA rather than the full EPTP, syncing only the
      current EPTP is not enough. The SPTEs associated with any other L1
      EPTPs in the prev_roots cache that share the same EP4TA also need
      to be synced.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Message-Id: <20210806222229.1645356-1-junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
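
      The EP4TA is essentially the EPTP with its attribute bits stripped, so
      a sketch of the matching loop might look like the following. All
      types, masks and sizes are invented for illustration; the real code
      walks the MMU's cached previous roots.

        /* Hypothetical sketch, not the actual KVM structures. */
        #include <stdint.h>
        #include <stdio.h>

        #define EPTP_ATTR_MASK  0xfffull        /* memtype, walk length, A/D, ... */
        #define NR_CACHED_ROOTS 3

        struct toy_root {
                uint64_t eptp;
                int needs_sync;
        };

        static uint64_t ep4ta(uint64_t eptp)
        {
                return eptp & ~EPTP_ATTR_MASK;  /* address bits only */
        }

        /* mark every cached root whose EP4TA matches the faulting EPTP, not
         * just the root whose full EPTP matches                             */
        static void mark_matching_roots(struct toy_root *roots, uint64_t eptp)
        {
                for (int i = 0; i < NR_CACHED_ROOTS; i++)
                        if (ep4ta(roots[i].eptp) == ep4ta(eptp))
                                roots[i].needs_sync = 1;
        }

        int main(void)
        {
                /* roots 0 and 1 share an EP4TA but differ in attribute bits */
                struct toy_root roots[NR_CACHED_ROOTS] = {
                        { 0x100000ull | 0x5e, 0 },
                        { 0x100000ull | 0x1e, 0 },
                        { 0x200000ull | 0x5e, 0 },
                };

                mark_matching_roots(roots, 0x100000ull | 0x5e);
                for (int i = 0; i < NR_CACHED_ROOTS; i++)
                        printf("root %d needs_sync=%d\n", i, roots[i].needs_sync);
                return 0;
        }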
    • KVM: x86: remove dead initialization · ffbe17ca
      Paolo Bonzini committed
      hv_vcpu is initialized again a dozen lines below, and at this
      point vcpu->arch.hyperv is not valid.  Remove the initializer.
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Allow guest to set EFER.NX=1 on non-PAE 32-bit kernels · 1383279c
      Sean Christopherson committed
      Remove an ancient restriction that disallowed exposing EFER.NX to the
      guest if EFER.NX=0 on the host, even if NX is fully supported by the CPU.
      The motivation of the check, added by commit 2cc51560 ("KVM: VMX:
      Avoid saving and restoring msr_efer on lightweight vmexit"), was to rule
      out the case of host.EFER.NX=0 and guest.EFER.NX=1 so that KVM could run
      the guest with the host's EFER.NX and thus avoid context switching EFER
      if the only divergence was the NX bit.
      
      Fast forward to today, and KVM has long since stopped running the guest
      with the host's EFER.NX.  Not only does KVM context switch EFER if
      host.EFER.NX=1 && guest.EFER.NX=0, KVM also forces host.EFER.NX=0 &&
      guest.EFER.NX=1 when using shadow paging (to emulate SMEP).  Furthermore,
      the entire motivation for the restriction was made obsolete over a decade
      ago when Intel added dedicated host and guest EFER fields in the VMCS
      (Nehalem timeframe), which reduced the overhead of context switching EFER
      from 400+ cycles (2 * WRMSR + 1 * RDMSR) to a mere ~2 cycles.
      
      In practice, the removed restriction only affects non-PAE 32-bit kernels,
      as EFER.NX is set during boot if NX is supported and the kernel will use
      PAE paging (32-bit or 64-bit), regardless of whether or not the kernel
      will actually use NX itself (mark PTEs non-executable).
      
      Alternatively and/or complementarily, startup_32_smp() in head_32.S could
      be modified to set EFER.NX=1 regardless of paging mode, thus eliminating
      the scenario where NX is supported but not enabled.  However, that runs
      the risk of breaking non-KVM non-PAE kernels (though the risk is very,
      very low as there are no known EFER.NX errata), and also eliminates an
      easy-to-use mechanism for stressing KVM's handling of guest vs. host EFER
      across nested virtualization transitions.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210805183804.1221554-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 11 Aug 2021, 1 commit
    • KVM: VMX: Use current VMCS to query WAITPKG support for MSR emulation · 7b9cae02
      Sean Christopherson committed
      Use the secondary_exec_controls_get() accessor in vmx_has_waitpkg() to
      effectively get the controls for the current VMCS, as opposed to using
      vmx->secondary_exec_controls, which is the cached value of KVM's desired
      controls for vmcs01 and truly not reflective of any particular VMCS.
      
      While the waitpkg control is not dynamic, i.e. vmcs01 will always hold
      the same waitpkg configuration as vmx->secondary_exec_controls, the same
      does not hold true for vmcs02 if the L1 VMM hides the feature from L2.
      If L1 hides the feature _and_ does not intercept MSR_IA32_UMWAIT_CONTROL,
      L2 could incorrectly read/write L1's virtual MSR instead of taking a #GP.
      
      Fixes: 6e3ba4ab ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210810171952.2758100-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 05 Aug 2021, 1 commit
    • KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds · d5aaad6f
      Sean Christopherson committed
      Take a signed 'long' instead of an 'unsigned long' for the number of
      pages to add/subtract to the total number of pages used by the MMU.  This
      fixes a zero-extension bug on 32-bit kernels that effectively corrupts
      the per-cpu counter used by the shrinker.
      
      Per-cpu counters take a signed 64-bit value on both 32-bit and 64-bit
      kernels, whereas kvm_mod_used_mmu_pages() takes an unsigned long and thus
      an unsigned 32-bit value on 32-bit kernels.  As a result, the value used
      to adjust the per-cpu counter is zero-extended (unsigned -> signed), not
      sign-extended (signed -> signed), and so KVM's intended -1 gets morphed to
      4294967295 and effectively corrupts the counter.
      
      This was found by a staggering amount of sheer dumb luck when running
      kvm-unit-tests on a 32-bit KVM build.  The shrinker just happened to kick
      in while running tests and do_shrink_slab() logged an error about trying
      to free a negative number of objects.  The truly lucky part is that the
      kernel just happened to be a slightly stale build, as the shrinker no
      longer yells about negative objects as of commit 18bb473e ("mm:
      vmscan: shrink deferred objects proportional to priority").
      
       vmscan: shrink_slab: mmu_shrink_scan+0x0/0x210 [kvm] negative objects to delete nr=-858993460
      
      Fixes: bc8a3d89 ("kvm: mmu: Fix overflow on kvm mmu page limit calculation")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210804214609.1096003-1-seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
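
      The widening conversion at the core of the bug can be reproduced in a
      few lines of plain C. The names below are made up, but the types
      mirror the 32-bit situation: 'unsigned long' is 32 bits, while the
      per-cpu counter is a signed 64-bit value.

        /* Standalone demonstration of the zero- vs. sign-extension trap. */
        #include <stdint.h>
        #include <stdio.h>

        static int64_t pages_used;      /* stand-in for the s64 per-cpu counter */

        static void mod_used_pages_buggy(uint32_t nr)   /* "unsigned long" on 32-bit */
        {
                pages_used += nr;       /* zero-extended: -1 becomes 4294967295 */
        }

        static void mod_used_pages_fixed(int32_t nr)    /* signed "long" on 32-bit */
        {
                pages_used += nr;       /* sign-extended: -1 stays -1 */
        }

        int main(void)
        {
                pages_used = 10;
                mod_used_pages_buggy((uint32_t)-1);
                printf("buggy: %lld\n", (long long)pages_used);  /* 4294967305 */

                pages_used = 10;
                mod_used_pages_fixed(-1);
                printf("fixed: %lld\n", (long long)pages_used);  /* 9 */
                return 0;
        }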
  5. 04 Aug 2021, 2 commits
    • KVM: SVM: improve the code readability for ASID management · bb2baeb2
      Mingwei Zhang committed
      KVM's SEV code uses bitmaps to manage ASID states. ASID 0 is always
      skipped because it is never used by a VM, so in the existing code an
      ASID value and its bitmap position always have an 'offset-by-1'
      relationship.
      
      SEV and SEV-ES share the ASID space, so KVM uses a dynamic range
      [min_asid, max_asid] to handle SEV and SEV-ES ASIDs separately.
      
      The existing code mixes ASID values and bitmap positions by using the
      same variable, 'min_asid', for both.
      
      Fix the min_asid usage so that it is consistent with its name, and
      allocate an extra slot for ASID 0 so that each ASID has the same value
      as its bitmap position. Add comments on the ASID bitmap allocation to
      clarify the size change.
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Alper Gun <alpergun@google.com>
      Cc: Dionna Glaze <dionnaglaze@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vipin Sharma <vipinsh@google.com>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Message-Id: <20210802180903.159381-1-mizhang@google.com>
      [Fix up sev_asid_free to also index by ASID, as suggested by Sean
       Christopherson, and use nr_asids in sev_cpu_init. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
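
      A toy model of the resulting layout (sizes and helpers are invented):
      by allocating one extra slot and leaving entry 0 unused, an ASID's
      value is its bitmap index, and no call site needs the old 'asid - 1'
      translation.

        /* Hypothetical sketch, not the SEV ASID allocator. */
        #include <stdio.h>

        #define MAX_ASID   16
        #define NR_ASIDS   (MAX_ASID + 1)       /* +1 so index == ASID, slot 0 unused */

        static unsigned char asid_in_use[NR_ASIDS];

        static int alloc_asid(int min_asid, int max_asid)
        {
                for (int asid = min_asid; asid <= max_asid; asid++) {
                        if (!asid_in_use[asid]) {
                                asid_in_use[asid] = 1;
                                return asid;    /* no off-by-one translation needed */
                        }
                }
                return -1;
        }

        static void free_asid(int asid)
        {
                asid_in_use[asid] = 0;          /* again indexed by the raw ASID */
        }

        int main(void)
        {
                int sev_es_asid = alloc_asid(1, 4);        /* e.g. SEV-ES range */
                int sev_asid    = alloc_asid(5, MAX_ASID); /* e.g. SEV range    */

                printf("SEV-ES ASID %d, SEV ASID %d\n", sev_es_asid, sev_asid);
                free_asid(sev_es_asid);
                free_asid(sev_asid);
                return 0;
        }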
    • KVM: SVM: Fix off-by-one indexing when nullifying last used SEV VMCB · 179c6c27
      Sean Christopherson committed
      Use the raw ASID, not ASID-1, when nullifying the last used VMCB when
      freeing an SEV ASID.  The consumer, pre_sev_run(), indexes the array by
      the raw ASID, thus KVM could get a false negative when checking for a
      different VMCB if KVM manages to reallocate the same ASID+VMCB combo for
      a new VM.
      
      Note, this cannot cause a functional issue _in the current code_, as
      pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and
      last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is
      guaranteed to mismatch on the first VMRUN.  However, prior to commit
      8a14fe4f ("kvm: x86: Move last_cpu into kvm_vcpu_arch as
      last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the
      last_cpu variable.  Thus it's theoretically possible that older versions
      of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID
      and VMCB exactly match those of a prior VM.
      
      Fixes: 70cd94e6 ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled")
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 03 Aug 2021, 3 commits
  7. 30 Jul 2021, 3 commits
  8. 28 Jul 2021, 6 commits
  9. 26 Jul 2021, 3 commits
  10. 17 Jul 2021, 8 commits
  11. 16 Jul 2021, 4 commits
    • arm64: entry: fix KCOV suppression · e6f85cbe
      Mark Rutland committed
      We suppress KCOV for entry.o rather than entry-common.o. As entry.o is
      built from entry.S, this is pointless, and permits instrumentation of
      entry-common.o, which is built from entry-common.c.
      
      Fix the Makefile to suppress KCOV for entry-common.o, as we had intended
      to begin with. I've verified with objdump that this is working as
      expected.
      
      Fixes: bf6fa2c0 ("arm64: entry: don't instrument entry code with KCOV")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20210715123049.9990-1-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: entry: add missing noinstr · 31a7f0f6
      Mark Rutland committed
      We intend that all the early exception handling code is marked as
      `noinstr`, but we forgot this for __el0_error_handler_common(), which is
      called before we have completed entry from user mode. If it were
      instrumented, we could run into problems with RCU, lockdep, etc.
      
      Mark it as `noinstr` to prevent this.
      
      The few other functions in entry-common.c which do not have `noinstr` are
      called once we've completed entry, and are safe to instrument.
      
      Fixes: bb8e93a2 ("arm64: entry: convert SError handlers to C")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Joey Gouly <joey.gouly@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20210714172801.16475-1-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: mte: fix restoration of GCR_EL1 from suspend · 59f44069
      Mark Rutland committed
      Since commit:
      
        bad1e1c6 ("arm64: mte: switch GCR_EL1 in kernel entry and exit")
      
      we saved/restored the user GCR_EL1 value at exception boundaries, and
      update_gcr_el1_excl() is no longer used for this. However it is used to
      restore the kernel's GCR_EL1 value when returning from a suspend state.
      Thus, the comment is misleading (and an ISB is necessary).
      
      When restoring the kernel's GCR value, we need an ISB to ensure this is
      used by subsequent instructions. We don't necessarily get an ISB by
      other means (e.g. if the kernel is built without support for pointer
      authentication). As __cpu_setup() initialised GCR_EL1.Exclude to 0xffff,
      until a context synchronization event, allocation tag 0 may be used
      rather than the desired set of tags.
      
      This patch drops the misleading comment, adds the missing ISB, and for
      clarity folds update_gcr_el1_excl() into its only user.
      
      Fixes: bad1e1c6 ("arm64: mte: switch GCR_EL1 in kernel entry and exit")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20210714143843.56537-2-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: Avoid premature usercopy failure · 295cf156
      Robin Murphy committed
      Al reminds us that the usercopy API must only return complete failure
      if absolutely nothing could be copied. Currently, if userspace does
      something silly like giving us an unaligned pointer to Device memory,
      or a size which overruns MTE tag bounds, we may fail to honour that
      requirement when faulting on a multi-byte access even though a smaller
      access could have succeeded.
      
      Add a mitigation to the fixup routines to fall back to a single-byte
      copy if we faulted on a larger access before anything has been written
      to the destination, to guarantee making *some* forward progress. We
      needn't be too concerned about the overall performance since this should
      only occur when callers are doing something a bit dodgy in the first
      place. Particularly broken userspace might still be able to trick
      generic_perform_write() into an infinite loop by targeting write() at
      an mmap() of some read-only device register where the fault-in load
      succeeds but any store synchronously aborts such that copy_to_user() is
      genuinely unable to make progress, but, well, don't do that...
      
      CC: stable@vger.kernel.org
      Reported-by: Chen Huang <chenhuang5@huawei.com>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Link: https://lore.kernel.org/r/dc03d5c675731a1f24a62417dba5429ad744234e.1626098433.git.robin.murphy@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
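
      A userspace sketch of the fallback strategy, with the fault simulated
      by a helper rather than taken from the MMU (the fault model and all
      names below are invented): if a wide access faults before the copy is
      complete, retry a byte at a time so the caller either sees some
      progress or a genuine total failure.

        /* Hypothetical sketch, not the arm64 usercopy fixup code. */
        #include <stddef.h>
        #include <stdio.h>
        #include <string.h>

        /* pretend the source faults at this offset, e.g. an MTE tag mismatch */
        #define FAULT_OFFSET 5

        static int copy_chunk(char *dst, const char *src, size_t off, size_t len)
        {
                if (off + len > FAULT_OFFSET)
                        return -1;              /* simulated fault on this access */
                memcpy(dst + off, src + off, len);
                return 0;
        }

        static size_t copy_with_fallback(char *dst, const char *src, size_t n)
        {
                size_t done = 0;

                /* fast path: wide accesses */
                while (done + 8 <= n && !copy_chunk(dst, src, done, 8))
                        done += 8;

                /* fixup path: fall back to single bytes so we never report zero
                 * progress when at least one byte was copyable                 */
                while (done < n && !copy_chunk(dst, src, done, 1))
                        done++;

                /* bytes actually copied; the real usercopy helpers return the
                 * number of bytes *not* copied                                 */
                return done;
        }

        int main(void)
        {
                char src[16] = "hello, usercopy";
                char dst[16] = { 0 };

                size_t copied = copy_with_fallback(dst, src, sizeof(src));
                printf("copied %zu of %zu bytes\n", copied, sizeof(src));
                return 0;
        }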
  12. 15 Jul 2021, 1 commit