提交 · be911771330a2ae0938d452fd89a8df085533134 · openeuler / Kernel

24 6月, 2022 10 次提交

KVM: x86/mmu: Move guest PT write-protection to account_shadowed() · be911771

由 David Matlack 提交于 6月 22, 2022

Move the code that write-protects newly-shadowed guest page tables into
account_shadowed(). This avoids a extra gfn-to-memslot lookup and is a
more logical place for this code to live. But most importantly, this
reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
kvm_vcpu pointer, which will be necessary when creating new shadow pages
during VM ioctls for eager page splitting.

Note, it is safe to drop the role.level == PG_LEVEL_4K check since
account_shadowed() returns early if role.level > PG_LEVEL_4K.

No functional change intended.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-10-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

be911771

KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pages · 87654643

由 David Matlack 提交于 6月 22, 2022

Rename 2 functions:

  kvm_mmu_get_page() -> kvm_mmu_get_shadow_page()
  kvm_mmu_free_page() -> kvm_mmu_free_shadow_page()

This change makes it clear that these functions deal with shadow pages
rather than struct pages. It also aligns these functions with the naming
scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page().

Prefer "shadow_page" over the shorter "sp" since these are core
functions and the line lengths aren't terrible.

No functional change intended.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-9-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

87654643

KVM: x86/mmu: Consolidate shadow page allocation and initialization · c306aec8

由 David Matlack 提交于 6月 22, 2022

Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under
the latter so that all shadow page allocation and initialization happens
in one place.

No functional change intended.
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-8-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c306aec8

KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functions · 94c81364

由 David Matlack 提交于 6月 22, 2022

Decompose kvm_mmu_get_page() into separate helper functions to increase
readability and prepare for allocating shadow pages without a vcpu
pointer.

Specifically, pull the guts of kvm_mmu_get_page() into 2 helper
functions:

kvm_mmu_find_shadow_page() -
  Walks the page hash checking for any existing mmu pages that match the
  given gfn and role.

kvm_mmu_alloc_shadow_page()
  Allocates and initializes an entirely new kvm_mmu_page. This currently
  requries a vcpu pointer for allocation and looking up the memslot but
  that will be removed in a future commit.

No functional change intended.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-7-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

94c81364

KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytes · 7f497775

由 David Matlack 提交于 6月 22, 2022

The quadrant is only used when gptes are 4 bytes, but
mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE
page directories regardless. Make this less confusing by only passing in
a non-zero quadrant when it is actually necessary.
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-6-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7f497775

KVM: x86/mmu: Derive shadow MMU page role from parent · 2e65e842

由 David Matlack 提交于 6月 22, 2022

Instead of computing the shadow page role from scratch for every new
page, derive most of the information from the parent shadow page.  This
eliminates the dependency on the vCPU root role to allocate shadow page
tables, and reduces the number of parameters to kvm_mmu_get_page().

Preemptively split out the role calculation to a separate function for
use in a following commit.

Note that when calculating the MMU root role, we can take
@role.passthrough, @role.direct, and @role.access directly from
@vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still
must be overridden for PAE page directories, when shadowing 32-bit
guest page tables with PAE page tables.

No functional change intended.
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-5-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2e65e842

KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root() · 86938ab6

由 David Matlack 提交于 6月 22, 2022

The "direct" argument is vcpu->arch.mmu->root_role.direct,
because unlike non-root page tables, it's impossible to have
a direct root in an indirect MMU.  So just use that.
Suggested-by: NLai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-4-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

86938ab6

KVM: x86/mmu: Use a bool for direct · 27a59d57

由 David Matlack 提交于 6月 22, 2022

The parameter "direct" can either be true or false, and all of the
callers pass in a bool variable or true/false literal, so just use the
type bool.

No functional change intended.
Reviewed-by: NLai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-3-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

27a59d57

KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPs · bb924ca6

由 David Matlack 提交于 6月 22, 2022

Commit fb58a9c3 ("KVM: x86/mmu: Optimize MMU page cache lookup for
fully direct MMUs") skipped the unsync checks and write flood clearing
for full direct MMUs. We can extend this further to skip the checks for
all direct shadow pages. Direct shadow pages in indirect MMUs (i.e.
shadow paging) are used when shadowing a guest huge page with smaller
pages. Such direct shadow pages, like their counterparts in fully direct
MMUs, are never marked unsynced or have a non-zero write-flooding count.

Checking sp->role.direct also generates better code than checking
direct_map because, due to register pressure, direct_map has to get
shoved onto the stack and then pulled back off.

No functional change intended.
Reviewed-by: NLai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-2-dmatlack@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bb924ca6

KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis · 084cc29f

由 Ben Gardon 提交于 6月 13, 2022

In some cases, the NX hugepage mitigation for iTLB multihit is not
needed for all guests on a host. Allow disabling the mitigation on a
per-VM basis to avoid the performance hit of NX hugepages on trusted
workloads.

In order to disable NX hugepages on a VM, ensure that the userspace
actor has permission to reboot the system. Since disabling NX hugepages
would allow a guest to crash the system, it is similar to reboot
permissions.

Ideally, KVM would require userspace to prove it has access to KVM's
nx_huge_pages module param, e.g. so that userspace can opt out without
needing full reboot permissions.  But getting access to the module param
file info is difficult because it is buried in layers of sysfs and module
glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases.
Suggested-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NDavid Matlack <dmatlack@google.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NBen Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-9-bgardon@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

084cc29f

20 6月, 2022 9 次提交

KVM: x86/mmu: Shove refcounted page dependency into host_pfn_mapping_level() · 5d49f08c

由 Sean Christopherson 提交于 4月 29, 2022

Move the check that restricts mapping huge pages into the guest to pfns
that are backed by refcounted 'struct page' memory into the helper that
actually "requires" a 'struct page', host_pfn_mapping_level().  In
addition to deduplicating code, moving the check to the helper eliminates
the subtle requirement that the caller check that the incoming pfn is
backed by a refcounted struct page, and as an added bonus avoids an extra
pfn_to_page() lookup.

Note, the is_error_noslot_pfn() check in kvm_mmu_hugepage_adjust() needs
to stay where it is, as it guards against dereferencing a NULL memslot in
the kvm_slot_dirty_track_enabled() that follows.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-11-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5d49f08c

KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() · b14b2690

由 Sean Christopherson 提交于 4月 29, 2022

Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()
to better reflect what KVM is actually checking, and to eliminate extra
pfn_to_page() lookups.  The kvm_release_pfn_*() an kvm_try_get_pfn()
helpers in particular benefit from "refouncted" nomenclature, as it's not
all that obvious why KVM needs to get/put refcounts for some PG_reserved
pages (ZERO_PAGE and ZONE_DEVICE).

Add a comment to call out that the list of exceptions to PG_reserved is
all but guaranteed to be incomplete.  The list has mostly been compiled
by people throwing noodles at KVM and finding out they stick a little too
well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages
didn't get freed.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-10-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b14b2690

KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page() · 284dc493

由 Sean Christopherson 提交于 4月 29, 2022

Operate on a 'struct page' instead of a pfn when checking if a page is a
ZONE_DEVICE page, and rename the helper accordingly. Generally speaking,
KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do
anything special for ZONE_DEVICE memory. Rather, KVM wants to treat
ZONE_DEVICE memory like regular memory, and the need to identify
ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In
other words, KVM should only ever check for ZONE_DEVICE memory after KVM
has already verified that there is a struct page associated with the pfn.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-9-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

284dc493

KVM: x86/mmu: Use common logic for computing the 32/64-bit base PA mask · 70e41c31

由 Sean Christopherson 提交于 6月 14, 2022

Use common logic for computing PT_BASE_ADDR_MASK for 32-bit, 64-bit, and
EPT paging.  Both PAGE_MASK and the new-common logic are supsersets of
what is actually needed for 32-bit paging.  PAGE_MASK sets bits 63:12 and
the former GUEST_PT64_BASE_ADDR_MASK sets bits 51:12, so regardless of
which value is used, the result will always be bits 31:12.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-9-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

70e41c31

KVM: x86/mmu: Truncate paging32's PT_BASE_ADDR_MASK to 32 bits · f7384b88

由 Sean Christopherson 提交于 6月 14, 2022

Truncate paging32's PT_BASE_ADDR_MASK to a pt_element_t, i.e. to 32 bits.
Ignoring PSE huge pages, the mask is only used in conjunction with gPTEs,
which are 32 bits, and so the address is limited to bits 31:12.

PSE huge pages encoded PA bits 39:32 in PTE bits 20:13, i.e. need custom
logic to handle their funky encoding regardless of PT_BASE_ADDR_MASK.

Note, PT_LVL_OFFSET_MASK is somewhat confusing in that it computes the
offset of the _gfn_, not of the gpa, i.e. not having bits 63:32 set in
PT_BASE_ADDR_MASK is again correct.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-8-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f7384b88

KVM: x86/mmu: Use common macros to compute 32/64-bit paging masks · f6b8ea6d

由 Paolo Bonzini 提交于 6月 15, 2022

Dedup the code for generating (most of) the per-type PT_* masks in
paging_tmpl.h.  The relevant macros only vary based on the number of bits
per level, and that smidge of info is already provided in a common form
as PT_LEVEL_BITS.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-7-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f6b8ea6d

KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEs · 2ca3129e

由 Sean Christopherson 提交于 6月 14, 2022

Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs
(PT64).  SPTE and PT64 are _mostly_ the same, but the few differences are
quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and
guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is
very much specific to SPTEs.

Opportunistically (and temporarily) move most guest macros into paging.h
to clearly associate them with shadow paging, and to ensure that they're
not used as of this commit.  A future patch will eliminate them entirely.

Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's
needed for the quadrant calculation in kvm_mmu_get_page().  The quadrant
calculation is hot enough (when using shadow paging with 32-bit guests)
that adding a per-context helper is undesirable, and burying the
computation in paging_tmpl.h with a forward declaration isn't exactly an
improvement.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-6-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2ca3129e

KVM: x86/mmu: Dedup macros for computing various page table masks · 42c88ff8

由 Sean Christopherson 提交于 6月 14, 2022

Provide common helper macros to generate various masks, shifts, etc...
for 32-bit vs. 64-bit page tables.  Only the inputs differ, the actual
calculations are identical.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

42c88ff8

KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.h · b3fcdb04

由 Sean Christopherson 提交于 6月 14, 2022

Move a handful of one-off macros and helpers for 32-bit PSE paging into
paging_tmpl.h and hide them behind "PTTYPE == 32".  Under no circumstance
should anything but 32-bit shadow paging care about PSE paging.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b3fcdb04

15 6月, 2022 5 次提交

KVM: x86/mmu: Use try_cmpxchg64 in fast_pf_fix_direct_spte · 2db2f46f

由 Uros Bizjak 提交于 5月 20, 2022

Use try_cmpxchg64 instead of cmpxchg64 (*ptr, old, new) != old in
fast_pf_fix_direct_spte. cmpxchg returns success in ZF flag, so this
change saves a compare after cmpxchg (and related move instruction
in front of cmpxchg).
Signed-off-by: NUros Bizjak <ubizjak@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Message-Id: <20220520144635.63134-1-ubizjak@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2db2f46f

KVM: x86/mmu: Use try_cmpxchg64 in tdp_mmu_set_spte_atomic · aee98a68

由 Uros Bizjak 提交于 5月 18, 2022

Use try_cmpxchg64 instead of cmpxchg64 (*ptr, old, new) != old in
tdp_mmu_set_spte_atomic.  cmpxchg returns success in ZF flag, so this
change saves a compare after cmpxchg (and related move instruction
in front of cmpxchg). Also, remove explicit assignment to iter->old_spte
when cmpxchg fails, this is what try_cmpxchg does implicitly.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: NUros Bizjak <ubizjak@gmail.com>
Reviewed-by: NDavid Matlack <dmatlack@google.com>
Message-Id: <20220518135111.3535-1-ubizjak@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

aee98a68

KVM: x86/mmu: Drop unused CMPXCHG macro from paging_tmpl.h · 007a369f

由 Sean Christopherson 提交于 6月 13, 2022

Drop the CMPXCHG macro from paging_tmpl.h, it's no longer used now that
KVM uses a common uaccess helper to do 8-byte CMPXCHG.

Fixes: f122dfe4 ("KVM: x86: Use __try_cmpxchg_user() to update guest PTE A/D bits")
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220613225723.2734132-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

007a369f

KVM: X86/MMU: Remove useless mmu_topup_memory_caches() in kvm_mmu_pte_write() · 024c3c33

由 Lai Jiangshan 提交于 6月 05, 2022

Since the commit c5e2184d("KVM: x86/mmu: Remove the defunct
update_pte() paging hook"), kvm_mmu_pte_write() no longer uses the rmap
cache.

So remove mmu_topup_memory_caches() in it.

Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220605063417.308311-6-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

024c3c33

KVM: X86/MMU: Remove unused PT32_DIR_BASE_ADDR_MASK from mmu.c · fc10020a

由 Lai Jiangshan 提交于 6月 05, 2022

It is unused.
Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20220605063417.308311-3-jiangshanlai@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fc10020a

09 6月, 2022 1 次提交

KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs · d2263de1

由 Yuan Yao 提交于 6月 08, 2022

Assign shadow_me_value, not shadow_me_mask, to PAE root entries,
a.k.a. shadow PDPTRs, when host memory encryption is supported.  The
"mask" is the set of all possible memory encryption bits, e.g. MKTME
KeyIDs, whereas "value" holds the actual value that needs to be
stuffed into host page tables.

Using shadow_me_mask results in a failed VM-Entry due to setting
reserved PA bits in the PDPTRs, and ultimately causes an OOPS due to
physical addresses with non-zero MKTME bits sending to_shadow_page()
into the weeds:

set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
BUG: unable to handle page fault for address: ffd43f00063049e8
PGD 86dfd8067 P4D 0
Oops: 0000 [#1] PREEMPT SMP
RIP: 0010:mmu_free_root_page+0x3c/0x90 [kvm]
 kvm_mmu_free_roots+0xd1/0x200 [kvm]
 __kvm_mmu_unload+0x29/0x70 [kvm]
 kvm_mmu_unload+0x13/0x20 [kvm]
 kvm_arch_destroy_vm+0x8a/0x190 [kvm]
 kvm_put_kvm+0x197/0x2d0 [kvm]
 kvm_vm_release+0x21/0x30 [kvm]
 __fput+0x8e/0x260
 ____fput+0xe/0x10
 task_work_run+0x6f/0xb0
 do_exit+0x327/0xa90
 do_group_exit+0x35/0xa0
 get_signal+0x911/0x930
 arch_do_signal_or_restart+0x37/0x720
 exit_to_user_mode_prepare+0xb2/0x140
 syscall_exit_to_user_mode+0x16/0x30
 do_syscall_64+0x4e/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: e54f1ff2 ("KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask")
Signed-off-by: NYuan Yao <yuan.yao@intel.com>
Reviewed-by: NKai Huang <kai.huang@intel.com>
Message-Id: <20220608012015.19566-1-yuan.yao@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d2263de1

08 6月, 2022 2 次提交

KVM: x86/mmu: Comment FNAME(sync_page) to document TLB flushing logic · b8b9156e

由 Sean Christopherson 提交于 5月 13, 2022

Add a comment to FNAME(sync_page) to explain why the TLB flushing logic
conspiculously doesn't handle the scenario of guest protections being
reduced. Specifically, if synchronizing a SPTE drops execute protections,
KVM will not emit a TLB flush, whereas dropping writable or clearing A/D
bits does trigger a flush via mmu_spte_update(). Architecturally, until
the GPTE is implicitly or explicitly flushed from the guest's perspective,
KVM is not required to flush any old, stale translations.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Message-Id: <20220513195000.99371-3-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b8b9156e

KVM: x86/mmu: Drop RWX=0 SPTEs during ept_sync_page() · 9fb35657

由 Sean Christopherson 提交于 5月 13, 2022

All of sync_page()'s existing checks filter out only !PRESENT gPTE,
because without execute-only, all upper levels are guaranteed to be at
least READABLE. However, if EPT with execute-only support is in use by
L1, KVM can create an SPTE that is shadow-present but guest-inaccessible
(RWX=0) if the upper level combined permissions are R (or RW) and
the leaf EPTE is changed from R (or RW) to X. Because the EPTE is
considered present when viewed in isolation, and no reserved bits are set,
FNAME(prefetch_invalid_gpte) will consider the GPTE valid, and cause a
not-present SPTE to be created.

The SPTE is "correct": the guest translation is inaccessible because
the combined protections of all levels yield RWX=0, and KVM will just
redirect any vmexits to the guest. If EPT A/D bits are disabled, KVM
can mistake the SPTE for an access-tracked SPTE, but again such confusion
isn't fatal, as the "saved" protections are also RWX=0. However,
creating a useless SPTE in general means that KVM messed up something,
even if this particular goof didn't manifest as a functional bug.
So, drop SPTEs whose new protections will yield a RWX=0 SPTE, and
add a WARN in make_spte() to detect creation of SPTEs that will
result in RWX=0 protections.

Fixes: d95c5568 ("kvm: mmu: track read permission explicitly for shadow EPT page tables")
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220513195000.99371-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9fb35657

07 6月, 2022 2 次提交

KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging · 5ba7c4c6

由 Ben Gardon 提交于 5月 25, 2022

Currently disabling dirty logging with the TDP MMU is extremely slow.
On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds
to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with
the shadow MMU.

When disabling dirty logging, zap non-leaf parent entries to allow
replacement with huge pages instead of recursing and zapping all of the
child, leaf entries. This reduces the number of TLB flushes required.
and reduces the disable dirty log time with the TDP MMU to ~3 seconds.

Opportunistically add a WARN() to catch GFNs that are mapped at a
higher level than their max level.
Signed-off-by: NBen Gardon <bgardon@google.com>
Message-Id: <20220525230904.1584480-1-bgardon@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5ba7c4c6

KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots() · cf4a8693

由 Shaoqin Huang 提交于 6月 06, 2022

When freeing obsolete previous roots, check prev_roots as intended, not
the current root.
Signed-off-by: NShaoqin Huang <shaoqin.huang@intel.com>
Fixes: 527d5cd7 ("KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped")
Message-Id: <20220607005905.2933378-1-shaoqin.huang@intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cf4a8693

21 5月, 2022 1 次提交

KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID · 9f46c187

由 Paolo Bonzini 提交于 5月 20, 2022

With shadow paging enabled, the INVPCID instruction results in a call
to kvm_mmu_invpcid_gva.  If INVPCID is executed with CR0.PG=0, the
invlpg callback is not set and the result is a NULL pointer dereference.
Fix it trivially by checking for mmu->invlpg before every call.

There are other possibilities:

- check for CR0.PG, because KVM (like all Intel processors after P5)
  flushes guest TLB on CR0.PG changes so that INVPCID/INVLPG are a
  nop with paging disabled

- check for EFER.LMA, because KVM syncs and flushes when switching
  MMU contexts outside of 64-bit mode

All of these are tricky, go for the simple solution.  This is CVE-2022-1789.
Reported-by: NYongkang Jia <kangel@zju.edu.cn>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9f46c187

12 5月, 2022 10 次提交

KVM: x86/mmu: Update number of zapped pages even if page list is stable · b28cb0cd

由 Sean Christopherson 提交于 5月 11, 2022

When zapping obsolete pages, update the running count of zapped pages
regardless of whether or not the list has become unstable due to zapping
a shadow page with its own child shadow pages.  If the VM is backed by
mostly 4kb pages, KVM can zap an absurd number of SPTEs without bumping
the batch count and thus without yielding.  In the worst case scenario,
this can cause a soft lokcup.

 watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [dirty_log_perf_:13020]
   RIP: 0010:workingset_activation+0x19/0x130
   mark_page_accessed+0x266/0x2e0
   kvm_set_pfn_accessed+0x31/0x40
   mmu_spte_clear_track_bits+0x136/0x1c0
   drop_spte+0x1a/0xc0
   mmu_page_zap_pte+0xef/0x120
   __kvm_mmu_prepare_zap_page+0x205/0x5e0
   kvm_mmu_zap_all_fast+0xd7/0x190
   kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
   kvm_page_track_flush_slot+0x5c/0x80
   kvm_arch_flush_shadow_memslot+0xe/0x10
   kvm_set_memslot+0x1a8/0x5d0
   __kvm_set_memory_region+0x337/0x590
   kvm_vm_ioctl+0xb08/0x1040

Fixes: fbb158cb ("KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""")
Reported-by: NDavid Matlack <dmatlack@google.com>
Reviewed-by: NBen Gardon <bgardon@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220511145122.3133334-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b28cb0cd

KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmaps · 6ba1e04f

由 Vipin Sharma 提交于 5月 02, 2022

Avoid calling handlers on empty rmap entries and skip to the next non
empty rmap entry.

Empty rmap entries are noop in handlers.
Signed-off-by: NVipin Sharma <vipinsh@google.com>
Suggested-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220502220347.174664-1-vipinsh@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6ba1e04f

KVM: VMX: Include MKTME KeyID bits in shadow_zero_check · 3c5c3245

由 Kai Huang 提交于 4月 19, 2022

Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should
never be set to SPTE. Set shadow_me_value to 0 and shadow_me_mask to
include all MKTME KeyID bits to include them to shadow_zero_check.
Signed-off-by: NKai Huang <kai.huang@intel.com>
Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3c5c3245

KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask · e54f1ff2

由 Kai Huang 提交于 4月 19, 2022

Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of
high bits of physical address bits as 'KeyID' bits.  Intel Trust Domain
Extentions (TDX) further steals part of MKTME KeyID bits as TDX private
KeyID bits.  TDX private KeyID bits cannot be set in any mapping in the
host kernel since they can only be accessed by software running inside a
new CPU isolated mode.  And unlike to AMD's SME, host kernel doesn't set
any legacy MKTME KeyID bits to any mapping either.  Therefore, it's not
legitimate for KVM to set any KeyID bits in SPTE which maps guest
memory.

KVM maintains shadow_zero_check bits to represent which bits must be
zero for SPTE which maps guest memory.  MKTME KeyID bits should be set
to shadow_zero_check.  Currently, shadow_me_mask is used by AMD to set
the sme_me_mask to SPTE, and shadow_me_shadow is excluded from
shadow_zero_check.  So initializing shadow_me_mask to represent all
MKTME keyID bits doesn't work for VMX (as oppositely, they must be set
to shadow_zero_check).

Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
and repurpose shadow_me_mask as 'all possible memory encryption bits'.
The new schematic of them will be:

 - shadow_me_value: the memory encryption bit(s) that will be set to the
   SPTE (the original shadow_me_mask).
 - shadow_me_mask: all possible memory encryption bits (which is a super
   set of shadow_me_value).
 - For now, shadow_me_value is supposed to be set by SVM and VMX
   respectively, and it is a constant during KVM's life time.  This
   perhaps doesn't fit MKTME but for now host kernel doesn't support it
   (and perhaps will never do).
 - Bits in shadow_me_mask are set to shadow_zero_check, except the bits
   in shadow_me_value.

Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
Replace shadow_me_mask with shadow_me_value in almost all code paths,
except the one in PT64_PERM_MASK, which is used by need_remote_flush()
to determine whether remote TLB flush is needed.  This should still use
shadow_me_mask as any encryption bit change should need a TLB flush.
And for AMD, move initializing shadow_me_value/shadow_me_mask from
kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
Signed-off-by: NKai Huang <kai.huang@intel.com>
Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e54f1ff2

KVM: x86/mmu: Rename reset_rsvds_bits_mask() · c919e881

由 Kai Huang 提交于 4月 19, 2022

Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make
it clearer that it resets the reserved bits check for guest's page table
entries.
Signed-off-by: NKai Huang <kai.huang@intel.com>
Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c919e881

KVM: x86/mmu: Expand and clean up page fault stats · 1075d41e

由 Sean Christopherson 提交于 4月 23, 2022

Expand and clean up the page fault stats. The current stats are at best
incomplete, and at worst misleading. Differentiate between faults that
are actually fixed vs those that result in an MMIO SPTE being created,
track faults that are spurious, faults that trigger emulation, faults
that that are fixed in the fast path, and last but not least, track the
number of faults that are taken.

Note, the number of faults that require emulation for write-protected
shadow pages can roughly be calculated by subtracting the number of MMIO
SPTEs created from the overall number of faults that trigger emulation.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-10-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1075d41e

KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults · 8d5265b1

由 Sean Christopherson 提交于 4月 23, 2022

Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast
path for TDP page faults.  The generated code is identical, and the #ifdef
makes it dangerously difficult to extend the logic (guess who forgot to
add an "else" inside the #ifdef and ran through the page fault handler
twice).

No functional or binary change intented.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-9-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8d5265b1

KVM: x86/mmu: Make all page fault handlers internal to the MMU · 8a009d5b

由 Sean Christopherson 提交于 4月 23, 2022

Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
of the page fault handling collateral that was in mmu.h purely for the
async #PF handler into mmu_internal.h, where it belongs.  This will allow
kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
expose those enums outside of the MMU.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-8-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8a009d5b

KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" · 5276c616

由 Sean Christopherson 提交于 4月 23, 2022

Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
kvm_faultin_pfn() to signal that the page fault handler should continue
doing its thing.  Aside from being gross and inefficient, using a boolean
return to signal continue vs. stop makes it extremely difficult to add
more helpers and/or move existing code to a helper.

E.g. hypothetically, if nested MMUs were to gain a separate page fault
handler in the future, everything up to the "is self-modifying PTE" check
can be shared by all shadow MMUs, but communicating up the stack whether
to continue on or stop becomes a nightmare.

More concretely, proposed support for private guest memory ran into a
similar issue, where it'll be forced to forego a helper in order to yield
sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.

No functional change intended.

Cc: David Matlack <dmatlack@google.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-7-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5276c616

KVM: x86/mmu: Drop exec/NX check from "page fault can be fast" · 5c64aba5

由 Sean Christopherson 提交于 4月 23, 2022

Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
faults in the access tracking case, and drop the exec/NX check that
becomes redundant as a result. No sane hardware will generate an access
that is both an instruct fetch and a write, i.e. it's a waste of cycles.
If hardware goes off the rails, or KVM runs under a misguided hypervisor,
spuriously running throught fast path is benign (KVM has been uknowingly
being doing exactly that for years).
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-6-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5c64aba5

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功