- 10 Jul 2020, 10 commits
-
-
Submitted by Sean Christopherson
Set __GFP_ZERO for the shadow page memory cache and drop the explicit clear_page() from kvm_mmu_get_page(). This moves the cost of zeroing a page to the allocation time of the physical page, i.e. when topping up the memory caches, and thus avoids having to zero out an entire page while holding mmu_lock. Cc: Peter Feiner <pfeiner@google.com> Cc: Peter Shier <pshier@google.com> Cc: Junaid Shahid <junaids@google.com> Cc: Jim Mattson <jmattson@google.com> Suggested-by: Ben Gardon <bgardon@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-12-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
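A condensed before/after sketch of what this entry describes (illustrative; the real kvm_mmu_get_page() does much more):

```c
/* Before: the shadow page is zeroed while mmu_lock is held. */
sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
clear_page(sp->spt);

/* After: the cache carries __GFP_ZERO, so pages arrive pre-zeroed at
 * topup time and the clear_page() is simply dropped. */
vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO; /* at vCPU init */
sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
```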
-
Submitted by Sean Christopherson
Add a gfp_zero flag to 'struct kvm_mmu_memory_cache' and use it to control __GFP_ZERO instead of hardcoding a call to kmem_cache_zalloc(). A future patch needs such a flag for the __get_free_page() path, as gfn arrays do not need/want the allocator to zero the memory. Convert the kmem_cache paths to __GFP_ZERO now so as to avoid a weird and inconsistent API in the future. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
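A sketch of the flag and how it folds into allocation, assuming the struct layout this series converges on (simplified):

```c
struct kvm_mmu_memory_cache {
	int nobjs;
	gfp_t gfp_zero;                /* __GFP_ZERO or 0, chosen at cache init */
	struct kmem_cache *kmem_cache; /* NULL for the __get_free_page() path */
	void *objects[40];
};

static void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
					gfp_t gfp_flags)
{
	/* Zeroing becomes an opt-in gfp flag rather than kmem_cache_zalloc(). */
	gfp_flags |= mc->gfp_zero;

	if (mc->kmem_cache)
		return kmem_cache_alloc(mc->kmem_cache, gfp_flags);

	return (void *)__get_free_page(gfp_flags);
}
```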
-
Submitted by Sean Christopherson
Use separate caches for allocating shadow pages versus gfn arrays. This sets the stage for specifying __GFP_ZERO when allocating shadow pages without incurring extra cost for gfn arrays. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Clean up the minimums in mmu_topup_memory_caches() to document the driving mechanisms behind the minimums. Now that encountering an empty cache is unlikely to trigger BUG_ON(), it is less dangerous to be more precise when defining the minimums. For rmaps, the logic is 1 parent PTE per level, plus a single rmap, and prefetched rmaps. The extra objects in the current '8 + PREFETCH' minimum came about due to an abundance of paranoia in commit c41ef344 ("KVM: MMU: increase per-vcpu rmap cache alloc size"), i.e. it could have increased the minimum to 2 rmaps. Furthermore, the unexpected extra rmap case was killed off entirely by commits f759e2b4 ("KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write") and f5a1e9f8 ("KVM: MMU: remove call to kvm_mmu_pte_write from walk_addr"). For the so-called page cache, replace '8' with 2*PT64_ROOT_MAX_LEVEL. The 2x multiplier is needed because the cache is used for both shadow pages and gfn arrays for indirect MMUs. And finally, for page headers, replace '4' with PT64_ROOT_MAX_LEVEL. Note, KVM now supports 5-level paging, i.e. the old minimums that used a baseline derived from 4-level paging were technically wrong. But, KVM always allocates roots in a separate flow, e.g. it's impossible in the current implementation to actually need 5 new shadow pages in a single flow. Use PT64_ROOT_MAX_LEVEL unmodified instead of subtracting 1, as the direct usage is likely more intuitive to uninformed readers, and the inflated minimum is unlikely to affect functionality in practice. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
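The resulting minimums, roughly (cache and constant names as used in this series; illustrative):

```c
static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
{
	int r;

	/* 1 rmap plus 1 parent PTE per level, plus the prefetched rmaps. */
	r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
				   1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
	if (r)
		return r;
	/* 2x: the "page cache" backs both shadow pages and gfn arrays. */
	r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_cache,
				   2 * PT64_ROOT_MAX_LEVEL);
	if (r)
		return r;
	/* One page header per level. */
	return mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
				      PT64_ROOT_MAX_LEVEL);
}
```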
-
Submitted by Sean Christopherson
Avoid refilling the memory caches and potentially slow reclaim/swap when handling a fast page fault, which does not need to allocate any new objects. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
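A sketch of the reordering in the direct fault path (condensed):

```c
	/* Fast path first: it allocates nothing, so skip the topup. */
	if (fast_page_fault(vcpu, gpa, error_code))
		return RET_PF_RETRY;

	/* Only slow-path faults, which may allocate, pay for the refill. */
	r = mmu_topup_memory_caches(vcpu);
	if (r)
		return r;
```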
-
Submitted by Sean Christopherson
Attempt to allocate a new object instead of crashing KVM (and likely the kernel) if a memory cache is unexpectedly empty. Use GFP_ATOMIC for the allocation as the caches are used while holding mmu_lock. The immediate BUG_ON() makes the code unnecessarily explosive and led to confusing minimums being used in the past, e.g. allocating 4 objects where 1 would suffice. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
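A sketch of the fallback path, using mmu_memory_cache_alloc_obj() as sketched earlier:

```c
static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
{
	void *p;

	if (WARN_ON(!mc->nobjs))
		/* mmu_lock is held, so the allocation must not sleep. */
		p = mmu_memory_cache_alloc_obj(mc, GFP_ATOMIC | __GFP_ACCOUNT);
	else
		p = mc->objects[--mc->nobjs];
	BUG_ON(!p); /* truly out of memory, with no fallback left */
	return p;
}
```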
-
Submitted by Sean Christopherson
Return errors directly from mmu_topup_memory_caches() instead of branching to a label that does the same. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Use "mc" for local variables to shorten line lengths and provide consistent names, which will be especially helpful when some of the helpers are moved to common KVM code in future patches. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Drop the "page" variants of the topup/free memory cache helpers, using the existence of an associated kmem_cache to select the correct alloc or free routine. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Track the kmem_cache used for non-page KVM MMU memory caches instead of passing in the associated kmem_cache when filling the cache. This will allow consolidating code and other cleanups. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703023545.8771-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 09 Jul 2020, 14 commits
-
-
Submitted by Vitaly Kuznetsov
An OVMF-booted guest running on shadow pages crashes with a TRIPLE FAULT after enabling paging from SMM. The crash is triggered from mmu_check_root() and is caused by kvm_is_visible_gfn() searching through memslots with as_id = 0 while the vCPU may be in a different context (address space). Introduce kvm_vcpu_is_visible_gfn() and use it from mmu_check_root(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200708140023.1476020-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
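A sketch of the new helper, per the description; it simply resolves the memslot in the vCPU's current address space:

```c
bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	/* Honors the vCPU's address space (e.g. SMM) instead of as_id = 0. */
	struct kvm_memory_slot *memslot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);

	return kvm_is_visible_memslot(memslot);
}
```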
-
Submitted by Sean Christopherson
Rename KVM's accessor for retrieving a 'struct kvm_mmu_page' from the associated host physical address to better convey what the function is doing. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Introduce sptep_to_sp() to reduce the boilerplate code needed to get the shadow page associated with a spte pointer, and to improve readability as it's not immediately obvious that "page_header" is a KVM-specific accessor for retrieving a shadow page. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
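A sketch of the pair; to_shadow_page() is the renamed accessor from the previous entry:

```c
static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
{
	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);

	return (struct kvm_mmu_page *)page_private(page);
}

static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
{
	/* A spte pointer lands inside sp->spt, i.e. inside the shadow page. */
	return to_shadow_page(__pa(sptep));
}
```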
-
Submitted by Sean Christopherson
Add mmu/mmu_internal.h to hold declarations and definitions that need to be shared between various mmu/ files, but should not be used by anything outside of the MMU. Begin populating mmu_internal.h with declarations of the helpers used by page_track.c. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Move kvm_mmu_available_pages() from mmu.h to mmu.c; it has a single caller and has no business being exposed via mmu.h. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622202034.15093-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Propagate any error returned by make_mmu_pages_available() out to userspace instead of resuming the guest if the error occurs while handling a page fault. Now that zapping the oldest MMU pages skips active roots, i.e. fails if and only if there are no zappable pages, there is no chance for a false positive, i.e. no chance of returning a spurious error to userspace. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623193542.7554-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Use the recently introduced kvm_mmu_zap_oldest_mmu_pages() to batch zap MMU pages when shrinking a slab. This fixes a long-standing issue where KVM's shrinker implementation is completely ineffective due to zapping only a single page. E.g. without batch zapping, forcing a scan via drop_caches basically has no impact on a VM with ~2k shadow pages. With batch zapping, the number of shadow pages can be reduced to a few hundred pages in one or two runs of drop_caches. Note, if the default batch size (currently 128) is problematic, e.g. zapping 128 pages holds mmu_lock for too long, KVM can bound the batch size by setting @batch in mmu_shrinker. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623193542.7554-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
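A condensed sketch of the shrinker's scan callback after the change; pick_victim_vm() is a hypothetical stand-in for the real vm_list iteration and locking:

```c
static unsigned long mmu_shrink_scan(struct shrinker *shrink,
				     struct shrink_control *sc)
{
	struct kvm *kvm = pick_victim_vm(); /* hypothetical; walks vm_list */
	unsigned long freed;

	spin_lock(&kvm->mmu_lock);
	/* Zap a batch per scan instead of a single page. */
	freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
	spin_unlock(&kvm->mmu_lock);

	return freed;
}
```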
-
Submitted by Sean Christopherson
Collect MMU pages for zapping in a loop when making MMU pages available, and skip over active roots when doing so, as zapping an active root can never immediately free up a page. Batching the zapping avoids multiple remote TLB flushes and remedies the issue where the loop would bail early if an active root was encountered. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623193542.7554-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Delete a shadow page from the invalidation list instead of throwing it back on the list of active pages when it's a root shadow page with active users. Invalid active root pages will be explicitly freed by mmu_free_root_page() when the root_count hits zero, i.e. they don't need to be put on the active list to avoid leakage. Use sp->role.invalid to detect that a shadow page has already been zapped, i.e. is not on a list. WARN if an invalid page is encountered when zapping pages, as it should now be impossible. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623193542.7554-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Skip the unsync checks and the write flooding clearing for fully direct MMUs, which are guaranteed to not have unsync'd or indirect pages (write flooding detection only applies to indirect pages). For TDP, this avoids unnecessary memory reads and writes, and for the write flooding count will also avoid dirtying a cache line (unsync_child_bitmap itself consumes a cache line, i.e. write_flooding_count is guaranteed to be in a different cache line than parent_ptes). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623194027.23135-3-sean.j.christopherson@intel.com> Reviewed-by: Jon Cargille <jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Refactor for_each_valid_sp() to take the list of shadow pages instead of retrieving it from a gfn to avoid doing the gfn->list hash and lookup multiple times during kvm_mmu_get_page(). Cc: Peter Feiner <pfeiner@google.com> Cc: Jon Cargille <jcargill@google.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623194027.23135-2-sean.j.christopherson@intel.com> Reviewed-by: Jon Cargille <jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Drop kvm_arch_write_log_dirty() in favor of invoking .write_log_dirty() directly from FNAME(update_accessed_dirty_bits). "kvm_arch" is usually used for x86 functions that are invoked from generic KVM, and implies that there are external callers, neither of which is true here. Remove the check for a non-NULL kvm_x86_ops hook as the call is wrapped in PTTYPE_EPT and is unconditionally set by VMX. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Vitaly Kuznetsov
Unlike normal 'int' functions returning '0' on success, kvm_setup_async_pf()/kvm_arch_setup_async_pf() return '1' when a job to handle the page fault asynchronously was scheduled and '0' otherwise. To avoid the confusion, change the return type to 'bool'. No functional change intended. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200615121334.91300-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Vitaly Kuznetsov
KVM guest code in Linux enables APF only when KVM_FEATURE_ASYNC_PF_INT is supported; this means we will never see KVM_PV_REASON_PAGE_READY when handling a page fault vmexit in KVM. While at it, make sure we only follow the genuine page fault path when the APF reason is zero. If we happen to see something else, it means that the underlying hypervisor is misbehaving. Leave a WARN_ON_ONCE() to catch that. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 30 Jun 2020, 1 commit
-
-
Submitted by Paolo Bonzini
Bit 8 would be the "global" bit, which does not quite make sense for non-leaf page table entries. Intel ignores it; AMD ignores it in PDEs and PDPEs, but reserves it in PML4Es. Probably, earlier versions of the AMD manual documented it as reserved in PDPEs as well, and that behavior made it into KVM as well as kvm-unit-tests; fix it. Cc: stable@vger.kernel.org Reported-by: Nadav Amit <namit@vmware.com> Fixes: a0c0feb5 ("KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD", 2014-09-03) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 23 Jun 2020, 1 commit
-
-
Submitted by Sean Christopherson
Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all intents and purposes is vmx_write_pml_buffer(), instead of having the latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS. If the dirty bit update is the result of KVM emulation (rare for L2), then the GPA in the VMCS may be stale and/or hold a completely unrelated GPA. Fixes: c5f983f6 ("nVMX: Implement emulated Page Modification Logging") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 01 Jun 2020, 2 commits
-
-
Submitted by Vitaly Kuznetsov
Currently, the APF mechanism relies on the #PF abuse where the token is being passed through CR2. If we switch to using interrupts to deliver page-ready notifications, we need a different way to pass the data. Extend the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While at it, rename 'reason' to 'flags'. This doesn't change the semantics, as we only have reasons '1' and '2' and these can be treated as bit flags, but KVM_PV_REASON_PAGE_READY is going away with interrupt-based delivery, making the 'reason' name misleading. The newly introduced apf_put_user_ready() temporarily puts both flags and token information; this will be changed to put the token only when we switch to interrupt-based notifications. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
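The layout after the change, per the description (the structure is shared between guest and host):

```c
struct kvm_vcpu_pv_apf_data {
	__u32 flags;   /* was 'reason'; signals page-not-present */
	__u32 token;   /* new: identifies the fault for page-ready delivery */
	__u8  pad[56];
	__u32 enabled;
};
```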
-
Submitted by Paolo Bonzini
This allows fetching the registers from the hsave area when setting up the NPT shadow MMU, and is needed for KVM_SET_NESTED_STATE (which runs long after the CR0, CR4 and EFER values in vcpu have been switched to hold L2 guest state). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 28 May 2020, 3 commits
-
-
Submitted by Peng Hao (Richard)
pic_in_kernel(), ioapic_in_kernel() and irqchip_kernel() have the same implementation; consolidate them. Signed-off-by: Peng Hao <richard.peng@oppo.com> Message-Id: <HKAPR02MB4291D5926EA10B8BFE9EA0D3E0B70@HKAPR02MB4291.apcprd02.prod.outlook.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
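One way the duplication collapses (condensed sketch; the real helpers also contain a memory barrier):

```c
static inline int irqchip_kernel(struct kvm *kvm)
{
	return kvm->arch.irqchip_mode == KVM_IRQCHIP_KERNEL;
}

static inline int pic_in_kernel(struct kvm *kvm)
{
	return irqchip_kernel(kvm);
}

static inline int ioapic_in_kernel(struct kvm *kvm)
{
	return irqchip_kernel(kvm);
}
```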
-
Submitted by Paolo Bonzini
We can simply look at bits 52-53 to identify MMIO entries in KVM's page tables. Therefore, there is no need to pass a mask to kvm_mmu_set_mmio_spte_mask. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
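A sketch of the identification: bits 52-53 sit above the architectural GFN, so no caller-supplied mask is needed:

```c
static u64 shadow_mmio_mask;  /* which bits to test, e.g. 3ULL << 52 */
static u64 shadow_mmio_value; /* the value those bits hold for MMIO SPTEs */

static bool is_mmio_spte(u64 spte)
{
	return (spte & shadow_mmio_mask) == shadow_mmio_value;
}
```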
-
Submitted by Sean Christopherson
Set the mmio_value to '0' instead of simply clearing the present bit to squash a benign warning in kvm_mmu_set_mmio_spte_mask() that complains about the mmio_value overlapping the lower GFN mask on systems with 52 bits of PA space. Opportunistically clean up the code and comments. Cc: stable@vger.kernel.org Fixes: d43e2675 ("KVM: x86: only do L1TF workaround on affected processors") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200527084909.23492-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 19 May 2020, 2 commits
-
-
Submitted by Thomas Gleixner
While working on the entry consolidation I stumbled over the KVM async page fault handler and kvm_async_pf_task_wait() in particular. It took me a while to realize that the randomly sprinkled rcu_irq_enter()/exit() invocations are just cargo-cult programming. Several patches "fixed" RCU splats by curing the symptoms without noticing that the code is flawed from a design perspective. The main problem is that this async injection is not based on a proper handshake mechanism and only respects the minimal requirement, i.e. that the guest is not in a state where it has interrupts disabled. Aside from that, the actual code is a convoluted one-fits-it-all Swiss army knife. It is invoked from different places with different RCU constraints:

1) Host side: vcpu_enter_guest() -> kvm_x86_ops->handle_exit() -> kvm_handle_page_fault() -> kvm_async_pf_task_wait(). The invocation happens from fully preemptible context.

2) Guest side. The async page fault interrupted:
   a) user space
   b) preemptible kernel code which is not in an RCU read-side critical section
   c) non-preemptible kernel code, or an RCU read-side critical section, or kernel code with CONFIG_PREEMPTION=n, which allows not differentiating between #2b and #2c.

RCU is watching for: #1, as the vCPU exited and current is definitely not the idle task; #2a, as the #PF entry code on the guest went through enter_from_user_mode(), which reactivates RCU; and #2b, as there is no preemptible, interrupts-enabled code in the kernel which can run with RCU looking away (the idle task is always non-preemptible). I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU voodoo at all. In #2c RCU is eventually not watching, but as that state cannot schedule anyway there is no point worrying about it, so it has to invoke rcu_irq_enter() before running that code. This can be optimized, but that will be done as an extra step in the course of the entry code consolidation work.

So the proper solution is to:
- Split kvm_async_pf_task_wait() into schedule- and halt-based waiting interfaces which share the enqueueing code.
- Add comments (a condensed form of this changelog) to spare others the time waste and pain of reverse engineering all of this with the help of incomprehensible changelogs and code history.
- Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(), user mode, and schedulable kernel-side async page faults (#1, #2a, #2b).
- Invoke kvm_async_pf_task_wait_halt() for the non-schedulable kernel case (#2c). For this case also remove the rcu_irq_exit()/enter() pair around the halt, as it is just a pointless exercise: vCPUs can VMEXIT at any random point and can be scheduled out for an arbitrary amount of time by the host, and this is no different except that it voluntarily triggers the exit via halt; moreover, the interrupted context could have RCU watching already, so the rcu_irq_exit() before the halt gains nothing aside from confusing the reader. Claiming that this might prevent RCU stalls is just an illusion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.262701431@linutronix.de
-
Submitted by Paolo Bonzini
KVM stores the gfn in MMIO SPTEs as a caching optimization. These are split in two parts, as in "[high 11111 low]", to thwart any attempt to use these bits in an L1TF attack. This works as long as there are 5 free bits between MAXPHYADDR and bit 50 (inclusive), leaving bit 51 free so that the MMIO access triggers a reserved-bit-set page fault. The bit positions however were computed wrongly for AMD processors that have encryption support. In this case, x86_phys_bits is reduced (for example from 48 to 43, to account for the C bit at position 47 and four bits used internally to store the SEV ASID and other stuff) while x86_cache_bits would remain set to 48, and _all_ bits between the reduced MAXPHYADDR and bit 51 are set. Then low_phys_bits would also cover some of the bits that are set in the shadow_mmio_value, terribly confusing the gfn caching mechanism. To fix this, avoid splitting gfns as long as the processor does not have the L1TF bug (which includes all AMD processors). When there is no splitting, low_phys_bits can be set to the reduced MAXPHYADDR, removing the overlap. This fixes "npt=0" operation on EPYC processors. Thanks to Maxim Levitsky for bisecting this bug. Cc: stable@vger.kernel.org Fixes: 52918ed5 ("KVM: SVM: Override default MMIO mask if memory encryption is enabled") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
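A condensed sketch of the corrected mask computation (following kvm_mmu_reset_all_pte_masks(); simplified):

```c
	shadow_nonpresent_or_rsvd_mask = 0;
	/* On SEV-capable AMD parts this is the already-reduced MAXPHYADDR. */
	low_phys_bits = boot_cpu_data.x86_phys_bits;

	/* Only split the MMIO gfn on CPUs actually affected by L1TF. */
	if (boot_cpu_has_bug(X86_BUG_L1TF) &&
	    !WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
			  52 - shadow_nonpresent_or_rsvd_mask_len)) {
		low_phys_bits = boot_cpu_data.x86_cache_bits
			- shadow_nonpresent_or_rsvd_mask_len;
		shadow_nonpresent_or_rsvd_mask =
			rsvd_bits(low_phys_bits,
				  boot_cpu_data.x86_cache_bits - 1);
	}

	shadow_nonpresent_or_rsvd_lower_gfn_mask =
		GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
```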
-
- 16 May 2020, 3 commits
-
-
Submitted by Sean Christopherson
Add a helper, mmu_alloc_root(), to consolidate the allocation of a root shadow page, which has the same basic mechanics for all flavors of TDP and shadow paging. Note, __pa(sp->spt) doesn't need to be protected by mmu_lock; sp->spt points at a kernel page. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428023714.31923-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
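A sketch of the consolidated helper (mirrors the shape described above):

```c
static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
			    u8 level, bool direct)
{
	struct kvm_mmu_page *sp;

	spin_lock(&vcpu->kvm->mmu_lock);
	if (make_mmu_pages_available(vcpu)) {
		spin_unlock(&vcpu->kvm->mmu_lock);
		return INVALID_PAGE;
	}
	sp = kvm_mmu_get_page(vcpu, gfn, gva, level, direct, ACC_ALL);
	++sp->root_count;
	spin_unlock(&vcpu->kvm->mmu_lock);

	/* sp->spt points at a kernel page, so no mmu_lock needed here. */
	return __pa(sp->spt);
}
```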
-
Submitted by Sean Christopherson
Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's enums are borderline impossible to remember and result in code that is visually difficult to audit, e.g.

```c
if (!enable_ept)
	ept_lpage_level = 0;
else if (cpu_has_vmx_ept_1g_page())
	ept_lpage_level = PT_PDPE_LEVEL;
else if (cpu_has_vmx_ept_2m_page())
	ept_lpage_level = PT_DIRECTORY_LEVEL;
else
	ept_lpage_level = PT_PAGE_TABLE_LEVEL;
```

versus

```c
if (!enable_ept)
	ept_lpage_level = 0;
else if (cpu_has_vmx_ept_1g_page())
	ept_lpage_level = PG_LEVEL_1G;
else if (cpu_has_vmx_ept_2m_page())
	ept_lpage_level = PG_LEVEL_2M;
else
	ept_lpage_level = PG_LEVEL_4K;
```

No functional change intended. Suggested-by: Barret Rhoden <brho@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Sean Christopherson
Rename PT_MAX_HUGEPAGE_LEVEL to KVM_MAX_HUGEPAGE_LEVEL and make it a separate define in anticipation of dropping KVM's PT_*_LEVEL enums in favor of the kernel's PG_LEVEL_* enums. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 14 May 2020, 1 commit
-
-
Submitted by Sean Christopherson
Snapshot the TDP level now that it's invariant (SVM) or dependent only on host capabilities and guest CPUID (VMX). This avoids having to call kvm_x86_ops.get_tdp_level() when initializing a TDP MMU and/or calculating the page role, and thus avoids the associated retpoline. Drop the WARN in vmx_get_tdp_level() as updating CPUID while L2 is active is legal, if dodgy. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 21 Apr 2020, 3 commits
-
-
Submitted by Paolo Bonzini
Create a new function kvm_is_visible_memslot() and use it from kvm_is_visible_gfn(); use the new function in try_async_pf() too, to avoid an extra memslot lookup. Opportunistically squish a multi-line comment into a single-line comment. Note, the end result, KVM_PFN_NOSLOT, is unchanged. Cc: Jim Mattson <jmattson@google.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
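A sketch of the factored-out predicate, per the description:

```c
bool kvm_is_visible_memslot(struct kvm_memory_slot *memslot)
{
	return (memslot && memslot->id < KVM_USER_MEM_SLOTS &&
		!(memslot->flags & KVM_MEMSLOT_INVALID));
}
```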
-
Submitted by Sean Christopherson
Explicitly set @writable to false in try_async_pf() if the GFN->PFN translation is short-circuited due to the requested GFN not being visible to L2. Leaving @writable ('map_writable' in the callers) uninitialized is ok in that it's never actually consumed, but one has to track it all the way through set_spte() being short-circuited by set_mmio_spte() to understand that the uninitialized variable is benign, and relying on @writable being ignored is an unnecessary risk. Explicitly setting @writable also aligns try_async_pf() with __gfn_to_pfn_memslot(). Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415214414.10194-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
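A condensed sketch of the explicit initialization in try_async_pf():

```c
	/* Don't expose private memslots to L2. */
	if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
		*pfn = KVM_PFN_NOSLOT;
		*writable = false; /* aligns with __gfn_to_pfn_memslot() */
		return false;
	}
```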
-
Submitted by Sean Christopherson
Rename functions and variables in kvm_mmu_new_cr3() and related code to replace "cr3" with "pgd", i.e. continue the work started by commit 727a7e27 ("KVM: x86: rename set_cr3 callback and related flags to load_mmu_pgd"). kvm_mmu_new_cr3() and company are not always loading a new CR3, e.g. when nested EPT is enabled "cr3" is actually an EPTP. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-37-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-