提交 · 46a4a014c48e64e28970ca775bb7adf4778821af · openanolis / cloud-kernel

13 11月, 2019 8 次提交

kvm: x86: mmu: Recovery of shattered NX large pages · 46a4a014

由 Junaid Shahid 提交于 11月 01, 2019

commit 1aa9b9572b10529c2e64e2b8f44025d86e124308 upstream.

The page table pages corresponding to broken down large pages are zapped in
FIFO order, so that the large page can potentially be recovered, if it is
not longer being used for execution.  This removes the performance penalty
for walking deeper EPT page tables.

By default, one large page will last about one hour once the guest
reaches a steady state.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

46a4a014

kvm: mmu: ITLB_MULTIHIT mitigation · 5219505f

由 Paolo Bonzini 提交于 11月 04, 2019

commit b8e8c8303ff28c61046a4d0f6ea99aea609a7dc0 upstream.

With some Intel processors, putting the same virtual address in the TLB
as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
and cause the processor to issue a machine check resulting in a CPU lockup.

Unfortunately when EPT page tables use huge pages, it is possible for a
malicious guest to cause this situation.

Add a knob to mark huge pages as non-executable. When the nx_huge_pages
parameter is enabled (and we are using EPT), all huge pages are marked as
NX. If the guest attempts to execute in one of those pages, the page is
broken down into 4K pages, which are then marked executable.

This is not an issue for shadow paging (except nested EPT), because then
the host is in control of TLB flushes and the problematic situation cannot
happen. With nested EPT, again the nested guest can cause problems shadow
and direct EPT is treated in the same way.

[ tglx: Fixup default to auto and massage wording a bit ]
Originally-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

5219505f

KVM: x86: add tracepoints around __direct_map and FNAME(fetch) · 37dfbc8b

由 Paolo Bonzini 提交于 7月 04, 2019

commit 335e192a3fa415e1202c8b9ecdaaecd643f823cc upstream.

These are useful in debugging shadow paging.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

37dfbc8b

KVM: x86: change kvm_mmu_page_get_gfn BUG_ON to WARN_ON · 9ef1fae2

由 Paolo Bonzini 提交于 6月 30, 2019

commit e9f2a760b158551bfbef6db31d2cae45ab8072e5 upstream.

Note that in such a case it is quite likely that KVM will BUG_ON
in __pte_list_remove when the VM is closed.  However, there is no
immediate risk of memory corruption in the host so a WARN_ON is
enough and it lets you gather traces for debugging.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

9ef1fae2

KVM: x86: remove now unneeded hugepage gfn adjustment · b182093d

由 Paolo Bonzini 提交于 6月 23, 2019

commit d679b32611c0102ce33b9e1a4e4b94854ed1812a upstream.

After the previous patch, the low bits of the gfn are masked in
both FNAME(fetch) and __direct_map, so we do not need to clear them
in transparent_hugepage_adjust.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

b182093d

KVM: x86: make FNAME(fetch) and __direct_map more similar · e79234ce

由 Paolo Bonzini 提交于 6月 24, 2019

commit 3fcf2d1bdeb6a513523cb2c77012a6b047aa859c upstream.

These two functions are basically doing the same thing through
kvm_mmu_get_page, link_shadow_page and mmu_set_spte; yet, for historical
reasons, their code looks very different.  This patch tries to take the
best of each and make them very similar, so that it is easy to understand
changes that apply to both of them.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

e79234ce

kvm: mmu: Do not release the page inside mmu_set_spte() · 8aaac306

由 Junaid Shahid 提交于 1月 03, 2019

commit 43fdcda96e2550c6d1c46fb8a78801aa2f7276ed upstream.

Release the page at the call-site where it was originally acquired.
This makes the exit code cleaner for most call sites, since they
do not need to duplicate code between success and the failure
label.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

8aaac306

kvm: Convert kvm_lock to a mutex · 30d8d8d6

由 Junaid Shahid 提交于 1月 03, 2019

commit 0d9ce162cf46c99628cc5da9510b959c7976735b upstream.

It doesn't seem as if there is any particular need for kvm_lock to be a
spinlock, so convert the lock to a mutex so that sleepable functions (in
particular cond_resched()) can be called while holding it.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

30d8d8d6

16 9月, 2019 1 次提交

kvm: mmu: Fix overflow on kvm mmu page limit calculation · 163b24b1

由 Ben Gardon 提交于 4月 08, 2019

[ Upstream commit bc8a3d8925a8fa09fa550e0da115d95851ce33c6 ]

KVM bases its memory usage limits on the total number of guest pages
across all memslots. However, those limits, and the calculations to
produce them, use 32 bit unsigned integers. This can result in overflow
if a VM has more guest pages that can be represented by a u32. As a
result of this overflow, KVM can use a low limit on the number of MMU
pages it will allocate. This makes KVM unable to map all of guest memory
at once, prompting spurious faults.

Tested: Ran all kvm-unit-tests on an Intel Haswell machine. This patch
	introduced no new failures.
Signed-off-by: NBen Gardon <bgardon@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>

163b24b1

07 8月, 2019 1 次提交

x86: kvm: avoid constant-conversion warning · 80f58147

由 Arnd Bergmann 提交于 7月 12, 2019

[ Upstream commit a6a6d3b1f867d34ba5bd61aa7bb056b48ca67cff ]

clang finds a contruct suspicious that converts an unsigned
character to a signed integer and back, causing an overflow:

arch/x86/kvm/mmu.c:4605:39: error: implicit conversion from 'int' to 'u8' (aka 'unsigned char') changes value from -205 to 51 [-Werror,-Wconstant-conversion]
                u8 wf = (pfec & PFERR_WRITE_MASK) ? ~w : 0;
                   ~~                               ^~
arch/x86/kvm/mmu.c:4607:38: error: implicit conversion from 'int' to 'u8' (aka 'unsigned char') changes value from -241 to 15 [-Werror,-Wconstant-conversion]
                u8 uf = (pfec & PFERR_USER_MASK) ? ~u : 0;
                   ~~                              ^~
arch/x86/kvm/mmu.c:4609:39: error: implicit conversion from 'int' to 'u8' (aka 'unsigned char') changes value from -171 to 85 [-Werror,-Wconstant-conversion]
                u8 ff = (pfec & PFERR_FETCH_MASK) ? ~x : 0;
                   ~~                               ^~

Add an explicit cast to tell clang that everything works as
intended here.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Link: https://github.com/ClangBuiltLinux/linux/issues/95Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>

80f58147

03 7月, 2019 1 次提交

KVM: x86/mmu: Allocate PAE root array when using SVM's 32-bit NPT · 01a02a98

由 Sean Christopherson 提交于 6月 13, 2019

commit b6b80c78af838bef17501416d5d383fedab0010a upstream.

SVM's Nested Page Tables (NPT) reuses x86 paging for the host-controlled
page walk.  For 32-bit KVM, this means PAE paging is used even when TDP
is enabled, i.e. the PAE root array needs to be allocated.

Fixes: ee6268ba ("KVM: x86: Skip pae_root shadow allocation if tdp enabled")
Cc: stable@vger.kernel.org
Reported-by: NJiri Palecek <jpalecek@web.de>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Cc: Jiri Palecek <jpalecek@web.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

01a02a98

24 3月, 2019 2 次提交

KVM: x86/mmu: Detect MMIO generation wrap in any address space · 656e9e5d

由 Sean Christopherson 提交于 2月 05, 2019

commit e1359e2beb8b0a1188abc997273acbaedc8ee791 upstream.

The check to detect a wrap of the MMIO generation explicitly looks for a
generation number of zero.  Now that unique memslots generation numbers
are assigned to each address space, only address space 0 will get a
generation number of exactly zero when wrapping.  E.g. when address
space 1 goes from 0x7fffe to 0x80002, the MMIO generation number will
wrap to 0x2.  Adjust the MMIO generation to strip the address space
modifier prior to checking for a wrap.

Fixes: 4bd518f1 ("KVM: use separate generations for each address space")
Cc: <stable@vger.kernel.org>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

656e9e5d

KVM: Call kvm_arch_memslots_updated() before updating memslots · 23ad135a

由 Sean Christopherson 提交于 2月 05, 2019

commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.

kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound. x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.

Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots. Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.

Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

23ad135a

06 12月, 2018 1 次提交

kvm: mmu: Fix race in emulated page table writes · 471aca57

由 Junaid Shahid 提交于 10月 31, 2018

commit 0e0fee5c539b61fdd098332e0e2cc375d9073706 upstream.

When a guest page table is updated via an emulated write,
kvm_mmu_pte_write() is called to update the shadow PTE using the just
written guest PTE value. But if two emulated guest PTE writes happened
concurrently, it is possible that the guest PTE and the shadow PTE end
up being out of sync. Emulated writes do not mark the shadow page as
unsync-ed, so this inconsistency will not be resolved even by a guest TLB
flush (unless the page was marked as unsync-ed at some other point).

This is fixed by re-reading the current value of the guest PTE after the
MMU lock has been acquired instead of just using the value that was
written prior to calling kvm_mmu_pte_write().
Signed-off-by: NJunaid Shahid <junaids@google.com>
Reviewed-by: NWanpeng Li <wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

471aca57

01 10月, 2018 1 次提交

KVM: x86: fix L1TF's MMIO GFN calculation · daa07cbc

由 Sean Christopherson 提交于 9月 25, 2018

One defense against L1TF in KVM is to always set the upper five bits
of the *legal* physical address in the SPTEs for non-present and
reserved SPTEs, e.g. MMIO SPTEs.  In the MMIO case, the GFN of the
MMIO SPTE may overlap with the upper five bits that are being usurped
to defend against L1TF.  To preserve the GFN, the bits of the GFN that
overlap with the repurposed bits are shifted left into the reserved
bits, i.e. the GFN in the SPTE will be split into high and low parts.
When retrieving the GFN from the MMIO SPTE, e.g. to check for an MMIO
access, get_mmio_spte_gfn() unshifts the affected bits and restores
the original GFN for comparison.  Unfortunately, get_mmio_spte_gfn()
neglects to mask off the reserved bits in the SPTE that were used to
store the upper chunk of the GFN.  As a result, KVM fails to detect
MMIO accesses whose GPA overlaps the repurprosed bits, which in turn
causes guest panics and hangs.

Fix the bug by generating a mask that covers the lower chunk of the
GFN, i.e. the bits that aren't shifted by the L1TF mitigation.  The
alternative approach would be to explicitly zero the five reserved
bits that are used to store the upper chunk of the GFN, but that
requires additional run-time computation and makes an already-ugly
bit of code even more inscrutable.

I considered adding a WARN_ON_ONCE(low_phys_bits-1 <= PAGE_SHIFT) to
warn if GENMASK_ULL() generated a nonsensical value, but that seemed
silly since that would mean a system that supports VMX has less than
18 bits of physical address space...
Reported-by: NSakari Ailus <sakari.ailus@iki.fi>
Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
Cc: Junaid Shahid <junaids@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: NJunaid Shahid <junaids@google.com>
Tested-by: NSakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

daa07cbc

20 9月, 2018 2 次提交

KVM/MMU: Fix comment in walk_shadow_page_lockless_end() · 9a984586

由 Tianyu Lan 提交于 9月 07, 2018

kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page()
This patch is to fix the commit.
Signed-off-by: NLan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9a984586

KVM: x86: don't reset root in kvm_mmu_setup() · 83b20b28

由 Wei Yang 提交于 9月 07, 2018

Here is the code path which shows kvm_mmu_setup() is invoked after
kvm_mmu_create(). Since kvm_mmu_setup() is only invoked in this code path,
this means the root_hpa and prev_roots are guaranteed to be invalid. And
it is not necessary to reset it again.

    kvm_vm_ioctl_create_vcpu()
        kvm_arch_vcpu_create()
            vmx_create_vcpu()
                kvm_vcpu_init()
                    kvm_arch_vcpu_init()
                        kvm_mmu_create()
        kvm_arch_vcpu_setup()
            kvm_mmu_setup()
                kvm_init_mmu()

This patch set reset_roots to false in kmv_mmu_setup().

Fixes: 50c28f21Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
Reviewed-by: NLiran Alon <liran.alon@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

83b20b28

07 9月, 2018 1 次提交

KVM: Remove obsolete kvm_unmap_hva notifier backend · a35381e1

由 Marc Zyngier 提交于 8月 23, 2018

kvm_unmap_hva is long gone, and we only have kvm_unmap_hva_range to
deal with. Drop the now obsolete code.

Fixes: fb1522e0 ("KVM: update to new mmu_notifier semantic v2")
Cc: James Hogan <jhogan@kernel.org>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NChristoffer Dall <christoffer.dall@arm.com>

a35381e1

30 8月, 2018 4 次提交

KVM: x86: Do not re-{try,execute} after failed emulation in L2 · 6c3dfeb6

由 Sean Christopherson 提交于 8月 23, 2018

Commit a6f177ef ("KVM: Reenter guest after emulation failure if
due to access to non-mmio address") added reexecute_instruction() to
handle the scenario where two (or more) vCPUS race to write a shadowed
page, i.e. reexecute_instruction() is intended to return true if and
only if the instruction being emulated was accessing a shadowed page.
As L0 is only explicitly shadowing L1 tables, an emulation failure of
a nested VM instruction cannot be due to a race to write a shadowed
page and so should never be re-executed.

This fixes an issue where an "MMIO" emulation failure[1] in L2 is all
but guaranteed to result in an infinite loop when TDP is enabled.
Because "cr2" is actually an L2 GPA when TDP is enabled, calling
kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped
case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA
and cause reexecute_instruction() to immediately return true. The
!mmio_info_in_cache() check in kvm_mmu_page_fault() doesn't catch this
case because mmio_info_in_cache() returns false for a nested MMU (the
MMIO caching currently handles L1 only, e.g. to cache nested guests'
GPAs we'd have to manually flush the cache when switching between
VMs and when L1 updated its page tables controlling the nested guest).

Way back when, commit 68be0803 ("KVM: x86: never re-execute
instruction with enabled tdp") changed reexecute_instruction() to
always return false when using TDP under the assumption that KVM would
only get into the emulator for MMIO. Commit 95b3cf69 ("KVM: x86:
let reexecute_instruction work for tdp") effectively reverted that
behavior in order to handle the scenario where emulation failed due to
an access from L1 to the shadow page tables for L2, but it didn't
account for the case where emulation failed in L2 with TDP enabled.

All of the above logic also applies to retry_instruction(), added by
commit 1cb3f3ae ("KVM: x86: retry non-page-table writing
instructions"). An indefinite loop in retry_instruction() should be
impossible as it protects against retrying the same instruction over
and over, but it's still correct to not retry an L2 instruction in
the first place.

Fix the immediate issue by adding a check for a nested guest when
determining whether or not to allow retry in kvm_mmu_page_fault().
In addition to fixing the immediate bug, add WARN_ON_ONCE in the
retry functions since they are not designed to handle nested cases,
i.e. they need to be modified even if there is some scenario in the
future where we want to allow retrying a nested guest.

[1] This issue was encountered after commit 3a2936de ("kvm: mmu:
Don't expose private memslots to L2") changed the page fault path
to return KVM_PFN_NOSLOT when translating an L2 access to a
prive memslot. Returning KVM_PFN_NOSLOT is semantically correct
when we want to hide a memslot from L2, i.e. there effectively is
no defined memory region for L2, but it has the unfortunate side
effect of making KVM think the GFN is a MMIO page, thus triggering
emulation. The failure occurred with in-development code that
deliberately exposed a private memslot to L2, which L2 accessed
with an instruction that is not emulated by KVM.

Fixes: 95b3cf69 ("KVM: x86: let reexecute_instruction work for tdp")
Fixes: 1cb3f3ae ("KVM: x86: retry non-page-table writing instructions")
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

6c3dfeb6

KVM: x86: Default to not allowing emulation retry in kvm_mmu_page_fault · 472faffa

由 Sean Christopherson 提交于 8月 23, 2018

Effectively force kvm_mmu_page_fault() to opt-in to allowing retry to
make it more obvious when and why it allows emulation to be retried.
Previously this approach was less convenient due to retry and
re-execute behavior being controlled by separate flags that were also
inverted in their implementations (opt-in versus opt-out).
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

472faffa

KVM: x86: Merge EMULTYPE_RETRY and EMULTYPE_ALLOW_REEXECUTE · 384bf221

由 Sean Christopherson 提交于 8月 23, 2018

retry_instruction() and reexecute_instruction() are a package deal,
i.e. there is no scenario where one is allowed and the other is not.
Merge their controlling emulation type flags to enforce this in code.
Name the combined flag EMULTYPE_ALLOW_RETRY to make it abundantly
clear that we are allowing re{try,execute} to occur, as opposed to
explicitly requesting retry of a previously failed instruction.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

384bf221

KVM: x86: Invert emulation re-execute behavior to make it opt-in · 8065dbd1

由 Sean Christopherson 提交于 8月 23, 2018

Re-execution of an instruction after emulation decode failure is
intended to be used only when emulating shadow page accesses.  Invert
the flag to make allowing re-execution opt-in since that behavior is
by far in the minority.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

8065dbd1

15 8月, 2018 1 次提交

kvm: x86: Set highest physical address bits in non-present/reserved SPTEs · 28a1f3ac

由 Junaid Shahid 提交于 8月 14, 2018

Always set the 5 upper-most supported physical address bits to 1 for SPTEs
that are marked as non-present or reserved, to make them unusable for
L1TF attacks from the guest. Currently, this just applies to MMIO SPTEs.
(We do not need to mark PTEs that are completely 0 as physical page 0
is already reserved.)

This allows mitigation of L1TF without disabling hyper-threading by using
shadow paging mode instead of EPT.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

28a1f3ac

06 8月, 2018 17 次提交

KVM: x86: Skip pae_root shadow allocation if tdp enabled · ee6268ba

由 Liang Chen 提交于 7月 25, 2018

Considering the fact that the pae_root shadow is not needed when
tdp is in use, skip the pae_root shadow page allocation to allow
mmu creation even not being able to obtain memory from DMA32
zone when particular cgroup cpuset.mems or mempolicy control is
applied.
Signed-off-by: NLiang Chen <liangchen.linux@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ee6268ba

KVM/MMU: Combine flushing remote tlb in mmu_set_spte() · c2a4eadf

由 Tianyu Lan 提交于 7月 24, 2018

mmu_set_spte() flushes remote tlbs for drop_parent_pte/drop_spte()
and set_spte() separately. This may introduce redundant flush. This
patch is to combine these flushes and check flush request after
calling set_spte().
Signed-off-by: NLan Tianyu <Tianyu.Lan@microsoft.com>
Reviewed-by: NJunaid Shahid <junaids@google.com>
Reviewed-by: NXiao Guangrong <xiaoguangrong@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c2a4eadf

KVM/MMU: Simplify __kvm_sync_page() function · 450917b6

由 Tianyu Lan 提交于 7月 18, 2018

Merge check of "sp->role.cr4_pae != !!is_pae(vcpu))" and "vcpu->
arch.mmu.sync_page(vcpu, sp) == 0". kvm_mmu_prepare_zap_page()
is called under both these conditions.
Signed-off-by: NLan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

450917b6

kvm: x86: Add multi-entry LRU cache for previous CR3s · b94742c9

由 Junaid Shahid 提交于 6月 27, 2018

Adds support for storing multiple previous CR3/root_hpa pairs maintained
as an LRU cache, so that the lockless CR3 switch path can be used when
switching back to any of them.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b94742c9

kvm: x86: Flush only affected TLB entries in kvm_mmu_invlpg* · faff8758

由 Junaid Shahid 提交于 6月 29, 2018

This needs a minor bug fix. The updated patch is as follows.

Thanks,
Junaid

------------------------------------------------------------------------------

kvm_mmu_invlpg() and kvm_mmu_invpcid_gva() only need to flush the TLB
entries for the specific guest virtual address, instead of flushing all
TLB entries associated with the VM.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

faff8758

kvm: x86: Skip shadow page resync on CR3 switch when indicated by guest · 956bf353

由 Junaid Shahid 提交于 6月 27, 2018

When the guest indicates that the TLB doesn't need to be flushed in a
CR3 switch, we can also skip resyncing the shadow page tables since an
out-of-sync shadow page table is equivalent to an out-of-sync TLB.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

956bf353

kvm: x86: Support selectively freeing either current or previous MMU root · 08fb59d8

由 Junaid Shahid 提交于 6月 27, 2018

kvm_mmu_free_roots() now takes a mask specifying which roots to free, so
that either one of the roots (active/previous) can be individually freed
when needed.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

08fb59d8

kvm: x86: Add a root_hpa parameter to kvm_mmu->invlpg() · 7eb77e9f

由 Junaid Shahid 提交于 6月 27, 2018

This allows invlpg() to be called using either the active root_hpa
or the prev_root_hpa.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7eb77e9f

kvm: x86: Skip TLB flush on fast CR3 switch when indicated by guest · ade61e28

由 Junaid Shahid 提交于 6月 27, 2018

When PCIDs are enabled, the MSb of the source operand for a MOV-to-CR3
instruction indicates that the TLB doesn't need to be flushed.

This change enables this optimization for MOV-to-CR3s in the guest
that have been intercepted by KVM for shadow paging and are handled
within the fast CR3 switch path.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ade61e28

kvm: vmx: Support INVPCID in shadow paging mode · eb4b248e

由 Junaid Shahid 提交于 6月 27, 2018

Implement support for INVPCID in shadow paging mode as well.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

eb4b248e

kvm: x86: Add ability to skip TLB flush when switching CR3 · afe828d1

由 Junaid Shahid 提交于 6月 27, 2018

Remove the implicit flush from the set_cr3 handlers, so that the
callers are able to decide whether to flush the TLB or not.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

afe828d1

kvm: x86: Use fast CR3 switch for nested VMX · 50c28f21

由 Junaid Shahid 提交于 6月 27, 2018

Use the fast CR3 switch mechanism to locklessly change the MMU root
page when switching between L1 and L2. The switch from L2 to L1 should
always go through the fast path, while the switch from L1 to L2 should
go through the fast path if L1's CR3/EPTP for L2 hasn't changed
since the last time.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

50c28f21

kvm: x86: Support resetting the MMU context without resetting roots · 1c53da3f

由 Junaid Shahid 提交于 6月 27, 2018

This adds support for re-initializing the MMU context in a different
mode while preserving the active root_hpa and the prev_root.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1c53da3f

kvm: x86: Add support for fast CR3 switch across different MMU modes · 0aab33e4

由 Junaid Shahid 提交于 6月 27, 2018

This generalizes the lockless CR3 switch path to be able to work
across different MMU modes (e.g. nested vs non-nested) by checking
that the expected page role of the new root page matches the page role
of the previously stored root page in addition to checking that the new
CR3 matches the previous CR3. Furthermore, instead of loading the
hardware CR3 in fast_cr3_switch(), it is now done in vcpu_enter_guest(),
as by that time the MMU context would be up-to-date with the VCPU mode.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0aab33e4

kvm: x86: Introduce KVM_REQ_LOAD_CR3 · 6e42782f

由 Junaid Shahid 提交于 6月 27, 2018

The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the
current root_hpa.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6e42782f

kvm: x86: Introduce kvm_mmu_calc_root_page_role() · 9fa72119

由 Junaid Shahid 提交于 6月 27, 2018

These functions factor out the base role calculation from the
corresponding kvm_init_*_mmu() functions. The new functions return
what would be the role assigned to a root page in the current VCPU
state. This can be masked with mmu_base_role_mask to derive the base
role.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9fa72119

kvm: x86: Add fast CR3 switch code path · 7c390d35

由 Junaid Shahid 提交于 6月 27, 2018

When using shadow paging, a CR3 switch in the guest results in a VM Exit.
In the common case, that VM exit doesn't require much processing by KVM.
However, it does acquire the MMU lock, which can start showing signs of
contention under some workloads even on a 2 VCPU VM when the guest is
using KPTI. Therefore, we add a fast path that avoids acquiring the MMU
lock in the most common cases e.g. when switching back and forth between
the kernel and user mode CR3s used by KPTI with no guest page table
changes in between.

For now, this fast path is implemented only for 64-bit guests and hosts
to avoid the handling of PDPTEs, but it can be extended later to 32-bit
guests and/or hosts as well.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7c390d35

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功