- 20 January 2022, 8 commits
-
-
Committed by Alexandre Ghiti
Define precisely the size of the user-accessible virtual address space for the sv32/39/48 mmu types and explain why the whole virtual address space is split into two equal halves between kernel and user space. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Committed by Alexandre Ghiti
Now that the mmu type is determined at runtime using the SATP characteristic, use the global variable pgtable_l4_enabled to report the processor's mmu type through /proc/cpuinfo instead of relying on device tree info. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Committed by Alexandre Ghiti
By adding a new 4th level of page table, allow the 64-bit kernel to address 2^48 bytes of virtual address space: in practice, that offers 128TB of virtual address space to userspace and allows up to 64TB of physical memory. If the underlying hardware does not support sv48, we automatically fall back to a standard 3-level page table by folding the new PUD level into the PGDIR level. To detect hardware capabilities at runtime, we rely on the SATP behaviour of ignoring writes with an unsupported mode. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
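As a rough illustration of the SATP-based probing mentioned above — a sketch only, where the helper name, the scratch page table argument and the surrounding setup are assumptions rather than the actual arch/riscv code:

    /*
     * Hardware ignores SATP writes whose MODE field is unsupported, so
     * program sv48 and check whether the value sticks; if it does not,
     * fold the PUD level and stay on sv39.
     */
    static __init void sv48_probe(uintptr_t scratch_pgd_pa)
    {
            unsigned long satp_val = (scratch_pgd_pa >> PAGE_SHIFT) | SATP_MODE_48;

            csr_write(CSR_SATP, satp_val);
            pgtable_l4_enabled = (csr_read(CSR_SATP) == satp_val);
            /* Return to bare mode and flush any stale translations. */
            csr_write(CSR_SATP, 0);
            local_flush_tlb_all();
    }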
-
Committed by Alexandre Ghiti
With 4-level page table folding at runtime, we don't know at compile time the size of the virtual address space, so we must set VA_BITS dynamically so that sparsemem reserves the right amount of memory for struct pages. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Committed by Alexandre Ghiti
This simply gathers the different pt_ops initializations into functions, with comments added to explain why the page table operations must change along the boot process. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Committed by Alexandre Ghiti
Now that the kasan shadow region is next to the kernel, for sv48 this region won't be aligned on PGDIR_SIZE, so when populating it we'll need to descend to lower levels of the page table. So instead of reimplementing the page table walk for the early population, take advantage of the existing functions used for the final population. Note that kasan swapper initialization must also be split, since memblock is not initialized at this point and, as the last PGD is shared with the kernel, we'd need to allocate a PUD; so postpone the kasan final population until after the kernel population is done. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Committed by Alexandre Ghiti
Now that KASAN_SHADOW_OFFSET is defined at compile time as a config option, this value must remain constant whatever the size of the virtual address space, which is only possible by pushing this region to the end of the address space, next to the kernel mapping. Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
Committed by Alexandre Ghiti
The CONFIG_MAXPHYSMEM_* options are actually never used; even the nommu defconfigs selecting MAXPHYSMEM_2GB had no effect on PAGE_OFFSET, since it was preempted by the !MMU case right before. In addition, the move of the kernel mapping to the end of the address space broke the use of MAXPHYSMEM_2GB with MMU, since it defines PAGE_OFFSET at the same address as the kernel mapping. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 2bfc6cd8 ("riscv: Move kernel mapping outside of linear mapping") Signed-off-by: Alexandre Ghiti <alexandre.ghiti@canonical.com> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Conor Dooley <Conor.Dooley@microchip.com> Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
-
- 14 November 2021, 4 commits
-
-
Committed by Sven Schnelle
commit 8779e05b ("parisc: Fix ptrace check on syscall return") fixed testing of TI_FLAGS. This uncovered a bug in the test mask. syscall_restore_rfi is only used when the kernel needs to exit to userspace with single or block stepping and the recovery counter enabled. The test however used _TIF_SYSCALL_TRACE_MASK, which includes a lot of bits that shouldn't be tested here. Fix this by using TIF_SINGLESTEP and TIF_BLOCKSTEP directly. I encountered this bug by enabling syscall tracepoints, both in qemu and on real hardware. As soon as I enabled the tracepoint (sys_exit_read, but I guess it doesn't really matter which one), I got random page faults in userspace almost immediately. Signed-off-by: Sven Schnelle <svens@stackframe.org> Signed-off-by: Helge Deller <deller@gmx.de>
-
Committed by John David Anglin
For years, there have been random segmentation faults in userspace on SMP PA-RISC machines. It occurred to me that this might be a problem in set_pte_at(). MIPS and some other architectures do cache flushes when installing PTEs with the present bit set. Here I have adapted the code in update_mmu_cache() to flush the kernel mapping when the kernel flush is deferred, or when the kernel mapping may alias with the user mapping. This simplifies calls to update_mmu_cache(). I also changed the barrier in set_pte() from a compiler barrier to a full memory barrier. I know this change is not sufficient to fix the problem. It might not be needed. I have had a few days of operation with 5.14.16 to 5.15.1 and haven't seen any random segmentation faults on rp3440 or c8000 so far. Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: Helge Deller <deller@gmx.de> Cc: stable@kernel.org # 5.12+
-
Committed by Helge Deller
Signed-off-by: Helge Deller <deller@gmx.de>
-
Committed by Helge Deller
I noticed that sometimes at kernel startup the backtraces did not include the function names of init functions. Their addresses were not resolved to function names and instead only the address was printed. Debugging shows that the culprit is is_ksym_addr(), which is called by the backtrace functions to check if an address belongs to a function in the kernel. The problem occurs only for CONFIG_KALLSYMS_ALL=y. When looking at is_ksym_addr() one can see that for CONFIG_KALLSYMS_ALL=y the function only tries to resolve the address via the is_kernel() function, which checks like this: if (addr >= _stext && addr <= _end) return 1; On parisc the init functions are located before _stext, so this check fails. Other platforms seem to have all functions (including init functions) behind _stext. The following patch moves the _stext symbol to the beginning of the kernel and thus includes the init section. This fixes the check and does not seem to have any negative side effects on where the kernel mapping happens in the map_pages() function in arch/parisc/mm/init.c. Signed-off-by: Helge Deller <deller@gmx.de> Cc: stable@kernel.org # 5.4+
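For reference, the check the commit quotes looks roughly like this (a sketch based on the commit text, not a verbatim copy of any particular kernel version's kallsyms.h); init text placed before _stext falls outside the [_stext, _end] window and is therefore never resolved:

    static inline int is_kernel(unsigned long addr)
    {
            /* Init functions located below _stext fail this range check. */
            if (addr >= (unsigned long)_stext &&
                addr <= (unsigned long)_end)
                    return 1;
            return in_gate_area_no_mm(addr);
    }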
-
- 13 November 2021, 3 commits
-
-
Committed by Eric W. Biederman
kernel test robot <oliver.sang@intel.com> writes[1]: > > Greeting, > > FYI, we noticed the following commit (built with gcc-9): > > commit: 1a4d21a2 ("signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON") > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master > > in testcase: trinity > version: trinity-static-i386-x86_64-1c734c75-1_2020-01-06 > with following parameters: > > > [ 70.645554][ T3747] kernel BUG at arch/x86/kernel/vm86_32.c:109! > [ 70.646185][ T3747] invalid opcode: 0000 [#1] SMP > [ 70.646682][ T3747] CPU: 0 PID: 3747 Comm: trinity-c6 Not tainted 5.15.0-rc1-00009-g1a4d21a2 #1 > [ 70.647598][ T3747] EIP: save_v86_state (arch/x86/kernel/vm86_32.c:109 (discriminator 3)) > [ 70.648113][ T3747] Code: 89 c3 64 8b 35 60 b8 25 c2 83 ec 08 89 55 f0 8b 96 10 19 00 00 89 55 ec e8 c6 2d 0c 00 fb 8b 55 ec 85 d2 74 05 83 3a 00 75 02 <0f> 0b 8b 86 10 19 00 00 8b 4b 38 8b 78 48 31 cf 89 f8 8b 7a 4c 81 > [ 70.650136][ T3747] EAX: 00000001 EBX: f5f49fac ECX: 0000000b EDX: f610b600 > [ 70.650852][ T3747] ESI: f5f79cc0 EDI: f5f79cc0 EBP: f5f49f04 ESP: f5f49ef0 > [ 70.651593][ T3747] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246 > [ 70.652413][ T3747] CR0: 80050033 CR2: 00004000 CR3: 35fc7000 CR4: 000406d0 > [ 70.653169][ T3747] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 > [ 70.653897][ T3747] DR6: fffe0ff0 DR7: 00000400 > [ 70.654382][ T3747] Call Trace: > [ 70.654719][ T3747] arch_do_signal_or_restart (arch/x86/kernel/signal.c:792 arch/x86/kernel/signal.c:867) > [ 70.655288][ T3747] exit_to_user_mode_prepare (kernel/entry/common.c:174 kernel/entry/common.c:209) > [ 70.655854][ T3747] irqentry_exit_to_user_mode (kernel/entry/common.c:126 kernel/entry/common.c:317) > [ 70.656450][ T3747] irqentry_exit (kernel/entry/common.c:406) > [ 70.656897][ T3747] exc_page_fault (arch/x86/mm/fault.c:1535) > [ 70.657369][ T3747] ? sysvec_kvm_asyncpf_interrupt (arch/x86/mm/fault.c:1488) > [ 70.657989][ T3747] handle_exception (arch/x86/entry/entry_32.S:1085) vm86_32.c:109 is: "BUG_ON(!vm86 || !vm86->user_vm86)" When trying to understand the failure, Brian Gerst pointed out[2] that the code does not need protection against vm86->user_vm86 being NULL. The copy_from_user code already handles that case if the address is going to fault. Looking further I realized that if we care about not allowing struct vm86plus_struct at address 0, it should be do_sys_vm86 (the system call) that does the filtering, not way down deep when the emulation has completed in save_v86_state. So let's just remove the silly case of attempting to filter a userspace address with a BUG_ON. Existing userspace can't break, and it won't make the kernel any more attackable, as the userspace access helpers will handle it if it isn't a good userspace pointer. I ran the reproducer the fuzzer gave me before I made this change and it reproduced, and after I made this change I have not seen the reported failure. So it does look like this fixes the reported issue. [1] https://lkml.kernel.org/r/20211112074030.GB19820@xsang-OptiPlex-9020 [2] https://lkml.kernel.org/r/CAMzpN2jkK5sAv-Kg_kVnCEyVySiqeTdUORcC=AdG1gV6r8nUew@mail.gmail.com Suggested-by: Brian Gerst <brgerst@gmail.com> Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
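To illustrate why the BUG_ON is redundant, here is a generic sketch of the user-access pattern the commit relies on; the function name and control flow are placeholders for illustration, not the actual vm86_32.c diff:

    static void restore_v86_regs_sketch(struct vm86plus_struct __user *user_vm86)
    {
            struct vm86plus_struct regs;

            /*
             * A NULL or otherwise bad userspace pointer simply makes
             * copy_from_user() fail, so the task can be signalled
             * instead of BUG()ing the whole kernel.
             */
            if (copy_from_user(&regs, user_vm86, sizeof(regs))) {
                    force_sig(SIGSEGV);
                    return;
            }
            /* ... restore register state from regs ... */
    }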
-
Committed by Tony Luck
Add model ID for Raptor Lake. [ dhansen: These get added as soon as possible so that folks doing development can leverage them. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20211112182835.924977-1-tony.luck@intel.com
-
Committed by Dave Jones
Erratum SKX37 is word-for-word identical to the other errata listed in this workaround. I happened to notice this after investigating a CMCI storm on a Skylake host. While I can't confirm this was the root cause, spurious corrected errors do sound like a likely suspect. Fixes: 2976908e ("x86/mce: Do not log spurious corrected mce errors") Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20211029205759.GA7385@codemonkey.org.uk
-
- 12 November 2021, 7 commits
-
-
Committed by Arnd Bergmann
Naresh and Antonio ran into a build failure with latest Debian armhf compilers, with lots of output like tmp/ccY3nOAs.s:2215: Error: selected processor does not support `cpsid i' in ARM mode As it turns out, $(cc-option) fails early here when the FPU is not selected before the CPU architecture is selected, as the compiler option check runs before enabling -msoft-float, which causes a problem when testing a target architecture level without an FPU: cc1: error: '-mfloat-abi=hard': selected architecture lacks an FPU Passing e.g. -march=armv6k+fp in place of -march=armv6k would avoid this issue, but the fallback logic is already broken because all supported compilers (gcc-5 and higher) are much more recent than these options, and building with -march=armv5t as a fallback no longer works. The best way forward that I see is to just remove all the checks, which also has the nice side-effect of slightly improving the startup time for 'make'. The -mtune=marvell-f option was apparently never supported by any mainline compiler, and the custom Codesourcery gcc build that did support it is now too old to build kernels, so just use -mtune=xscale unconditionally for those. This should be safe to apply on all stable kernels, and will be required in order to keep building them with gcc-11 and higher. Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=996419 Reported-by: Antonio Terceiro <antonio.terceiro@linaro.org> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Tested-by: Sebastian Reichel <sebastian.reichel@collabora.com> Tested-by: Klaus Kudielka <klaus.kudielka@gmail.com> Cc: Matthias Klose <doko@debian.org> Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
-
Committed by Michał Mirosław
Currently __set_fixmap() bails out with a warning when called in early boot from early_iounmap(). Fix it, and while at it, make the comment a bit easier to understand. Cc: <stable@vger.kernel.org> Fixes: b089c31c ("ARM: 8667/3: Fix memory attribute inconsistencies when using fixmap") Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
-
Committed by Paolo Bonzini
Use the same cleanup code independent of whether the cgroup to be uncharged and unref'd is the source or the destination cgroup. Use a bool to track whether the destination cgroup has been charged, which also fixes a bug in the error case: the destination cgroup must be uncharged only if it does not match the source. Fixes: b5663931 ("KVM: SEV: Add support for SEV intra host migration") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Paolo Bonzini
When UBSAN is enabled, the code emitted for the call to guest_pv_has includes a call to __ubsan_handle_load_invalid_value. objtool complains that this call happens with UACCESS enabled; to avoid the warning, pull the calls to user_access_begin into both arms of the "if" statement, after the check for guest_pv_has. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Paul Cercueil
Tidy up the tree a bit by prefixing all include/dt-bindings/clock/ files related to Ingenic SoCs with 'ingenic,'. Signed-off-by: Paul Cercueil <paul@crapouillou.net> Acked-by: Rob Herring <robh@kernel.org> Acked-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20211016133322.40771-1-paul@crapouillou.net
-
Committed by Kuan-Ying Lee
There are multiple kasan modes, so it makes sense to add a message saying which kasan mode is active when booting up [1]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=212195 [1] Link: https://lkml.kernel.org/r/20211020094850.4113-1-Kuan-Ying.Lee@mediatek.com Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Yee Lee <yee.lee@mediatek.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Alistair Popple
MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a source page was already locked during migrate_vma_collect(). If it wasn't, then a second attempt is made to lock the page. However, if the first attempt failed it's unlikely a second attempt will succeed, and the retry adds complexity. So clean this up by removing the retry and the MIGRATE_PFN_LOCKED flag. Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag set, but nothing actually checks that. Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ben Skeggs <bskeggs@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 11 November 2021, 18 commits
-
-
Committed by Vitaly Kuznetsov
KVM_CAP_NR_VCPUS is used to get the "recommended" maximum number of VCPUs; arm64/mips/riscv report num_online_cpus(). Powerpc reports either num_online_cpus() or num_present_cpus(), and s390 has multiple constants depending on hardware features. On x86, KVM reports an arbitrary value of '710' which is supposed to be the maximum tested value, but it's possible to test all KVM_MAX_VCPUS even when there are fewer physical CPUs available. Drop the arbitrary '710' value and return num_online_cpus() on x86 as well. The recommendation will match other architectures and will mean 'no CPU overcommit'. For reference, QEMU only queries KVM_CAP_NR_VCPUS to print a warning when the requested vCPU number exceeds it. The static limit of '710' is quite weird as smaller systems with just a few physical CPUs should certainly "recommend" fewer. Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211111134733.86601-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
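The resulting behaviour can be pictured as the following switch arm in the capability-check ioctl; this is only a sketch of the idea described above, abbreviated from the surrounding x86 code:

    case KVM_CAP_NR_VCPUS:
            /*
             * "Recommended" vCPU count: one vCPU per online physical
             * CPU, i.e. no CPU overcommit, matching other architectures.
             */
            r = num_online_cpus();
            break;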
-
Committed by Vipin Sharma
Handle #GP on INVPCID due to an invalid type in the common switch statement instead of relying on the callers (VMX and SVM) to manually validate the type. Unlike INVVPID and INVEPT, INVPCID is not explicitly documented to check the type before reading the operand from memory, so deferring the type validity check until after that point is architecturally allowed. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109174426.2350547-3-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Vipin Sharma
handle_invept(), handle_invvpid(), and handle_invpcid() read the same reg2 field in vmcs.VMX_INSTRUCTION_INFO to get the index of the GPR that holds the invalidation type. Add a helper to retrieve reg2 from the VMX instruction info to consolidate and document the shift+mask magic. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109174426.2350547-2-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
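The shift+mask in question boils down to extracting the reg2 bits of the instruction-information field; a sketch of such a helper, where the name follows the commit's description and may differ from the merged code:

    /* reg2 lives in bits 31:28 of the VM-exit instruction-information field. */
    static inline int vmx_get_instr_info_reg2(u32 vmx_instr_info)
    {
            return (vmx_instr_info >> 28) & 0xf;
    }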
-
Committed by Sean Christopherson
Clean up the x2APIC MSR bitmap interception code for L2, which is the last holdout of open coded bitmap manipulations. Freshen up the SDM/PRM comment, rename the function to make it abundantly clear the funky behavior is x2APIC specific, and explain _why_ vmcs01's bitmap is ignored (the previous comment was flat out wrong for x2APIC behavior). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Add builder macros to generate the MSR bitmap helpers to reduce the amount of copy-paste code, especially with respect to all the magic numbers needed to calc the correct bit location. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Always check vmcs01's MSR bitmap when merging L0 and L1 bitmaps for L2, and always update the relevant bits in vmcs02. This fixes two distinct, but intertwined bugs related to dynamic MSR bitmap modifications. The first issue is that KVM fails to enable MSR interception in vmcs02 for the FS/GS base MSRs if L1 first runs L2 with interception disabled, and later enables interception. The second issue is that KVM fails to honor userspace MSR filtering when preparing vmcs02. Fix both issues simultaneously as fixing only one of the issues (doesn't matter which) would create a mess that no one should have to bisect. Fixing only the first bug would exacerbate the MSR filtering issue as userspace would see inconsistent behavior depending on the whims of L1. Fixing only the second bug (MSR filtering) effectively requires fixing the first, as the nVMX code only knows how to transition vmcs02's bitmap from 1->0. Move the various accessor/mutators that are currently buried in vmx.c into vmx.h so that they can be shared by the nested code. Fixes: 1a155254 ("KVM: x86: Introduce MSR filtering") Fixes: d69129b4 ("KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible") Cc: stable@vger.kernel.org Cc: Alexander Graf <graf@amazon.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Check the current VMCS controls to determine if an MSR write will be intercepted due to MSR bitmaps being disabled. In the nested VMX case, KVM will disable MSR bitmaps in vmcs02 if they're disabled in vmcs12 or if KVM can't map L1's bitmaps for whatever reason. Note, the bad behavior is relatively benign in the current code base as KVM sets all bits in vmcs02's MSR bitmap by default, clears bits if and only if L0 KVM also disables interception of an MSR, and only uses the buggy helper for MSR_IA32_SPEC_CTRL. Because KVM explicitly tests WRMSR before disabling interception of MSR_IA32_SPEC_CTRL, the flawed check will only result in KVM reading MSR_IA32_SPEC_CTRL from hardware when it isn't strictly necessary. Tag the fix for stable in case a future fix wants to use msr_write_intercepted(), in which case a buggy implementation in older kernels could prove subtly problematic. Fixes: d28b387f ("KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211109013047.2041518-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
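Conceptually, the corrected check looks like this; a sketch under the assumption that a bitmap-test helper of the kind added by the related cleanup patches is available, not the literal diff:

    static bool msr_write_intercepted(struct vcpu_vmx *vmx, u32 msr)
    {
            /*
             * If the "use MSR bitmaps" execution control is clear in
             * the *current* VMCS, every MSR access is intercepted
             * regardless of what the bitmap says.
             */
            if (!(exec_controls_get(vmx) & CPU_BASED_USE_MSR_BITMAPS))
                    return true;

            return vmx_test_msr_bitmap_write(vmx->loaded_vmcs->msr_bitmap, msr);
    }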
-
Committed by Vitaly Kuznetsov
KVM: x86: Don't update vcpu->arch.pv_eoi.msr_val when a bogus value was written to MSR_KVM_PV_EOI_EN. When the kvm_gfn_to_hva_cache_init() call from kvm_lapic_set_pv_eoi() fails, the MSR write to MSR_KVM_PV_EOI_EN results in #GP, so it is reasonable to expect that the value we keep internally in KVM wasn't updated. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211108152819.12485-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Vitaly Kuznetsov
kvm_lapic_enable_pv_eoi() is a misnomer as the function is also used to disable PV EOI. Rename it to kvm_lapic_set_pv_eoi(). No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211108152819.12485-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Paul Durrant
Currently when kvm_update_cpuid_runtime() runs, it assumes that the KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true, however, if Hyper-V support is enabled. In this case the KVM leaves will be offset. This patch introduces a new 'kvm_cpuid_base' field into struct kvm_vcpu_arch to track the location of the KVM leaves, and a function kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now target the correct leaf. NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is introduced into processor.h to avoid having duplicate code for the iteration over possible hypervisor base leaves. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Message-Id: <20211105095101.5384-3-pdurrant@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
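A sketch of the signature-based lookup described above; helper and macro names follow the commit text and may not match the merged code exactly:

    static void kvm_update_kvm_cpuid_base(struct kvm_vcpu *vcpu)
    {
            u32 function;
            struct kvm_cpuid_entry2 *entry;

            vcpu->arch.kvm_cpuid_base = 0;

            for_each_possible_hypervisor_cpuid_base(function) {
                    entry = kvm_find_cpuid_entry(vcpu, function, 0);
                    if (entry) {
                            u32 signature[3] = { entry->ebx, entry->ecx, entry->edx };

                            /* "KVMKVMKVM\0\0\0" spread across EBX/ECX/EDX. */
                            if (!memcmp(signature, KVM_SIGNATURE, sizeof(signature))) {
                                    vcpu->arch.kvm_cpuid_base = function;
                                    break;
                            }
                    }
            }
    }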
-
Committed by Sean Christopherson
Move the core logic of SET_CPUID and SET_CPUID2 to a common helper; the only difference between the two ioctls() is the format of the userspace struct. A future fix will add yet more code to the core logic. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211105095101.5384-2-pdurrant@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Junaid Shahid
The fast page fault path bails out on write faults to huge pages in order to accommodate dirty logging. This change adds a check to do that only when dirty logging is actually enabled, so that access tracking for huge pages can still use the fast path for write faults in the common case. Signed-off-by: Junaid Shahid <junaids@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211104003359.2201967-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
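The refined bail-out condition can be sketched as a hypothetical helper; the real change open-codes the test in the fast page fault path, and the helper name here is made up for illustration:

    static bool fast_pf_must_bail_on_write(struct kvm_mmu_page *sp,
                                           const struct kvm_memory_slot *slot)
    {
            /*
             * Only leave the fast path for a write fault on a huge-page
             * SPTE when the backing memslot really has dirty logging
             * enabled; access-tracked huge pages otherwise stay fast.
             */
            return sp->role.level > PG_LEVEL_4K &&
                   kvm_slot_dirty_track_enabled(slot);
    }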
-
Committed by Sean Christopherson
Wrap the read of iter->sptep in tdp_mmu_map_handle_target_level() with rcu_dereference(). Shadow pages in the TDP MMU, and thus their SPTEs, are protected by rcu. This fixes a Sparse warning at tdp_mmu.c:900:51: warning: incorrect type in argument 1 (different address spaces) expected unsigned long long [usertype] *sptep got unsigned long long [noderef] [usertype] __rcu *[usertype] sptep Fixes: 7158bee4 ("KVM: MMU: pass kvm_mmu_page struct to make_spte") Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211103161833.3769487-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Maxim Levitsky
KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected using KVM's standard inject_pending_event path, and not via APICv/AVIC. Since this is a debug feature, just inhibit APICv/AVIC while KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU. Fixes: 61e5f69e ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Tested-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211108090245.166408-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Jim Mattson
These function names sound like predicates, and they have siblings, *is_valid_msr(), which _are_ predicates. Moreover, there are comments that essentially warn that these functions behave unexpectedly. Flip the polarity of the return values, so that they become predicates, and convert the boolean result to a success/failure code at the outer call site. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211105202058.1048757-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by David Woodhouse
In commit b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") we switched to using a gfn_to_pfn_cache for accessing the guest steal time structure in order to allow for an atomic xchg of the preempted field. This has a couple of problems. Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a guest vCPU using an IOMEM page for its steal time would never have its preempted field set. Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it should have been. There are two stages to the GFN->PFN conversion; first the GFN is converted to a userspace HVA, and then that HVA is looked up in the process page tables to find the underlying host PFN. Correct invalidation of the latter would require being hooked up to the MMU notifiers, but that doesn't happen---so it just keeps mapping and unmapping the *wrong* PFN after the userspace page tables change. In the !IOMEM case at least the stale page *is* pinned all the time it's cached, so it won't be freed and reused by anyone else while still receiving the steal time updates. The map/unmap dance only takes care of the KVM administrivia such as marking the page dirty. Until the gfn_to_pfn cache handles the remapping automatically by integrating with the MMU notifiers, we might as well not get a kernel mapping of it, and use the perfectly serviceable userspace HVA that we already have. We just need to implement the atomic xchg on the userspace address with appropriate exception handling, which is fairly trivial. Cc: stable@vger.kernel.org Fixes: b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org> [I didn't entirely agree with David's assessment of the usefulness of the gfn_to_pfn cache, and integrated the outcome of the discussion in the above commit message. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Peter Gonda
For SEV-ES to work with intra host migration, the VMSAs, GHCB metadata, and other SEV-ES info need to be preserved along with the guest's memory. Signed-off-by: Peter Gonda <pgonda@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20211021174303.385706-4-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Peter Gonda
For SEV to work with intra host migration, contents of the SEV info struct such as the ASID (used to index the encryption key in the AMD SP) and the list of memory regions need to be transferred to the target VM. This change adds a command for a target VMM to get a source SEV VM's sev info. Signed-off-by: Peter Gonda <pgonda@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20211021174303.385706-3-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-