1. 25 May 2022, 4 commits
    • KVM: x86: avoid calling x86 emulator without a decoded instruction · fee060cd
      Authored by Sean Christopherson
      Whenever x86_decode_emulated_instruction() detects a breakpoint, it
      returns the value that kvm_vcpu_check_breakpoint() writes into its
      pass-by-reference second argument.  Unfortunately this is completely
      bogus because the expected outcome of x86_decode_emulated_instruction
      is an EMULATION_* value.
      
      Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to
      a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK
      and x86_emulate_instruction() is called without having decoded the
      instruction.  This causes various havoc from running with a stale
      emulation context.
      
      The fix is to move the call to kvm_vcpu_check_breakpoint() where it was
      before commit 4aa2691d ("KVM: x86: Factor out x86 instruction
      emulation with decoding") introduced x86_decode_emulated_instruction().
      The other caller of the function does not need breakpoint checks,
      because it is invoked as part of a vmexit and the processor has already
      checked those before executing the instruction that #GP'd.
      
      This fixes CVE-2022-1852.
      Reported-by: Qiuhao Li <qiuhao@sysec.org>
      Reported-by: Gaoning Pan <pgn@zju.edu.cn>
      Reported-by: Yongkang Jia <kangel@zju.edu.cn>
      Fixes: 4aa2691d ("KVM: x86: Factor out x86 instruction emulation with decoding")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311032801.3467418-2-seanjc@google.com>
      [Rewrote commit message according to Qiuhao's report, since a patch
       already existed to fix the bug. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fee060cd
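      A hedged sketch of the restored ordering, reconstructed from the
      description above; the flow and the abridged failure path are
      assumptions, not the verbatim upstream diff:

        /* Simplified x86_emulate_instruction() entry path; illustrative only. */
        int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                                    int emulation_type, void *insn, int insn_len)
        {
                int r;

                if (!(emulation_type & EMULTYPE_NO_DECODE)) {
                        /*
                         * Check code breakpoints *before* decoding.  The helper's
                         * pass-by-reference result (e.g. 0 for a KVM_EXIT_DEBUG
                         * userspace exit) is returned as-is and can no longer be
                         * mistaken for EMULATION_OK.
                         */
                        if (kvm_vcpu_check_breakpoint(vcpu, &r))
                                return r;

                        r = x86_decode_emulated_instruction(vcpu, emulation_type,
                                                            insn, insn_len);
                        if (r != EMULATION_OK)
                                return handle_emulation_failure(vcpu, emulation_type);
                }

                /* ... emulate the now fully decoded instruction (unchanged) ... */
                return 1;
        }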
    • KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak · d22d2474
      Authored by Ashish Kalra
      For some SEV ioctl interfaces, the length parameter that is passed may be
      less than or equal to SEV_FW_BLOB_MAX_SIZE, but larger than the data
      that PSP firmware returns. In this case, kmalloc will allocate memory
      that is the size of the input rather than the size of the data.
      Since PSP firmware doesn't fully overwrite the allocated buffer, these
      SEV ioctl interfaces may return uninitialized kernel slab memory.
      Reported-by: Andy Nguyen <theflow@google.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Suggested-by: Peter Gonda <pgonda@google.com>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Fixes: eaf78265 ("KVM: SVM: Move SEV code to separate file")
      Fixes: 2c07ded0 ("KVM: SVM: add support for SEV attestation command")
      Fixes: 4cfdd47d ("KVM: SVM: Add KVM_SEV SEND_START command")
      Fixes: d3d1af85 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
      Fixes: eba04b20 ("KVM: x86: Account a variety of miscellaneous allocations")
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Peter Gonda <pgonda@google.com>
      Message-Id: <20220516154310.3685678-1-Ashish.Kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d22d2474
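      A minimal sketch of the allocation pattern the fix applies; the
      surrounding parameter handling is simplified and the variable names are
      illustrative:

        if (params.len > SEV_FW_BLOB_MAX_SIZE)
                return -EINVAL;

        /*
         * kzalloc(), not kmalloc(): the PSP may write fewer than params.len
         * bytes, and the untouched tail of a kmalloc()ed buffer would leak
         * stale slab contents to userspace when the blob is copied out.
         */
        blob = kzalloc(params.len, GFP_KERNEL_ACCOUNT);
        if (!blob)
                return -ENOMEM;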
    • x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave) · d187ba53
      Authored by Sean Christopherson
      Set the starting uABI size of KVM's guest FPU to 'struct kvm_xsave',
      i.e. to KVM's historical uABI size.  When saving FPU state for userspace,
      KVM (well, now the FPU) sets the FP+SSE bits in the XSAVE header even if
      the host doesn't support XSAVE.  Setting the XSAVE header allows the VM
      to be migrated to a host that does support XSAVE without the new host
      having to handle FPU state that may or may not be compatible with XSAVE.
      
      Setting the uABI size to the host's default size results in out-of-bounds
      writes (setting the FP+SSE bits) and data corruption (that is thankfully
      caught by KASAN) when running on hosts without XSAVE, e.g. on Core2 CPUs.
      
      WARN if the default size is larger than KVM's historical uABI size; all
      features that can push the FPU size beyond the historical size must be
      opt-in.
      
        ==================================================================
        BUG: KASAN: slab-out-of-bounds in fpu_copy_uabi_to_guest_fpstate+0x86/0x130
        Read of size 8 at addr ffff888011e33a00 by task qemu-build/681
        CPU: 1 PID: 681 Comm: qemu-build Not tainted 5.18.0-rc5-KASAN-amd64 #1
        Hardware name:  /DG35EC, BIOS ECG3510M.86A.0118.2010.0113.1426 01/13/2010
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x45
         print_report.cold+0x45/0x575
         kasan_report+0x9b/0xd0
         fpu_copy_uabi_to_guest_fpstate+0x86/0x130
         kvm_arch_vcpu_ioctl+0x72a/0x1c50 [kvm]
         kvm_vcpu_ioctl+0x47f/0x7b0 [kvm]
         __x64_sys_ioctl+0x5de/0xc90
         do_syscall_64+0x31/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        Allocated by task 0:
        (stack is not available)
        The buggy address belongs to the object at ffff888011e33800
         which belongs to the cache kmalloc-512 of size 512
        The buggy address is located 0 bytes to the right of
         512-byte region [ffff888011e33800, ffff888011e33a00)
        The buggy address belongs to the physical page:
        page:0000000089cd4adb refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11e30
        head:0000000089cd4adb order:2 compound_mapcount:0 compound_pincount:0
        flags: 0x4000000000010200(slab|head|zone=1)
        raw: 4000000000010200 dead000000000100 dead000000000122 ffff888001041c80
        raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
        Memory state around the buggy address:
         ffff888011e33900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffff888011e33980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        >ffff888011e33a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                           ^
         ffff888011e33a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
         ffff888011e33b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        ==================================================================
        Disabling lock debugging due to kernel taint
      
      Fixes: be50b206 ("kvm: x86: Add support for getting/setting expanded xstate buffer")
      Fixes: c60427dd ("x86/fpu: Add uabi_size to guest_fpu")
      Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
      Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Zdenek Kaspar <zkaspar82@gmail.com>
      Message-Id: <20220504001219.983513-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d187ba53
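      A minimal sketch of the resulting guest-FPU setup, assuming an
      fpu_alloc_guest_fpstate()-style path; the field and config names follow
      the commit message rather than a verified copy of the patch:

        /* Guest FPU uABI sizing; illustrative. */
        gfpu->uabi_size = sizeof(struct kvm_xsave);

        /*
         * Only opt-in (dynamically enabled) xfeatures may grow the uABI
         * buffer, so a host without XSAVE never writes past a userspace
         * buffer sized to struct kvm_xsave.
         */
        if (WARN_ON_ONCE(fpu_user_cfg.default_size > gfpu->uabi_size))
                gfpu->uabi_size = fpu_user_cfg.default_size;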
    • KVM: LAPIC: Trace LAPIC timer expiration on every vmentry · e0ac5351
      Authored by Wanpeng Li
      In commit ec0671d5 ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire
      tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire
      was moved after guest_exit_irqoff() because invoking tracepoints within
      kvm_guest_enter/kvm_guest_exit caused a lockdep splat.
      
      These days this is not necessary, because commit 87fa7f3e ("x86/kvm:
      Move context tracking where it belongs", 2020-07-09) restricted
      the RCU extended quiescent state to be closer to vmentry/vmexit.
      Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate,
      because it will be reported even if vcpu_enter_guest causes multiple
      vmentries via the IPI/Timer fast paths, and it allows the removal of
      advance_expire_delta.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e0ac5351
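      A sketch of the new tracepoint placement in arch/x86/kvm/lapic.c; the
      body is abridged (the dynamic timer-advance tuning is omitted) and is
      an approximation of the code, not the exact patch:

        static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
        {
                struct kvm_lapic *apic = vcpu->arch.apic;
                u64 guest_tsc, tsc_deadline;

                tsc_deadline = apic->lapic_timer.expired_tscdeadline;
                apic->lapic_timer.expired_tscdeadline = 0;
                guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());

                /*
                 * Tracing here fires on every vmentry, including the IPI and
                 * timer fast paths, and removes the need to stash
                 * advance_expire_delta for a post-vmexit tracepoint.
                 */
                trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);

                if (guest_tsc < tsc_deadline)
                        __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
        }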
  2. 20 May 2022, 10 commits
  3. 17 May 2022, 1 commit
    • KVM: arm64: Fix hypercall bitmap writeback when vcpus have already run · 528ada28
      Authored by Marc Zyngier
      We generally want to disallow hypercall bitmaps being changed
      once vcpus have already run. But we must allow the write if
      the written value is unchanged so that userspace can rewrite
      the register file on reboot, for example.
      
      Without this, a QEMU-based VM will fail to reboot correctly.
      
      The original code was correct, and it is me that introduced
      the regression.
      
      Fixes: 05714cab ("KVM: arm64: Setup a framework for hypercall bitmap firmware registers")
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      528ada28
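      A hedged sketch of the intended semantics; the helper name, flag and
      locking below are illustrative assumptions based on the description
      above, not the upstream code:

        static int kvm_arm_set_fw_reg_bmap(struct kvm *kvm, unsigned long *bmap,
                                           u64 val)
        {
                int ret = 0;

                mutex_lock(&kvm->lock);

                /*
                 * Once any vCPU has run, only writes that leave the bitmap
                 * unchanged are accepted, so userspace can blindly restore
                 * the register file across a reboot without being rejected.
                 */
                if (test_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags)) {
                        if (val != *bmap)
                                ret = -EBUSY;
                        goto out;
                }

                *bmap = val;
        out:
                mutex_unlock(&kvm->lock);
                return ret;
        }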
  4. 16 May 2022, 5 commits
  5. 15 May 2022, 5 commits
  6. 13 May 2022, 1 commit
  7. 12 May 2022, 14 commits
    • KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmaps · 6ba1e04f
      Authored by Vipin Sharma
      Avoid calling handlers on empty rmap entries and skip to the next
      non-empty rmap entry.

      Empty rmap entries are a no-op in handlers.
      Signed-off-by: Vipin Sharma <vipinsh@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220502220347.174664-1-vipinsh@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ba1e04f
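      A sketch of the iterator change (slot_rmap_walk_next() in mmu.c),
      simplified from the commit's description; treat the exact field names
      as assumptions:

        static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
        {
                while (++iterator->rmap <= iterator->end_rmap) {
                        iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level));

                        /* Only stop on rmaps that actually contain PTEs. */
                        if (iterator->rmap->val)
                                return;
                }

                if (++iterator->level > iterator->end_level) {
                        iterator->rmap = NULL;
                        return;
                }

                rmap_walk_init_level(iterator, iterator->level);
        }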
    • KVM: VMX: Include MKTME KeyID bits in shadow_zero_check · 3c5c3245
      Authored by Kai Huang
      Intel MKTME KeyID bits (including Intel TDX private KeyID bits) must
      never be set in an SPTE.  Set shadow_me_value to 0 and shadow_me_mask
      to all MKTME KeyID bits so that those bits end up in shadow_zero_check.
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3c5c3245
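      A sketch of the VMX-side setup this implies; the KeyID-range derivation
      from the physical address width is an assumption based on how MKTME
      reports KeyID bits, not a verified copy of the patch:

        static void vmx_setup_me_spte_mask(void)
        {
                u64 me_mask = 0;

                /*
                 * boot_cpu_data.x86_phys_bits has the MKTME KeyID bits already
                 * carved out, while the shadow physical bits reflect the full
                 * address width, so the delta is exactly the KeyID bit range.
                 */
                if (boot_cpu_data.x86_phys_bits != kvm_get_shadow_phys_bits())
                        me_mask = rsvd_bits(boot_cpu_data.x86_phys_bits,
                                            kvm_get_shadow_phys_bits() - 1);

                /* VMX sets no encryption bit, but all KeyID bits must be zero. */
                kvm_mmu_set_me_spte_mask(0, me_mask);
        }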
    • KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask · e54f1ff2
      Authored by Kai Huang
      Intel Multi-Key Total Memory Encryption (MKTME) repurposes a couple of
      high physical address bits as 'KeyID' bits.  Intel Trust Domain
      Extensions (TDX) further steals part of the MKTME KeyID bits as TDX
      private KeyID bits.  TDX private KeyID bits cannot be set in any mapping
      in the host kernel since they can only be accessed by software running
      inside a new CPU isolated mode.  And unlike AMD's SME, the host kernel
      doesn't set any legacy MKTME KeyID bits in any mapping either.
      Therefore, it's not legitimate for KVM to set any KeyID bits in an SPTE
      that maps guest memory.
      
      KVM maintains shadow_zero_check bits to represent which bits must be
      zero in an SPTE that maps guest memory.  MKTME KeyID bits should be set
      in shadow_zero_check.  Currently, shadow_me_mask is used by AMD to set
      sme_me_mask in the SPTE, and shadow_me_mask is excluded from
      shadow_zero_check.  So initializing shadow_me_mask to represent all
      MKTME KeyID bits doesn't work for VMX (where, on the contrary, those
      bits must be set in shadow_zero_check).
      
      Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
      and repurpose shadow_me_mask as 'all possible memory encryption bits'.
      The new schematic of them will be:
      
       - shadow_me_value: the memory encryption bit(s) that will be set to the
         SPTE (the original shadow_me_mask).
       - shadow_me_mask: all possible memory encryption bits (which is a super
         set of shadow_me_value).
       - For now, shadow_me_value is supposed to be set by SVM and VMX
         respectively, and it is a constant during KVM's life time.  This
         perhaps doesn't fit MKTME but for now host kernel doesn't support it
         (and perhaps will never do).
       - Bits in shadow_me_mask are set to shadow_zero_check, except the bits
         in shadow_me_value.
      
      Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
      Replace shadow_me_mask with shadow_me_value in almost all code paths,
      except the one in PT64_PERM_MASK, which is used by need_remote_flush()
      to determine whether remote TLB flush is needed.  This should still use
      shadow_me_mask as any encryption bit change should need a TLB flush.
      And for AMD, move initializing shadow_me_value/shadow_me_mask from
      kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e54f1ff2
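      A minimal sketch of the new helper; the subset sanity check and how it
      reacts to a violation are assumptions, only the two assignments follow
      directly from the description above:

        void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
        {
                /* shadow_me_value must be a subset of shadow_me_mask. */
                if (WARN_ON(me_value & ~me_mask))
                        me_value = me_mask = 0;

                shadow_me_value = me_value;
                shadow_me_mask = me_mask;
        }

      SVM would then pass sme_me_mask for both arguments from
      svm_hardware_setup(), while VMX passes a zero value with the full KeyID
      mask, and shadow_zero_check picks up shadow_me_mask & ~shadow_me_value.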
    • KVM: x86/mmu: Rename reset_rsvds_bits_mask() · c919e881
      Authored by Kai Huang
      Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make
      it clearer that it resets the reserved bits check for guest's page table
      entries.
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c919e881
    • KVM: x86/mmu: Expand and clean up page fault stats · 1075d41e
      Authored by Sean Christopherson
      Expand and clean up the page fault stats.  The current stats are at best
      incomplete, and at worst misleading.  Differentiate between faults that
      are actually fixed vs those that result in an MMIO SPTE being created,
      track faults that are spurious, faults that trigger emulation, faults
      that are fixed in the fast path, and last but not least, track the
      number of faults that are taken.
      
      Note, the number of faults that require emulation for write-protected
      shadow pages can roughly be calculated by subtracting the number of MMIO
      SPTEs created from the overall number of faults that trigger emulation.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1075d41e
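      A hedged sketch of the resulting per-vCPU counters; the field names are
      inferred from the description above and the exact set is an assumption:

        struct kvm_vcpu_stat {
                /* ... existing stats ... */
                u64 pf_taken;             /* every fault handled by KVM        */
                u64 pf_fixed;             /* faults fixed by installing a SPTE */
                u64 pf_emulate;           /* faults that trigger emulation     */
                u64 pf_spurious;          /* already fixed by another vCPU     */
                u64 pf_fast;              /* fixed in the lockless fast path   */
                u64 pf_mmio_spte_created; /* faults answered with an MMIO SPTE */
        };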
    • KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults · 8d5265b1
      Authored by Sean Christopherson
      Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast
      path for TDP page faults.  The generated code is identical, and the #ifdef
      makes it dangerously difficult to extend the logic (guess who forgot to
      add an "else" inside the #ifdef and ran through the page fault handler
      twice).
      
      No functional or binary change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8d5265b1
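      A sketch of the pattern in a kvm_mmu_do_page_fault()-style dispatcher;
      the struct initialization and handler names are simplified assumptions:

        static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu,
                                                gpa_t cr2_or_gpa, u64 err,
                                                bool prefetch)
        {
                struct kvm_page_fault fault = {
                        .addr = cr2_or_gpa,
                        .error_code = err,
                        .prefetch = prefetch,
                        .is_tdp = likely(vcpu->arch.mmu->page_fault ==
                                         kvm_tdp_page_fault),
                };

                /*
                 * IS_ENABLED() compiles the branch away exactly like the old
                 * #ifdef, but keeps both arms visible, so a missing "else"
                 * shows up as obviously duplicated work instead of silently
                 * running the fault handler twice.
                 */
                if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
                        return kvm_tdp_page_fault(vcpu, &fault);

                return vcpu->arch.mmu->page_fault(vcpu, &fault);
        }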
    • KVM: x86/mmu: Make all page fault handlers internal to the MMU · 8a009d5b
      Authored by Sean Christopherson
      Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
      of the page fault handling collateral that was in mmu.h purely for the
      async #PF handler into mmu_internal.h, where it belongs.  This will allow
      kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
      expose those enums outside of the MMU.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8a009d5b
    • KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" · 5276c616
      Authored by Sean Christopherson
      Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
      kvm_faultin_pfn() to signal that the page fault handler should continue
      doing its thing.  Aside from being gross and inefficient, using a boolean
      return to signal continue vs. stop makes it extremely difficult to add
      more helpers and/or move existing code to a helper.
      
      E.g. hypothetically, if nested MMUs were to gain a separate page fault
      handler in the future, everything up to the "is self-modifying PTE" check
      can be shared by all shadow MMUs, but communicating up the stack whether
      to continue on or stop becomes a nightmare.
      
      More concretely, proposed support for private guest memory ran into a
      similar issue, where it'll be forced to forego a helper in order to yield
      sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.
      
      No functional change intended.
      
      Cc: David Matlack <dmatlack@google.com>
      Cc: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5276c616
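      A sketch of the calling convention after the change; the fault handler
      body is heavily abridged and the helper signatures are assumptions:

        enum {
                RET_PF_CONTINUE = 0,  /* keep going, fault not yet resolved */
                RET_PF_RETRY,
                RET_PF_EMULATE,
                RET_PF_INVALID,
                RET_PF_FIXED,
                RET_PF_SPURIOUS,
        };

        static int direct_page_fault(struct kvm_vcpu *vcpu,
                                     struct kvm_page_fault *fault)
        {
                int r;

                r = kvm_faultin_pfn(vcpu, fault);
                if (r != RET_PF_CONTINUE)
                        return r;  /* one value says both "stop" and "why" */

                r = handle_abnormal_pfn(vcpu, fault, ACC_ALL);
                if (r != RET_PF_CONTINUE)
                        return r;

                /* ... install the SPTE under mmu_lock ... */
                return RET_PF_FIXED;
        }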
    • KVM: x86/mmu: Drop exec/NX check from "page fault can be fast" · 5c64aba5
      Authored by Sean Christopherson
      Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
      faults in the access tracking case, and drop the exec/NX check that
      becomes redundant as a result.  No sane hardware will generate an access
      that is both an instruction fetch and a write, i.e. it's a waste of cycles.
      If hardware goes off the rails, or KVM runs under a misguided hypervisor,
      spuriously running through the fast path is benign (KVM has unknowingly
      been doing exactly that for years).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5c64aba5
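      The resulting predicate, as a simplified sketch of
      page_fault_can_be_fast() after this change and the next one below:

        static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
        {
                if (fault->rsvd)
                        return false;

                /*
                 * A !PRESENT fault can only be handled fast if it might be an
                 * access-tracked SPTE, i.e. only when A/D bits are disabled.
                 */
                if (!fault->present)
                        return !kvm_ad_enabled();

                /*
                 * Present faults are fixable fast only for write-protection;
                 * the old exec/NX special case is gone, since no sane access
                 * is both an instruction fetch and a write.
                 */
                return fault->write;
        }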
    • KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use · 54275f74
      Authored by Sean Christopherson
      Check for A/D bits being disabled instead of the access tracking mask
      being non-zero when deciding whether or not to attempt to fix a page
      fault via the fast path.  Originally, the access tracking mask was
      non-zero if and only if A/D bits were disabled by _KVM_ (including not
      being supported by hardware), but that hasn't been true since nVMX was
      fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
      KVM to not use A/D bits while running L2 despite KVM using them while
      running L1.
      
      In other words, don't attempt the fast path just because EPT is enabled.
      
      Note, attempting the fast path for all !PRESENT faults can "fix" a very,
      _VERY_ tiny percentage of faults out of mmu_lock by detecting that the
      fault is spurious, i.e. has been fixed by a different vCPU, but again the
      odds of that happening are vanishingly small.  E.g. booting an 8-vCPU VM
      gets less than 10 successes out of 30k+ faults, and that's likely one of
      the more favorable scenarios.  Disabling dirty logging can likely lead to
      a rash of collisions between vCPUs for some workloads that operate on a
      common set of pages, but penalizing _all_ !PRESENT faults for that one
      case is unlikely to be a net positive, not to mention that that problem
      is best solved by not zapping in the first place.
      
      The number of spurious faults does scale with the number of vCPUs, e.g. a
      255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
      path (again out of 30k), but that's all of 0.2% of faults.  Using legacy
      shadow paging does get more spurious faults, and a few more detected out
      of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
      faults that are reflected into the guest), i.e. the extra detections are
      purely due to the sheer number of faults observed.
      
      On the other hand, getting a "negative" in the fast path takes in the
      neighborhood of 150-250 cycles.  So while it is tempting to keep/extend
      the current behavior, such a change needs to come with hard numbers
      showing that it's actually a win in the grand scheme, or any scheme for
      that matter.
      
      Fixes: 995f00a6 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54275f74
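      The "A/D bits actually in use" test that replaces the access-tracking
      mask check is tiny; a sketch, assuming the spte.h-style helper:

        /* A/D bits are enabled iff a non-zero accessed mask is in use. */
        static inline bool kvm_ad_enabled(void)
        {
                return !!shadow_accessed_mask;
        }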
    • KVM: VMX: clean up pi_wakeup_handler · 91ab933f
      Authored by Li RongQing
      Passing per_cpu() to list_for_each_entry() causes the macro to be
      evaluated N+1 times for N sleeping vCPUs.  This is a very small
      inefficiency, and the code is cleaner if the address of the per-CPU
      variable is loaded earlier.  Do this for both the list and the spinlock.
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Message-Id: <1649244302-6777-1-git-send-email-lirongqing@baidu.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      91ab933f
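      A sketch of the cleaned-up handler (vmx/posted_intr.c); the per-CPU
      variable names and the wakeup condition are assumptions based on the
      description above:

        void pi_wakeup_handler(void)
        {
                int cpu = smp_processor_id();
                struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
                raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
                struct vcpu_vmx *vmx;

                /* Resolve the per-CPU addresses once, not on every iteration. */
                raw_spin_lock(spinlock);
                list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
                        if (pi_test_on(&vmx->pi_desc))
                                kvm_vcpu_wake_up(&vmx->vcpu);
                }
                raw_spin_unlock(spinlock);
        }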
    • KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicness · 33fbe6be
      Authored by Maxim Levitsky
      This shows up as a TDP MMU leak when running nested.  The non-working
      cmpxchg on L0 makes L1 install two different shadow pages under the same
      SPTE, and one of them is leaked.
      
      Fixes: 1c2361f6 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220512101420.306759-1-mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      33fbe6be
    • arm64: Enable repeat tlbi workaround on KRYO4XX gold CPUs · 51f559d6
      Authored by Shreyas K K
      Add KRYO4XX gold/big cores to the list of CPUs that need the
      repeat TLBI workaround. Apply this to the affected
      KRYO4XX cores (rcpe to rfpe).
      
      The variant and revision bits are implementation defined and are
      different from those of the Cortex CPU counterparts on which these
      cores are based, i.e., (r0p0 to r3p0) is equivalent to (rcpe to rfpe).
      Signed-off-by: Shreyas K K <quic_shrekk@quicinc.com>
      Reviewed-by: Sai Prakash Ranjan <quic_saipraka@quicinc.com>
      Link: https://lore.kernel.org/r/20220512110134.12179-1-quic_shrekk@quicinc.com
      Signed-off-by: Will Deacon <will@kernel.org>
      51f559d6
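      The kind of entry this adds to the repeat-TLBI list in
      arch/arm64/kernel/cpu_errata.c, as a sketch; the exact MIDR macro
      arguments encoding rcpe..rfpe are an assumption:

        #ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
                {
                        /* Kryo4xx Gold (rcpe to rfpe) => (r0p0 to r3p0) */
                        ERRATA_MIDR_RANGE(MIDR_QCOM_KRYO_4XX_GOLD, 0xc, 0xe, 0xf, 0xe),
                },
        #endif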