Commits · 2f23a7c914317ac0b2a7e2bbe48dc00213652f98 · openeuler / Kernel

27 Aug, 2022 2 commits

perf/x86/intel/uncore: Fix broken read_counter() for SNB IMC PMU · 11745ecf

Committed by Stephane Eranian on Aug 03, 2022

Existing code was generating bogus counts for the SNB IMC bandwidth counters:

$ perf stat -a -I 1000 -e uncore_imc/data_reads/,uncore_imc/data_writes/
     1.000327813           1,024.03 MiB  uncore_imc/data_reads/
     1.000327813              20.73 MiB  uncore_imc/data_writes/
     2.000580153         261,120.00 MiB  uncore_imc/data_reads/
     2.000580153              23.28 MiB  uncore_imc/data_writes/

The problem was introduced by commit:
  07ce734d ("perf/x86/intel/uncore: Clean up client IMC")

Where the read_counter callback was replaced to point to the generic
uncore_mmio_read_counter() function.

The SNB IMC counters are free-running 32-bit counters laid out contiguously in
MMIO. But uncore_mmio_read_counter() uses a readq() call to read from
MMIO and therefore reads 64 bits at a time. Although this is okay for the
uncore_perf_event_update() function because it is shifting the value based
on the actual counter width to compute a delta, it is not okay for the
uncore_pmu_event_start() which is simply reading the counter and therefore
priming the event->prev_count with a bogus value which is responsible for
causing bogus deltas in the perf stat command above.
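
The width mismatch can be illustrated outside the kernel. The following is a minimal Python sketch, not kernel code: `make_mmio`, `read32` and `read64` are hypothetical stand-ins for the MMIO window, readl() and readq(), and the counter values are made up. It only shows that a 64-bit read at the offset of a 32-bit counter also picks up the neighbouring counter in its upper half.

```python
import struct

def make_mmio(data_reads, data_writes):
    """Hypothetical MMIO window: two adjacent 32-bit counters, little-endian."""
    return struct.pack("<II", data_reads, data_writes)

def read32(mmio, offset):
    """Stand-in for readl(): fetch exactly one 32-bit counter."""
    return struct.unpack_from("<I", mmio, offset)[0]

def read64(mmio, offset):
    """Stand-in for readq(): fetch 64 bits, spilling into the next counter."""
    return struct.unpack_from("<Q", mmio, offset)[0]

mmio = make_mmio(0x1000, 0x2000)
narrow = read32(mmio, 0)  # only the data_reads counter
wide = read64(mmio, 0)    # data_writes ends up in the upper 32 bits
print(hex(narrow), hex(wide))
```

A value such as `wide`, if stored unmasked as the starting count, no longer describes the counter it was meant to describe, which is the situation the commit message attributes to uncore_pmu_event_start().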

The fix is to reintroduce the custom callback for read_counter for the SNB
IMC PMU and use readl() instead of readq(). With the change the output of
perf stat is back to normal:
$ perf stat -a -I 1000 -e uncore_imc/data_reads/,uncore_imc/data_writes/
     1.000120987             296.94 MiB  uncore_imc/data_reads/
     1.000120987             138.42 MiB  uncore_imc/data_writes/
     2.000403144             175.91 MiB  uncore_imc/data_reads/
     2.000403144              68.50 MiB  uncore_imc/data_writes/

Fixes: 07ce734d ("perf/x86/intel/uncore: Clean up client IMC")
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20220803160031.1379788-1-eranian@google.com

wait_on_bit: add an acquire memory barrier · 8238b457

Committed by Mikulas Patocka on Aug 26, 2022

There are several places in the kernel where wait_on_bit is not followed
by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read).

On architectures with weak memory ordering, it may happen that memory
accesses that follow wait_on_bit are reordered before wait_on_bit and
they may return invalid data.

Fix this class of bugs by introducing a new function "test_bit_acquire"
that works like test_bit, but has acquire memory ordering semantics.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

25 Aug, 2022 2 commits

x86/sev: Mark snp_abort() noreturn · c93c296f

Committed by Borislav Petkov on Aug 24, 2022

Mark both the function prototype and definition as noreturn in order to
prevent the compiler from doing transformations which confuse objtool
like so:

  vmlinux.o: warning: objtool: sme_enable+0x71: unreachable instruction

This triggers with gcc-12.

Add it and sev_es_terminate() to the objtool noreturn tracking array
too. Sort it while at it.
Suggested-by: Michael Matz <matz@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220824152420.20547-1-bp@alien8.de

xen: x86: remove setting the obsolete config XEN_MAX_DOMAIN_MEMORY · ab0af755

Committed by Lukas Bulwahn on Aug 17, 2022

Commit c70727a5 ("xen: allow more than 512 GB of RAM for 64 bit
pv-domains") from July 2015 replaced the config XEN_MAX_DOMAIN_MEMORY with
a new config XEN_512GB, but failed to adjust arch/x86/configs/xen.config.
As XEN_512GB defaults to yes, there is no need to explicitly set any config
in xen.config.

Just remove setting the obsolete config XEN_MAX_DOMAIN_MEMORY.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20220817044333.22310-1-lukas.bulwahn@gmail.com
Signed-off-by: Juergen Gross <jgross@suse.com>

24 Aug, 2022 3 commits

x86/sev: Don't use cc_platform_has() for early SEV-SNP calls · cdaa0a40

Committed by Tom Lendacky on Aug 23, 2022

When running identity-mapped and depending on the kernel configuration,
it is possible that the compiler uses jump tables when generating code
for cc_platform_has().

This causes a boot failure because the jump table uses un-mapped kernel
virtual addresses, not identity-mapped addresses. This has been seen
with CONFIG_RETPOLINE=n.

Similar to sme_encrypt_kernel(), use an open-coded direct check for the
status of SNP rather than trying to eliminate the jump table. This
preserves any code optimization in cc_platform_has() that can be useful
post boot. It also limits the changes to SEV-specific files so that
future compiler features won't necessarily require possible build changes
just because they are not compatible with running identity-mapped.

  [ bp: Massage commit message. ]

Fixes: 5e5ccff6 ("x86/sev: Add helper for validating pages in early enc attribute changes")
Reported-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 5.19.x
Link: https://lore.kernel.org/all/YqfabnTRxFSM+LoX@google.com/

x86/boot: Don't propagate uninitialized boot_params->cc_blob_address · 4b1c7424

Committed by Michael Roth on Aug 23, 2022

In some cases, bootloaders will leave boot_params->cc_blob_address
uninitialized rather than zeroing it out. This field is only meant to be
set by the boot/compressed kernel in order to pass information to the
uncompressed kernel when SEV-SNP support is enabled.

Therefore, there are no cases where the bootloader-provided values
should be treated as anything other than garbage. Otherwise, the
uncompressed kernel may attempt to access this bogus address, leading to
a crash during early boot.

Normally, sanitize_boot_params() would be used to clear out such fields
but that happens too late: sev_enable() may have already initialized
it to a valid value that should not be zeroed out. Instead, have
sev_enable() zero it out unconditionally beforehand.

Also ensure this happens for !CONFIG_AMD_MEM_ENCRYPT as well by also
including this handling in the sev_enable() stub function.

  [ bp: Massage commit message and comments. ]

Fixes: b190a043 ("x86/sev: Add SEV-SNP feature detection/setup")
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Reported-by: watnuss@gmx.de
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216387
Link: https://lore.kernel.org/r/20220823160734.89036-1-michael.roth@amd.com

x86/cpu: Add new Raptor Lake CPU model number · ea902bcc

Committed by Tony Luck on Aug 23, 2022

Note1: Model 0xB7 already claimed the "no suffix" #define for a regular
client part, so add (yet another) suffix "S" to distinguish this new
part from the earlier one.

Note2: the RAPTORLAKE* and ALDERLAKE* processors are very similar from a
software enabling point of view.  There are no known features that have
model-specific enabling and also differ between the two.  In other words,
every single place that list *one* or more RAPTORLAKE* or ALDERLAKE*
processors should list all of them.

Note3: This is being merged before there is an in-tree user.  Merging
this provides an "anchor" so that the different folks can update their
subsystems (like perf) in parallel to use this define and test it.

[ dhansen: add a note about why this has no in-tree users yet ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220823174819.223941-1-tony.luck@intel.com

22 Aug, 2022 1 commit

asm goto: eradicate CC_HAS_ASM_GOTO · a0a12c3e

Committed by Nick Desaulniers on Aug 19, 2022

GCC has supported asm goto since 4.5, and Clang has since version 9.0.0.
The minimum supported versions of these tools for the build according to
Documentation/process/changes.rst are 5.1 and 11.0.0 respectively.

Remove the feature detection script, Kconfig option, and clean up some
fallback code that is no longer supported.

The removed script was also testing for a GCC specific bug that was
fixed in the 4.7 release.

Also remove workarounds for bpftrace using clang older than 9.0.0, since
other BPF backend fixes are required at this point.

Link: https://lore.kernel.org/lkml/CAK7LNATSr=BXKfkdW8f-H5VT_w=xBpT2ZQcZ7rm6JfkdE+QnmA@mail.gmail.com/
Link: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48637
Acked-by: Borislav Petkov <bp@suse.de>
Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

21 Aug, 2022 1 commit

x86/unwind/orc: Unwind ftrace trampolines with correct ORC entry · fc2e426b

Committed by Chen Zhongjin on Aug 19, 2022

When meeting ftrace trampolines in ORC unwinding, the unwinder uses the
address of ftrace_{regs_}call to find the ORC entry, which gets the next
frame at sp+176.

If there is an IRQ hitting at sub $0xa8,%rsp, the next frame should be
sp+8 instead of 176. This makes the unwinder skip the correct frame and throw
warnings such as "wrong direction" or "can't access registers", etc,
depending on the content of the incorrect frame address.

By adding the base address ftrace_{regs_}caller with the offset
*ip - ops->trampoline*, we can get the correct address to find the ORC entry.
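
The address translation described above is plain offset arithmetic. A minimal Python sketch, assuming made-up addresses and hypothetical parameter names (`trampoline_start` for ops->trampoline, `template_base` for the address of ftrace_{regs_}caller):

```python
def orc_lookup_addr(ip, trampoline_start, template_base):
    """Map an ip inside a dynamically allocated trampoline onto the
    corresponding address inside the static template it was copied from."""
    offset = ip - trampoline_start  # how far into the trampoline we are
    return template_base + offset   # same position inside the template

# Hypothetical addresses, chosen only for illustration.
TRAMPOLINE = 0xFFFFFFFFC0001000
TEMPLATE = 0xFFFFFFFF81000100

# An interrupt that hits 4 bytes into the trampoline is looked up
# 4 bytes into the template, not at one fixed call site.
print(hex(orc_lookup_addr(TRAMPOLINE + 4, TRAMPOLINE, TEMPLATE)))
```

Looking up the ORC entry at the translated address lets the entry vary with the position inside the trampoline, which is what the fix relies on.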

Also change "caller" to "tramp_addr" to make variable name conform to
its content.

[ mingo: Clarified the changelog a bit. ]

Fixes: 6be7fa3c ("ftrace, orc, x86: Handle ftrace dynamically allocated trampolines")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220819084334.244016-1-chenzhongjin@huawei.com

20 Aug, 2022 4 commits

perf/x86/intel: Fix pebs event constraints for ADL · cde643ff

Committed by Kan Liang on Aug 18, 2022

According to the latest event list, the LOAD_LATENCY PEBS event only
works on the GP counter 0 and 1 for ADL and RPL.

Update the pebs event constraints table.

Fixes: f83d2f91 ("perf/x86/intel: Add Alder Lake Hybrid support")
Reported-by: Ammy Yi <ammy.yi@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220818184429.2355857-1-kan.liang@linux.intel.com

perf/x86/intel/ds: Fix precise store latency handling · d4bdb0be

Committed by Stephane Eranian on Aug 17, 2022

With the existing code in store_latency_data(), the memory operation (mem_op)
returned to the user is always OP_LOAD where in fact, it should be OP_STORE.
This comes from the fact that the function is simply grabbing the information
from a data source map which covers only load accesses. Intel 12th gen CPU
offers precise store sampling that captures both the data source and latency.
Therefore it can use the data source mapping table but must override the
memory operation to reflect stores instead of loads.
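
The "reuse the load table, then override the operation" idea can be sketched as follows. This is an illustrative Python model only: the table contents, the `level` field and the function names are hypothetical and do not mirror the kernel's data structures.

```python
OP_LOAD = "LOAD"
OP_STORE = "STORE"

# Hypothetical data source map that was written with loads in mind.
LOAD_DATA_SOURCE = {
    0x1: {"mem_op": OP_LOAD, "level": "L1"},
    0x2: {"mem_op": OP_LOAD, "level": "L2"},
}

def load_latency_data(dse):
    """Decode a data source encoding using the load-only table."""
    return dict(LOAD_DATA_SOURCE[dse])  # copy so callers may modify it

def store_latency_data(dse):
    """Reuse the load table for the data source, then override the
    memory operation so that the sample is reported as a store."""
    info = load_latency_data(dse)
    info["mem_op"] = OP_STORE
    return info

print(store_latency_data(0x1))
```

The design point is that the data source information is shared, while the memory operation is the one field that must differ for store samples.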

Fixes: 61b985e3 ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids")
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220818054613.1548130-1-eranian@google.com

perf/x86/core: Set pebs_capable and PMU_FL_PEBS_ALL for the Baseline · 7d359886

Committed by Peter Zijlstra on Aug 16, 2022

The SDM explicitly states that PEBS Baseline implies Extended PEBS.
For cpu model forward compatibility (e.g. on ICX, SPR, ADL), it's
safe to stop doing FMS table thing such as setting pebs_capable and
PMU_FL_PEBS_ALL since it's already set in the intel_ds_init().

The Goldmont Plus is the only platform which supports extended PEBS
but doesn't have Baseline. Keep the status quo.
Reported-by: Like Xu <likexu@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lkml.kernel.org/r/20220816114057.51307-1-likexu@tencent.com

perf/x86/lbr: Enable the branch type for the Arch LBR by default · 32ba156d

Committed by Kan Liang on Aug 16, 2022

On the platform with Arch LBR, the HW raw branch type encoding may leak
to the perf tool when the SAVE_TYPE option is not set.

In the intel_pmu_store_lbr(), the HW raw branch type is stored in
lbr_entries[].type. If the SAVE_TYPE option is set, the
lbr_entries[].type will be converted into the generic PERF_BR_* type
in the intel_pmu_lbr_filter() and exposed to the user tools.
But if the SAVE_TYPE option is NOT set by the user, the current perf
kernel doesn't clear the field. The HW raw branch type leaks.

There are two solutions to fix the issue for the Arch LBR.
One is to clear the field if the SAVE_TYPE option is NOT set.
The other solution is to unconditionally convert the branch type and
expose the generic type to the user tools.

The latter is implemented here, because
- The branch type is valuable information. I don't see a case where
  you would not benefit from the branch type. (Stephane Eranian)
- Not having the branch type DOES NOT save any space in the
  branch record (Stephane Eranian)
- The Arch LBR HW can retrieve the common branch types from the
  LBR_INFO. It doesn't require the high overhead SW disassemble.

Fixes: 47125db2 ("perf/x86/intel/lbr: Support Architectural LBR")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220816125612.2042397-1-kan.liang@linux.intel.com

19 Aug, 2022 8 commits

x86/mm: Use proper mask when setting PUD mapping · 88e0a749

Committed by Aaron Lu on Aug 19, 2022

Commit c164fbb4 ("x86/mm: thread pgprot_t through
init_memory_mapping()") mistakenly used __pgprot() which doesn't respect
__default_kernel_pte_mask when setting PUD mapping.

Fix it by only setting the one bit we actually need (PSE) and leaving
the other bits (that have been properly masked) alone.
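
The difference between OR-ing in a full flag set and OR-ing in only the PSE bit can be shown with plain bit arithmetic. The Python sketch below is a simplified illustration, not the kernel code: `KERNEL_LARGE` and `DEFAULT_MASK` are hypothetical stand-ins, and the assumption is a configuration in which the default mask clears the global bit.

```python
# x86 page flag bits (PSE is bit 7, GLOBAL is bit 8).
PAGE_PRESENT = 1 << 0
PAGE_RW = 1 << 1
PAGE_PSE = 1 << 7
PAGE_GLOBAL = 1 << 8

# Hypothetical "large kernel page" flag set that includes GLOBAL.
KERNEL_LARGE = PAGE_PRESENT | PAGE_RW | PAGE_GLOBAL | PAGE_PSE

# Assumed default mask for this sketch: everything except GLOBAL.
DEFAULT_MASK = ~PAGE_GLOBAL

def buggy_pud_prot(masked_prot):
    """OR in a whole unmasked flag set: the masked-off bits come back."""
    return masked_prot | KERNEL_LARGE

def fixed_pud_prot(masked_prot):
    """OR in only the one bit that is needed and leave the rest alone."""
    return masked_prot | PAGE_PSE

masked = (PAGE_PRESENT | PAGE_RW | PAGE_GLOBAL) & DEFAULT_MASK
print(bool(buggy_pud_prot(masked) & PAGE_GLOBAL))  # GLOBAL reappears
print(bool(fixed_pud_prot(masked) & PAGE_GLOBAL))  # GLOBAL stays cleared
```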

Fixes: c164fbb4 ("x86/mm: thread pgprot_t through init_memory_mapping()")
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

x86/nospec: Fix i386 RSB stuffing · 33292497

Committed by Peter Zijlstra on Aug 19, 2022

Turns out that i386 doesn't unconditionally have LFENCE, as such the
loop in __FILL_RETURN_BUFFER isn't actually speculation safe on such
chips.

Fixes: ba6e31af ("x86/speculation: Add LFENCE to RSB fill sequence")
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Yv9tj9vbQ9nNlXoY@worktop.programming.kicks-ass.net

x86/nospec: Unwreck the RSB stuffing · 4e3aa923

Committed by Peter Zijlstra on Aug 16, 2022

Commit 2b129932 ("x86/speculation: Add RSB VM Exit protections")
made a right mess of the RSB stuffing, rewrite the whole thing to not
suck.

Thanks to Andrew for the enlightening comment about Post-Barrier RSB
things so we can make this code less magical.

Cc: stable@vger.kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YvuNdDWoUZSBjYcm@worktop.programming.kicks-ass.net

x86/kvm: Fix "missing ENDBR" BUG for fastop functions · 3d9606b0

Committed by Josh Poimboeuf on Aug 18, 2022

The following BUG was reported:

  traps: Missing ENDBR: andw_ax_dx+0x0/0x10 [kvm]
  ------------[ cut here ]------------
  kernel BUG at arch/x86/kernel/traps.c:253!
  invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
   <TASK>
   asm_exc_control_protection+0x2b/0x30
  RIP: 0010:andw_ax_dx+0x0/0x10 [kvm]
  Code: c3 cc cc cc cc 0f 1f 44 00 00 66 0f 1f 00 48 19 d0 c3 cc cc cc
        cc 0f 1f 40 00 f3 0f 1e fa 20 d0 c3 cc cc cc cc 0f 1f 44 00 00
        <66> 0f 1f 00 66 21 d0 c3 cc cc cc cc 0f 1f 40 00 66 0f 1f 00 21
        d0

   ? andb_al_dl+0x10/0x10 [kvm]
   ? fastop+0x5d/0xa0 [kvm]
   x86_emulate_insn+0x822/0x1060 [kvm]
   x86_emulate_instruction+0x46f/0x750 [kvm]
   complete_emulated_mmio+0x216/0x2c0 [kvm]
   kvm_arch_vcpu_ioctl_run+0x604/0x650 [kvm]
   kvm_vcpu_ioctl+0x2f4/0x6b0 [kvm]
   ? wake_up_q+0xa0/0xa0

The BUG occurred because the ENDBR in the andw_ax_dx() fastop function
had been incorrectly "sealed" (converted to a NOP) by apply_ibt_endbr().

Objtool marked it to be sealed because KVM has no compile-time
references to the function.  Instead KVM calculates its address at
runtime.

Prevent objtool from annotating fastop functions as sealable by creating
throwaway dummy compile-time references to the functions.

Fixes: 6649fa87 ("x86/ibt,kvm: Add ENDBR to fastops")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Debugged-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Message-Id: <0d4116f90e9d0c1b754bb90c585e6f0415a1c508.1660837839.git.jpoimboe@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

x86/kvm: Simplify FOP_SETCC() · 22472d12

Committed by Josh Poimboeuf on Aug 18, 2022

SETCC_ALIGN and FOP_ALIGN are both 16. Remove the special casing for
FOP_SETCC() and just make it a normal fastop.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Message-Id: <7c13d94d1a775156f7e36eed30509b274a229140.1660837839.git.jpoimboe@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

x86/ibt, objtool: Add IBT_NOSEAL() · e27e5bea

Committed by Josh Poimboeuf on Aug 18, 2022

Add a macro which prevents a function from getting sealed if there are
no compile-time references to it.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Message-Id: <20220818213927.e44fmxkoq4yj6ybn@treble>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: Rename mmu_notifier_* to mmu_invalidate_* · 20ec3ebd

Committed by Chao Peng on Aug 16, 2022

The motivation for this renaming is to make these variables and related
helper functions less mmu_notifier bound, so that they can also be used for
non-mmu_notifier based page invalidation. mmu_invalidate_* was chosen to
better describe the purpose of 'invalidating' a page that those
variables are used for.

  - mmu_notifier_seq/range_start/range_end are renamed to
    mmu_invalidate_seq/range_start/range_end.

  - mmu_notifier_retry{_hva} helper functions are renamed to
    mmu_invalidate_retry{_hva}.

  - mmu_notifier_count is renamed to mmu_invalidate_in_progress to
    avoid confusion with mn_active_invalidate_count.

  - While here, also update kvm_inc/dec_notifier_count() to
    kvm_mmu_invalidate_begin/end() to match the change for
    mmu_notifier_count.

No functional change intended.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS · bdd1c37a

Committed by Chao Peng on Aug 16, 2022

KVM_INTERNAL_MEM_SLOTS better reflects the fact that those slots are used
internally by KVM (invisible to userspace) and avoids confusion with future
private slots that can have a different meaning.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Message-Id: <20220816125322.1110439-2-chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

18 Aug, 2022 1 commit

x86/bugs: Add "unknown" reporting for MMIO Stale Data · 7df54884

Committed by Pawan Gupta on Aug 03, 2022

Older Intel CPUs that are not in the affected processor list for MMIO
Stale Data vulnerabilities currently report "Not affected" in sysfs,
which may not be correct. Vulnerability status for these older CPUs is
unknown.

Add known-not-affected CPUs to the whitelist. Report "unknown"
mitigation status for CPUs that are not in blacklist, whitelist and also
don't enumerate MSR ARCH_CAPABILITIES bits that reflect hardware
immunity to MMIO Stale Data vulnerabilities.
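
The three-way classification described above can be sketched as a small decision function. This is a hedged Python illustration: the model names, set contents and status strings are hypothetical, and checking the affected list first is an assumption of the sketch rather than something the commit message states.

```python
def mmio_stale_data_status(model, affected, not_affected, hw_immune):
    """Classify a CPU for reporting purposes.

    affected:     models on the affected list ("blacklist")
    not_affected: models known not to be affected ("whitelist")
    hw_immune:    True if the CPU enumerates the ARCH_CAPABILITIES
                  bits that indicate hardware immunity
    """
    if model in affected:
        return "affected"
    if model in not_affected or hw_immune:
        return "not affected"
    # On neither list and no immunity enumerated: status is unknown.
    return "unknown"

# Hypothetical model names, for illustration only.
AFFECTED = {"model_a"}
NOT_AFFECTED = {"model_b"}

print(mmio_stale_data_status("old_model", AFFECTED, NOT_AFFECTED, False))
```

Before this change, the last case would have been reported as "Not affected", which is the inaccuracy the commit addresses.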

Mitigation is not deployed when the status is unknown.

  [ bp: Massage, fixup. ]

Fixes: 8d50cdf8 ("x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data")
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/a932c154772f2121794a5f2eded1a11013114711.1657846269.git.pawan.kumar.gupta@linux.intel.com

17 Aug, 2022 1 commit

x86: simplify load_unaligned_zeropad() implementation · c4e34dd9

Committed by Linus Torvalds on Aug 14, 2022

The exception for the "unaligned access at the end of the page, next
page not mapped" never happens, but the fixup code ends up causing
trouble for compilers to optimize well.

clang in particular ends up seeing it being in the middle of a loop, and
tries desperately to optimize the exception fixup code that is never
really reached.

The simple solution is to just move all the fixups into the exception
handler itself, which moves it all out of the hot case code, and means
that the compiler never sees it or needs to worry about it.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

16 Aug, 2022 1 commit

x86/entry: Fix entry_INT80_compat for Xen PV guests · 5b9f0c4d

Committed by Juergen Gross on Aug 16, 2022

Commit

  c89191ce ("x86/entry: Convert SWAPGS to swapgs and remove the definition of SWAPGS")

missed one use case of SWAPGS in entry_INT80_compat(). Removing the
SWAPGS macro led to the assembler just using "swapgs", as it accepts
instructions in capital letters, too.

This in turn leads to splats in Xen PV guests like:

  [   36.145223] general protection fault, maybe for address 0x2d: 0000 [#1] PREEMPT SMP NOPTI
  [   36.145794] CPU: 2 PID: 1847 Comm: ld-linux.so.2 Not tainted 5.19.1-1-default #1 \
	  openSUSE Tumbleweed f3b44bfb672cdb9f235aff53b57724eba8b9411b
  [   36.146608] Hardware name: HP ProLiant ML350p Gen8, BIOS P72 11/14/2013
  [   36.148126] RIP: e030:entry_INT80_compat+0x3/0xa3

Fix that by open coding this single instance of the SWAPGS macro.

Fixes: c89191ce ("x86/entry: Convert SWAPGS to swapgs and remove the definition of SWAPGS")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Cc: <stable@vger.kernel.org> # 5.19
Link: https://lore.kernel.org/r/20220816071137.4893-1-jgross@suse.com

15 Aug, 2022 1 commit

x86/PAT: Have pat_enabled() properly reflect state when running on Xen · 72cbc8f0

Committed by Jan Beulich on Apr 28, 2022

After the commit identified in the Fixes: tag, pat_enabled() returns false
(because of PAT initialization being suppressed in the absence of MTRRs
being announced to be available).

This has become a problem: the i915 driver now fails to initialize when
running PV on Xen (i915_gem_object_pin_map() is where I located the
induced failure), and its error handling is flaky enough to (at least
sometimes) result in a hung system.

Yet even beyond that problem the keying of the use of WC mappings to
pat_enabled() (see arch_can_pci_mmap_wc()) means that in particular
graphics frame buffer accesses would have been quite a bit less optimal
than possible.

Arrange for the function to return true in such environments, without
undermining the rest of PAT MSR management logic considering PAT to be
disabled: specifically, no writes to the PAT MSR should occur.

For the new boolean to live in .init.data, init_cache_modes() also needs
moving to .init.text (where it could/should have lived already before).

  [ bp: This is the "small fix" variant for stable. It'll get replaced
    with a proper PAT and MTRR detection split upstream but that is too
    involved for a stable backport.
    - additional touchups to commit msg. Use cpu_feature_enabled(). ]

Fixes: bdd8b6c9 ("drm/i915: replace X86_FEATURE_PAT with pat_enabled()")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/9385fa60-fa5d-f559-a137-6608408f88b0@suse.com

14 Aug, 2022 1 commit

x86/kprobes: Fix JNG/JNLE emulation · 8924779d

Committed by Nadav Amit on Aug 13, 2022

When kprobes emulates JNG/JNLE instructions on x86 it uses the wrong
condition. For JNG (opcode: 0F 8E), according to Intel SDM, the jump is
performed if (ZF == 1 or SF != OF). However the kernel emulation
currently uses 'and' instead of 'or'.

As a result, setting a kprobe on JNG/JNLE might cause the kernel to
behave incorrectly whenever the kprobe is hit.

Fix by changing the 'and' to 'or'.
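
The effect of the wrong operator is easy to see on the flag combinations themselves. A minimal Python sketch of the condition (the function names are hypothetical, and this models only the flag test, not the kprobes emulation code):

```python
def jng_taken(zf, sf, of):
    """JNG/JLE: jump if ZF == 1 or SF != OF."""
    return zf == 1 or sf != of

def jng_taken_buggy(zf, sf, of):
    """The faulty variant, with 'and' in place of 'or'."""
    return zf == 1 and sf != of

def jnle_taken(zf, sf, of):
    """JNLE/JG is the complement: jump if ZF == 0 and SF == OF."""
    return not jng_taken(zf, sf, of)

# List every flag combination on which the two variants disagree.
for zf in (0, 1):
    for sf in (0, 1):
        for of in (0, 1):
            if jng_taken(zf, sf, of) != jng_taken_buggy(zf, sf, of):
                print("mismatch:", zf, sf, of)
```

For example, with ZF set and SF equal to OF (an "equal" comparison result), the correct condition takes the jump while the 'and' variant does not.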

Fixes: 6256e668 ("x86/kprobes: Use int3 instead of debug trap for single-step")
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220813225943.143767-1-namit@vmware.com

12 Aug, 2022 1 commit

x86/xen: Add support for HVMOP_set_evtchn_upcall_vector · b1c3497e

Committed by Jane Malalane on Jul 29, 2022

Implement support for the HVMOP_set_evtchn_upcall_vector hypercall in
order to set the per-vCPU event channel vector callback on Linux and
use it in preference to HVM_PARAM_CALLBACK_IRQ.

If the per-vCPU vector setup is successful on the BSP, use this method
for the APs. If not, fall back to the global vector-type callback.

Also register callback_irq at per-vCPU event channel setup to trick
the toolstack into thinking the domain is enlightened.
Suggested-by: "Roger Pau Monné" <roger.pau@citrix.com>
Signed-off-by: Jane Malalane <jane.malalane@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20220729070416.23306-1-jane.malalane@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>

11 Aug, 2022 13 commits

x86: link vdso and boot with -z noexecstack --no-warn-rwx-segments · ffcf9c57

Committed by Nick Desaulniers on Aug 10, 2022

Users of GNU ld (BFD) from binutils 2.39+ will observe multiple
instances of a new warning when linking kernels in the form:

  ld: warning: arch/x86/boot/pmjump.o: missing .note.GNU-stack section implies executable stack
  ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
  ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions

Generally, we would like to avoid the stack being executable.  Because
there could be a need for the stack to be executable, assembler sources
have to opt-in to this security feature via explicit creation of the
.note.GNU-stack feature (which compilers create by default) or command
line flag --noexecstack.  Or we can simply tell the linker the
production of such sections is irrelevant and to link the stack as
--noexecstack.

LLVM's LLD linker defaults to -z noexecstack, so this flag isn't
strictly necessary when linking with LLD, only BFD, but it doesn't hurt
to be explicit here for all linkers IMO.  --no-warn-rwx-segments is
currently BFD specific and only available in the current latest release,
so it's wrapped in an ld-option check.

While the kernel makes extensive usage of ELF sections, it doesn't use
permissions from ELF segments.

Link: https://lore.kernel.org/linux-block/3af4127a-f453-4cf7-f133-a181cce06f73@kernel.dk/
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=ba951afb99912da01a6e8434126b8fac7aa75107
Link: https://github.com/llvm/llvm-project/issues/57009
Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Suggested-by: Fangrui Song <maskray@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

KVM: VMX: Adjust number of LBR records for PERF_CAPABILITIES at refresh · 6348aafa

Committed by Sean Christopherson on Jul 27, 2022

Now that the PMU is refreshed when MSR_IA32_PERF_CAPABILITIES is written
by host userspace, zero out the number of LBR records for a vCPU during
PMU refresh if PMU_CAP_LBR_FMT is not set in PERF_CAPABILITIES instead of
handling the check at run-time.

guest_cpuid_has() is expensive due to the linear search of guest CPUID
entries, intel_pmu_lbr_is_enabled() is checked on every VM-Enter, _and_
simply enumerating the same "Model" as the host causes KVM to set the
number of LBR records to a non-zero value.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220727233424.2968356-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: VMX: Use proper type-safe functions for vCPU => LBRs helpers · 7de8e5b6

Committed by Sean Christopherson on Jul 27, 2022

Turn vcpu_to_lbr_desc() and vcpu_to_lbr_records() into functions in order
to provide type safety, to document exactly what they return, and to
allow consuming the helpers in vmx.h.  Move the definitions as necessary
(the macros "reference" to_vmx() before its definition).

No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220727233424.2968356-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: Refresh PMU after writes to MSR_IA32_PERF_CAPABILITIES · 17a024a8

Committed by Sean Christopherson on Jul 27, 2022

Refresh the PMU if userspace modifies MSR_IA32_PERF_CAPABILITIES. KVM
consumes the vCPU's PERF_CAPABILITIES when enumerating PEBS support, but
relies on CPUID updates to refresh the PMU. I.e. KVM will do the wrong
thing if userspace stuffs PERF_CAPABILITIES _after_ setting guest CPUID.

Opportunistically fix a curly-brace indentation.

Fixes: c59a1f10 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220727233424.2968356-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap gen · 8bad4606

Committed by Sean Christopherson on Aug 05, 2022

Add compile-time and init-time sanity checks to ensure that the MMIO SPTE
mask doesn't overlap the MMIO SPTE generation or the MMU-present bit.
The generation currently avoids using bit 63, but that's as much
coincidence as it is strictly necessary.  That will change in the future,
as TDX support will require setting bit 63 (SUPPRESS_VE) in the mask.

Explicitly carve out the bits that are allowed in the mask so that any
future shuffling of SPTE bits doesn't silently break MMIO caching (KVM
has broken MMIO caching more than once due to overlapping the generation
with other things).
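
The kind of check described above amounts to carving out an allowed set of bits and rejecting any mask that strays outside it. A minimal Python sketch, assuming a hypothetical bit layout (the generation and present-bit positions below are illustrative and not taken from the KVM sources):

```python
def bit_range(lo, hi):
    """Mask with bits lo..hi (inclusive) set."""
    return ((1 << (hi - lo + 1)) - 1) << lo

# Hypothetical layout for this sketch only.
GEN_MASK = bit_range(3, 10) | bit_range(52, 62)  # generation bits
PRESENT_MASK = 1 << 11                           # MMU-present bit
ALLOWED_MASK = ((1 << 64) - 1) & ~(GEN_MASK | PRESENT_MASK)

def mmio_mask_is_valid(mask):
    """True if the MMIO mask uses only explicitly allowed bits."""
    return (mask & ~ALLOWED_MASK) == 0

print(mmio_mask_is_valid(1 << 63))  # bit 63 is outside the generation here
print(mmio_mask_is_valid(1 << 5))   # overlaps the generation bits
```

Expressing the rule as an allow-list means that moving the generation bits later only requires updating `GEN_MASK`; the check itself keeps catching overlaps.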
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20220805194133.86299-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86/mmu: rename trace function name for asynchronous page fault · 1685c0f3

Committed by Mingwei Zhang on Aug 07, 2022

Rename the tracepoint function from trace_kvm_async_pf_doublefault() to
trace_kvm_async_pf_repeated_fault() to make it clear, since double fault
has nothing to do with this trace function.

Asynchronous Page Fault (APF) is an artifact generated by KVM when it
cannot find a physical page to satisfy an EPT violation. KVM uses APF to
tell the guest OS to do something else such as scheduling other guest
processes to make forward progress. However, when another guest process
also touches a previously APFed page, KVM halts the vCPU instead of
generating a repeated APF to avoid wasting cycles.

Double fault (#DF) clearly has a different meaning and a different
consequence when triggered. #DF requires two nested contributory exceptions
instead of two page faults faulting at the same address. A previous bug in
APF indicates that it may trigger a double fault in the guest [1], and
clearly this trace function has nothing to do with it. So renaming this
function is a reasonable choice.

No functional change intended.

[1] https://www.spinics.net/lists/kvm/msg214957.html

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220807052141.69186-1-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


KVM: x86/xen: Stop Xen timer before changing IRQ · c0368991

Committed by Coleman Dietsch on Aug 08, 2022

Stop the Xen timer (if it's running) prior to changing the IRQ vector and
potentially (re)starting the timer. Changing the IRQ vector while the
timer is still running can result in KVM injecting a garbage event, e.g.
kvm_xen_inject_timer_irqs() could see a non-zero xen.timer_pending from
a previous timer but inject the new xen.timer_virq.

Fixes: 53639526 ("KVM: x86/xen: handle PV timers oneshot mode")
Cc: stable@vger.kernel.org
Link: https://syzkaller.appspot.com/bug?id=8234a9dfd3aafbf092cc5a7cd9842e3ebc45fc42
Reported-by: syzbot+e54f930ed78eb0f85281@syzkaller.appspotmail.com
Signed-off-by: Coleman Dietsch <dietschc@csp.edu>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220808190607.323899-3-dietschc@csp.edu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
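
The race described above can be reproduced in a toy model. This is a sketch under stated assumptions: the struct and all toy_ functions are hypothetical, and "stopping" the timer is modeled as cancelling it and clearing its pending flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of per-vCPU Xen timer state; all names are hypothetical. */
struct toy_xen_timer {
    bool     running;  /* timer is armed */
    bool     pending;  /* timer fired, event not yet injected */
    uint32_t virq;     /* vector delivered on injection */
};

/* Simulates the timer callback firing for an armed timer. */
void toy_timer_fire(struct toy_xen_timer *t)
{
    assert(t);
    if (t->running) {
        t->running = false;
        t->pending = true;
    }
}

/* Cancel the timer and discard any event it already queued. */
void toy_timer_stop(struct toy_xen_timer *t)
{
    assert(t);
    t->running = false;
    t->pending = false;
}

/* Pre-fix ordering: the vector changes under a possibly pending event. */
void toy_set_virq_without_stop(struct toy_xen_timer *t, uint32_t virq)
{
    assert(t);
    t->virq = virq;
}

/* Post-fix ordering: stop first, then change the vector. */
void toy_set_virq(struct toy_xen_timer *t, uint32_t virq)
{
    assert(t);
    toy_timer_stop(t);
    t->virq = virq;
}

/* Returns the injected vector, or -1 if nothing was pending. */
int64_t toy_inject_timer_irq(struct toy_xen_timer *t)
{
    assert(t);
    if (!t->pending)
        return -1;
    t->pending = false;
    return t->virq;
}
```

In the pre-fix ordering, an event queued by the old timer is delivered on the new vector; in the post-fix ordering the stale event is dropped before the vector changes.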


KVM: x86/xen: Initialize Xen timer only once · af735db3

Committed by Coleman Dietsch on Aug 08, 2022

Add a check for existing xen timers before initializing a new one.

Currently kvm_xen_init_timer() is called on every
KVM_XEN_VCPU_ATTR_TYPE_TIMER, which is causing the following ODEBUG
crash when vcpu->arch.xen.timer is already set.

ODEBUG: init active (active state 0)
object type: hrtimer hint: xen_timer_callbac0
RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:502
Call Trace:
__debug_object_init
debug_hrtimer_init
debug_init
hrtimer_init
kvm_xen_init_timer
kvm_xen_vcpu_set_attr
kvm_arch_vcpu_ioctl
kvm_vcpu_ioctl
vfs_ioctl

Fixes: 53639526 ("KVM: x86/xen: handle PV timers oneshot mode")
Cc: stable@vger.kernel.org
Link: https://syzkaller.appspot.com/bug?id=8234a9dfd3aafbf092cc5a7cd9842e3ebc45fc42
Reported-by: syzbot+e54f930ed78eb0f85281@syzkaller.appspotmail.com
Signed-off-by: Coleman Dietsch <dietschc@csp.edu>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220808190607.323899-2-dietschc@csp.edu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
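
A minimal sketch of an "initialize only once" guard of the kind the message describes. The toy_hrtimer type and its helpers are hypothetical; the assumption made here is that a non-NULL callback pointer is a usable marker for "already initialized".

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a timer object; not the kernel's struct hrtimer. */
struct toy_hrtimer {
    void (*function)(struct toy_hrtimer *);  /* NULL until initialized */
    int init_calls;                          /* counts initializations */
};

/* Hypothetical timer callback; does nothing in this sketch. */
void toy_xen_timer_callback(struct toy_hrtimer *t)
{
    (void)t;
}

/* Initialization that must not run on an already active object. */
void toy_hrtimer_init(struct toy_hrtimer *t)
{
    assert(t);
    t->function = toy_xen_timer_callback;
    t->init_calls++;
}

/* Attribute handler: may be invoked many times for the same vCPU. */
void toy_set_timer_attr(struct toy_hrtimer *t)
{
    assert(t);
    /* Guard: only initialize if no timer has been set up yet. */
    if (t->function == NULL)
        toy_hrtimer_init(t);
    /* ... stopping and re-arming the timer would follow here ... */
}
```

Calling toy_set_timer_attr() repeatedly leaves init_calls at 1, which is the property that avoids re-initializing an active object.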


KVM: SVM: Disable SEV-ES support if MMIO caching is disabled · 0c29397a

Committed by Sean Christopherson on Aug 03, 2022

Disable SEV-ES if MMIO caching is disabled as SEV-ES relies on MMIO SPTEs
generating #NPF(RSVD), which are reflected by the CPU into the guest as
a #VC.  With SEV-ES, the untrusted host, a.k.a. KVM, doesn't have access
to the guest instruction stream or register state and so can't directly
emulate in response to a #NPF on an emulated MMIO GPA.  Disabling MMIO
caching means guest accesses to emulated MMIO ranges cause #NPF(!PRESENT),
and those flavors of #NPF cause automatic VM-Exits, not #VC.

Adjust KVM's MMIO masks to account for the C-bit location prior to doing
SEV(-ES) setup, and document that dependency between adjusting the MMIO
SPTE mask and SEV(-ES) setup.

Fixes: b09763da ("KVM: x86/mmu: Add module param to disable MMIO caching (for testing)")
Reported-by: Michael Roth <michael.roth@amd.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220803224957.1285926-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks change · c3e0c8c2

Committed by Sean Christopherson on Aug 03, 2022

Fully re-evaluate whether or not MMIO caching can be enabled when SPTE
masks change; simply clearing enable_mmio_caching when a configuration
isn't compatible with caching fails to handle the scenario where the
masks are updated, e.g. by VMX for EPT or by SVM to account for the C-bit
location, and toggle compatibility from false=>true.

Snapshot the original module param so that re-evaluating MMIO caching
preserves userspace's desire to allow caching.  Use a snapshot approach
so that enable_mmio_caching still reflects KVM's actual behavior.

Fixes: 8b9e74bf ("KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled")
Reported-by: Michael Roth <michael.roth@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Tested-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20220803224957.1285926-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
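
The snapshot approach can be sketched as follows. All toy_ names are hypothetical, and the model assumes that a zero MMIO value stands for "these masks are incompatible with caching"; it is an illustration of the idea rather than KVM's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* "Module param": after init it reflects the actual behavior. */
bool toy_enable_mmio_caching = true;
/* Snapshot of what userspace asked for, taken once at module init. */
static bool toy_allow_mmio_caching;
/* MMIO value currently in use; zero means caching is off in this model. */
uint64_t toy_mmio_value;

void toy_mmu_module_init(void)
{
    toy_allow_mmio_caching = toy_enable_mmio_caching;
}

/* May be called several times as vendor code updates the masks. */
void toy_set_mmio_spte_mask(uint64_t mmio_value)
{
    /* Restart from userspace's wish, not from the previous verdict. */
    toy_enable_mmio_caching = toy_allow_mmio_caching;

    if (!toy_enable_mmio_caching)
        mmio_value = 0;

    /* Assumption: a zero value means the masks cannot support caching. */
    if (mmio_value == 0)
        toy_enable_mmio_caching = false;

    toy_mmio_value = mmio_value;

    /* Invariant: the param always matches what is actually in effect. */
    assert(toy_enable_mmio_caching == (toy_mmio_value != 0));
}
```

Because each call starts from the snapshot, a later mask update can flip compatibility from false back to true, while a user who disabled caching never sees it re-enabled.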


KVM: x86: Tag kvm_mmu_x86_module_init() with __init · 982bae43

Committed by Sean Christopherson on Aug 03, 2022

Mark kvm_mmu_x86_module_init() with __init; the entire reason it exists
is to initialize variables when kvm.ko is loaded, i.e. it must never be
called after module initialization.

Fixes: 1d0e8480 ("KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded")
Cc: stable@vger.kernel.org
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220803224957.1285926-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


KVM: x86: emulator: Fix illegal LEA handling · 4ac5b423

Committed by Michal Luczaj on Jul 29, 2022

The emulator mishandles LEA with a register source operand. Even though such
an LEA is illegal, it can be encoded and fed to the CPU, in which case real
hardware raises #UD. The emulator instead returns the address of
x86_emulate_ctxt._regs. This info leak weakens the host's kASLR.

Tell the decoder that illegal LEA is not to be emulated.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20220729134801.1120-1-mhal@rbox.co>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
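
For readers unfamiliar with the encoding: LEA is opcode 0x8D followed by a ModRM byte, and a ModRM "mod" field of 3 (binary 11) selects a register operand instead of a memory operand, which is the illegal form. The helper below is a sketch of that check only; the function and macro names are hypothetical and it does not reflect how the emulator's decoder tables are organized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_LEA_OPCODE  0x8d
/* The top two bits of ModRM select the addressing mode. */
#define TOY_MODRM_MOD(modrm) (((modrm) >> 6) & 0x3)
/* mod == 3 means the r/m operand is a register, not memory. */
#define TOY_MODRM_MOD_REG 3

/*
 * Takes the opcode byte and ModRM byte of an LEA instruction (prefixes
 * already stripped) and reports whether it is the register-source form
 * that hardware rejects with #UD.
 */
bool toy_lea_is_illegal(const uint8_t insn[2])
{
    assert(insn && insn[0] == TOY_LEA_OPCODE);
    return TOY_MODRM_MOD(insn[1]) == TOY_MODRM_MOD_REG;
}
```

For example, the byte pair 8D C0 has mod == 3 and is the illegal register form, while 8D 00 has mod == 0 and encodes an ordinary memory operand.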


KVM: X86: avoid uninitialized 'fault.async_page_fault' from fixed-up #PF · 2bc685e6

Committed by Yu Zhang on Jul 18, 2022

kvm_fixup_and_inject_pf_error() was introduced to fix up the error code
(e.g., to add the RSVD flag) and inject the #PF into the guest when the guest
MAXPHYADDR is smaller than the host's.

When it comes to nested, L0 is expected to intercept and fix up the #PF
and then inject to L2 directly if
- L2.MAXPHYADDR < L0.MAXPHYADDR and
- L1 has no intention to intercept L2's #PF (e.g., L2 and L1 have the
  same MAXPHYADDR value && L1 is using EPT for L2),
instead of constructing a #PF VM Exit to L1. Currently, with PFEC_MASK
and PFEC_MATCH both set to 0 in vmcs02, the interception and injection
may happen on all L2 #PFs.

However, failing to initialize 'fault' in kvm_fixup_and_inject_pf_error()
may leave fault.async_page_fault non-zero, in which case the #PF is later
treated as a nested async page fault and injected into L1.
Instead of zeroing 'fault' at the beginning of this function, manually
set the value of 'fault.async_page_fault', because false is the value we
really expect.

Fixes: 89786147 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216178
Reported-by: Yang Lixiao <lixiao.yang@intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220718074756.53788-1-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
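
The failure mode can be sketched with a toy struct. Everything here is hypothetical: the struct layout, the function names, and the use of memset() to simulate leftover stack contents (reading a truly uninitialized variable would be undefined behavior, so the garbage is made explicit). The RSVD error-code bit is bit 3, as in the x86 page-fault error code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PFERR_RSVD (1u << 3)  /* reserved-bit violation flag */

/* Toy stand-in for a page-fault descriptor; not KVM's x86_exception. */
struct toy_fault {
    uint64_t address;
    uint32_t error_code;
    uint8_t  async_page_fault;  /* non-zero routes the fault differently */
};

/* Simulates whatever bytes happened to be on the stack. */
void toy_poison_fault(struct toy_fault *fault)
{
    assert(fault);
    memset(fault, 0xff, sizeof(*fault));
}

/* Pre-fix: only some fields are written; the flag keeps its garbage. */
void toy_fixup_fault_buggy(struct toy_fault *fault, uint64_t gva,
                           uint32_t error_code)
{
    assert(fault);
    fault->address = gva;
    fault->error_code = error_code | TOY_PFERR_RSVD;
}

/* Post-fix: the flag is set explicitly to the value that is expected. */
void toy_fixup_fault(struct toy_fault *fault, uint64_t gva,
                     uint32_t error_code)
{
    assert(fault);
    fault->address = gva;
    fault->error_code = error_code | TOY_PFERR_RSVD;
    fault->async_page_fault = 0;
}
```

Setting the single field, rather than zeroing the whole struct up front, mirrors the choice described in the message: the other fields are overwritten anyway, and false is the only value this path ever wants.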


openeuler / Kernel: synced successfully almost 2 years ago
