提交 · 55ede4f19ea7bc0be4f3040fbc752c98e77158ec · openeuler / Kernel

01 9月, 2022 1 次提交

KVM: x86: Properly handle APF vs disabled LAPIC situation · 55ede4f1

由 Vitaly Kuznetsov 提交于 8月 30, 2022

stable inclusion
from stable-v5.10.119
commit 74c6e5d584354c6126f1231667f9d8e85d7f536f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5L6BB

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=74c6e5d584354c6126f1231667f9d8e85d7f536f

--------------------------------

commit 2f15d027 upstream.

Async PF 'page ready' event may happen when LAPIC is (temporary) disabled.
In particular, Sebastien reports that when Linux kernel is directly booted
by Cloud Hypervisor, LAPIC is 'software disabled' when APF mechanism is
initialized. On initialization KVM tries to inject 'wakeup all' event and
puts the corresponding token to the slot. It is, however, failing to inject
an interrupt (kvm_apic_set_irq() -> __apic_accept_irq() -> !apic_enabled())
so the guest never gets notified and the whole APF mechanism gets stuck.
The same issue is likely to happen if the guest temporary disables LAPIC
and a previously unavailable page becomes available.

Do two things to resolve the issue:
- Avoid dequeuing 'page ready' events from APF queue when LAPIC is
  disabled.
- Trigger an attempt to deliver pending 'page ready' events when LAPIC
  becomes enabled (SPIV or MSR_IA32_APICBASE).
Reported-by: NSebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210422092948.568327-1-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
[Guoqing: backport to 5.10-stable ]
Signed-off-by: NGuoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

55ede4f1

17 8月, 2022 3 次提交

KVM: x86/mmu: Update number of zapped pages even if page list is stable · f7f1301d

由 Sean Christopherson 提交于 8月 17, 2022

stable inclusion
from stable-v5.10.118
commit b49bc8d615eea1043cb238318af04f8876e846b3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5L686

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b49bc8d615eea1043cb238318af04f8876e846b3

--------------------------------

commit b28cb0cd upstream.

When zapping obsolete pages, update the running count of zapped pages
regardless of whether or not the list has become unstable due to zapping
a shadow page with its own child shadow pages.  If the VM is backed by
mostly 4kb pages, KVM can zap an absurd number of SPTEs without bumping
the batch count and thus without yielding.  In the worst case scenario,
this can cause a soft lokcup.

 watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [dirty_log_perf_:13020]
   RIP: 0010:workingset_activation+0x19/0x130
   mark_page_accessed+0x266/0x2e0
   kvm_set_pfn_accessed+0x31/0x40
   mmu_spte_clear_track_bits+0x136/0x1c0
   drop_spte+0x1a/0xc0
   mmu_page_zap_pte+0xef/0x120
   __kvm_mmu_prepare_zap_page+0x205/0x5e0
   kvm_mmu_zap_all_fast+0xd7/0x190
   kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
   kvm_page_track_flush_slot+0x5c/0x80
   kvm_arch_flush_shadow_memslot+0xe/0x10
   kvm_set_memslot+0x1a8/0x5d0
   __kvm_set_memory_region+0x337/0x590
   kvm_vm_ioctl+0xb08/0x1040

Fixes: fbb158cb ("KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""")
Reported-by: NDavid Matlack <dmatlack@google.com>
Reviewed-by: NBen Gardon <bgardon@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220511145122.3133334-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

f7f1301d

crypto: x86/chacha20 - Avoid spurious jumps to other functions · 86ecb48d

由 Peter Zijlstra 提交于 8月 17, 2022

stable inclusion
from stable-v5.10.118
commit a59450656bcda7fbee9f892d5a65715ad846ce29
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5L686

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a59450656bcda7fbee9f892d5a65715ad846ce29

--------------------------------

[ Upstream commit 4327d168 ]

The chacha_Nblock_xor_avx512vl() functions all have their own,
identical, .LdoneN label, however in one particular spot {2,4} jump to
the 8 version instead of their own. Resulting in:

  arch/x86/crypto/chacha-x86_64.o: warning: objtool: chacha_2block_xor_avx512vl() falls through to next function chacha_8block_xor_avx512vl()
  arch/x86/crypto/chacha-x86_64.o: warning: objtool: chacha_4block_xor_avx512vl() falls through to next function chacha_8block_xor_avx512vl()

Make each function consistently use its own done label.
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NMartin Willi <martin@strongswan.org>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

86ecb48d

um: Cleanup syscall_handler_t definition/cast, fix warning · 63c98d5f

由 David Gow 提交于 8月 17, 2022

stable inclusion
from stable-v5.10.118
commit ea6a86886caa1ec659dd08b38ca879bacde47123
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5L686

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ea6a86886caa1ec659dd08b38ca879bacde47123

--------------------------------

[ Upstream commit f4f03f29 ]

The syscall_handler_t type for x86_64 was defined as 'long (*)(void)',
but always cast to 'long (*)(long, long, long, long, long, long)' before
use. This now triggers a warning (see below).

Define syscall_handler_t as the latter instead, and remove the cast.
This simplifies the code, and fixes the warning.

Warning:
In file included from ../arch/um/include/asm/processor-generic.h:13
                 from ../arch/x86/um/asm/processor.h:41
                 from ../include/linux/rcupdate.h:30
                 from ../include/linux/rculist.h:11
                 from ../include/linux/pid.h:5
                 from ../include/linux/sched.h:14
                 from ../include/linux/ptrace.h:6
                 from ../arch/um/kernel/skas/syscall.c:7:
../arch/um/kernel/skas/syscall.c: In function ‘handle_syscall’:
../arch/x86/um/shared/sysdep/syscalls_64.h:18:11: warning: cast between incompatible function types from ‘long int (*)(void)’ to ‘long int (*)(long int,  long int,  long int,  long int,  long int,  long int)’ [
-Wcast-function-type]
   18 |         (((long (*)(long, long, long, long, long, long)) \
      |           ^
../arch/x86/um/asm/ptrace.h:36:62: note: in definition of macro ‘PT_REGS_SET_SYSCALL_RETURN’
   36 | #define PT_REGS_SET_SYSCALL_RETURN(r, res) (PT_REGS_AX(r) = (res))
      |                                                              ^~~
../arch/um/kernel/skas/syscall.c:46:33: note: in expansion of macro ‘EXECUTE_SYSCALL’
   46 |                                 EXECUTE_SYSCALL(syscall, regs));
      |                                 ^~~~~~~~~~~~~~~
Signed-off-by: NDavid Gow <davidgow@google.com>
Signed-off-by: NRichard Weinberger <richard@nod.at>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

63c98d5f

10 8月, 2022 1 次提交

x86: Clear .brk area at early boot · fac962a6

由 Juergen Gross 提交于 8月 10, 2022

stable inclusion
from stable-v5.10.132
commit 136d7987fcfdeca73ee3c6a29e48f99fdd0f4d87
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5JTYM
CVE: CVE-2022-36123

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=136d7987fcfdeca73ee3c6a29e48f99fdd0f4d87

--------------------------------

[ Upstream commit 38fa5479 ]

The .brk section has the same properties as .bss: it is an alloc-only
section and should be cleared before being used.

Not doing so is especially a problem for Xen PV guests, as the
hypervisor will validate page tables (check for writable page tables
and hypervisor private bits) before accepting them to be used.

Make sure .brk is initially zero by letting clear_bss() clear the brk
area, too.
Signed-off-by: NJuergen Gross <jgross@suse.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220630071441.28576-3-jgross@suse.comSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NGONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: NWang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

fac962a6

04 8月, 2022 9 次提交

KVM: LAPIC: Enable timer posted-interrupt only when mwait/hlt is advertised · dab6aad5

由 Wanpeng Li 提交于 8月 04, 2022

stable inclusion
from stable-v5.10.115
commit 43dbc3edada66eaa9d406ea9647359e795288ea0
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5IZ9C

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=43dbc3edada66eaa9d406ea9647359e795288ea0

--------------------------------

[ Upstream commit 1714a4eb ]

As commit 0c5f81da ("KVM: LAPIC: Inject timer interrupt via posted
interrupt") mentioned that the host admin should well tune the guest
setup, so that vCPUs are placed on isolated pCPUs, and with several pCPUs
surplus for *busy* housekeeping.  In this setup, it is preferrable to
disable mwait/hlt/pause vmexits to keep the vCPUs in non-root mode.

However, if only some guests isolated and others not, they would not
have any benefit from posted timer interrupts, and at the same time lose
VMX preemption timer fast paths because kvm_can_post_timer_interrupt()
returns true and therefore forces kvm_can_use_hv_timer() to false.

By guaranteeing that posted-interrupt timer is only used if MWAIT or
HLT are done without vmexit, KVM can make a better choice and use the
VMX preemption timer and the corresponding fast paths.
Reported-by: NAili Yao <yaoaili@kingsoft.com>
Reviewed-by: NSean Christopherson <seanjc@google.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Message-Id: <1643112538-36743-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

dab6aad5

KVM: x86/mmu: avoid NULL-pointer dereference on page freeing bugs · a45d3e87

由 Paolo Bonzini 提交于 8月 04, 2022

stable inclusion
from stable-v5.10.115
commit 9c8474fa3477395e754687b3edb17ca0dfe46d6a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5IZ9C

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9c8474fa3477395e754687b3edb17ca0dfe46d6a

--------------------------------

[ Upstream commit 9191b8f0 ]

WARN and bail if KVM attempts to free a root that isn't backed by a shadow
page.  KVM allocates a bare page for "special" roots, e.g. when using PAE
paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa
will be valid but won't be backed by a shadow page.  It's all too easy to
blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of
crashing KVM and possibly the kernel.
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

a45d3e87

KVM: x86: Do not change ICR on write to APIC_SELF_IPI · 440b2a7a

由 Paolo Bonzini 提交于 8月 04, 2022

stable inclusion
from stable-v5.10.115
commit a474ee5ececc19c764c718c76550c5650e54a3be
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5IZ9C

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a474ee5ececc19c764c718c76550c5650e54a3be

--------------------------------

[ Upstream commit d22a81b3 ]

Emulating writes to SELF_IPI with a write to ICR has an unwanted side effect:
the value of ICR in vAPIC page gets changed.  The lists SELF_IPI as write-only,
with no associated MMIO offset, so any write should have no visible side
effect in the vAPIC page.
Reported-by: NChao Gao <chao.gao@intel.com>
Reviewed-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

440b2a7a

x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume · b5f6c150

由 Wanpeng Li 提交于 8月 04, 2022

stable inclusion
from stable-v5.10.115
commit 64e3e16dbc26c88bc182028b53c27df7af40239a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5IZ9C

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=64e3e16dbc26c88bc182028b53c27df7af40239a

--------------------------------

[ Upstream commit 0361bdfd ]

MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to
host-side polling after suspend/resume.  Non-bootstrap CPUs are
restored correctly by the haltpoll driver because they are hot-unplugged
during suspend and hot-plugged during resume; however, the BSP
is not hotpluggable and remains in host-sde polling mode after
the guest resume.  The makes the guest pay for the cost of vmexits
every time the guest enters idle.

Fix it by recording BSP's haltpoll state and resuming it during guest
resume.

Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Message-Id: <1650267752-46796-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

b5f6c150

kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU · 6876811e

由 Sandipan Das 提交于 8月 04, 2022

stable inclusion
from stable-v5.10.115
commit 599fc32e7421d7a591ee78bb4f3b79c54d51065c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5IZ9C

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=599fc32e7421d7a591ee78bb4f3b79c54d51065c

--------------------------------

[ Upstream commit 5a1bde46 ]

On some x86 processors, CPUID leaf 0xA provides information
on Architectural Performance Monitoring features. It
advertises a PMU version which Qemu uses to determine the
availability of additional MSRs to manage the PMCs.

Upon receiving a KVM_GET_SUPPORTED_CPUID ioctl request for
the same, the kernel constructs return values based on the
x86_pmu_capability irrespective of the vendor.

This leaf and the additional MSRs are not supported on AMD
and Hygon processors. If AMD PerfMonV2 is detected, the PMU
version is set to 2 and guest startup breaks because of an
attempt to access a non-existent MSR. Return zeros to avoid
this.

Fixes: a6c06ed1 ("KVM: Expose the architectural performance monitoring CPUID leaf")
Reported-by: NVasant Hegde <vasant.hegde@amd.com>
Signed-off-by: NSandipan Das <sandipan.das@amd.com>
Message-Id: <3fef83d9c2b2f7516e8ff50d60851f29a4bcb716.1651058600.git.sandipan.das@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

6876811e

KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id · 34f6c6a5

由 Kyle Huey 提交于 8月 04, 2022

stable inclusion
from stable-v5.10.115
commit fbb7c61e76017202dedd2020f4d51264f500de68
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5IZ9C

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=fbb7c61e76017202dedd2020f4d51264f500de68

--------------------------------

commit 5eb84932 upstream.

Zen renumbered some of the performance counters that correspond to the
well known events in perf_hw_id. This code in KVM was never updated for
that, so guest that attempt to use counters on Zen that correspond to the
pre-Zen perf_hw_id values will silently receive the wrong values.

This has been observed in the wild with rr[0] when running in Zen 3
guests. rr uses the retired conditional branch counter 00d1 which is
incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND.

[0] https://rr-project.org/Signed-off-by: NKyle Huey <me@kylehuey.com>
Message-Id: <20220503050136.86298-1-khuey@kylehuey.com>
Cc: stable@vger.kernel.org
[Check guest family, not host. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

34f6c6a5

x86/cpu: Load microcode during restore_processor_state() · 896c7853

由 Borislav Petkov 提交于 8月 04, 2022

stable inclusion
from stable-v5.10.114
commit 2ab14625b879eec22854f1dbe61d51570b427513
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5IY1V

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=2ab14625b879eec22854f1dbe61d51570b427513

--------------------------------

commit f9e14dbb upstream.

When resuming from system sleep state, restore_processor_state()
restores the boot CPU MSRs. These MSRs could be emulated by microcode.
If microcode is not loaded yet, writing to emulated MSRs leads to
unchecked MSR access error:

  ...
  PM: Calling lapic_suspend+0x0/0x210
  unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0...0) at rIP: ... (native_write_msr)
  Call Trace:
    <TASK>
    ? restore_processor_state
    x86_acpi_suspend_lowlevel
    acpi_suspend_enter
    suspend_devices_and_enter
    pm_suspend.cold
    state_store
    kobj_attr_store
    sysfs_kf_write
    kernfs_fop_write_iter
    new_sync_write
    vfs_write
    ksys_write
    __x64_sys_write
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
   RIP: 0033:0x7fda13c260a7

To ensure microcode emulated MSRs are available for restoration, load
the microcode on the boot CPU before restoring these MSRs.

  [ Pawan: write commit message and productize it. ]

Fixes: e2a1256b ("x86/speculation: Restore speculation related MSRs during S3 resume")
Reported-by: NKyle D. Pelton <kyle.d.pelton@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NPawan Gupta <pawan.kumar.gupta@linux.intel.com>
Tested-by: NKyle D. Pelton <kyle.d.pelton@intel.com>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215841
Link: https://lore.kernel.org/r/4350dfbf785cd482d3fafa72b2b49c83102df3ce.1650386317.git.pawan.kumar.gupta@linux.intel.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

896c7853

x86: __memcpy_flushcache: fix wrong alignment if size > 2^32 · 898494ce

由 Mikulas Patocka 提交于 8月 04, 2022

stable inclusion
from stable-v5.10.114
commit 80fc45377f410426c8d922db4838760b107970aa
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5IY1V

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=80fc45377f410426c8d922db4838760b107970aa

--------------------------------

[ Upstream commit a6823e4e ]

The first "if" condition in __memcpy_flushcache is supposed to align the
"dest" variable to 8 bytes and copy data up to this alignment.  However,
this condition may misbehave if "size" is greater than 4GiB.

The statement min_t(unsigned, size, ALIGN(dest, 8) - dest); casts both
arguments to unsigned int and selects the smaller one.  However, the
cast truncates high bits in "size" and it results in misbehavior.

For example:

	suppose that size == 0x100000001, dest == 0x200000002
	min_t(unsigned, size, ALIGN(dest, 8) - dest) == min_t(0x1, 0xe) == 0x1;
	...
	dest += 0x1;

so we copy just one byte "and" dest remains unaligned.

This patch fixes the bug by replacing unsigned with size_t.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

898494ce

x86/pci/xen: Disable PCI/MSI[-X] masking for XEN_HVM guests · 80a92537

由 Thomas Gleixner 提交于 8月 04, 2022

stable inclusion
from stable-v5.10.114
commit ad604cbd1d54ba9b2944da72f97cd4669c8fcc1a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5IY1V

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ad604cbd1d54ba9b2944da72f97cd4669c8fcc1a

--------------------------------

commit 7e0815b3 upstream.

When a XEN_HVM guest uses the XEN PIRQ/Eventchannel mechanism, then
PCI/MSI[-X] masking is solely controlled by the hypervisor, but contrary to
XEN_PV guests this does not disable PCI/MSI[-X] masking in the PCI/MSI
layer.

This can lead to a situation where the PCI/MSI layer masks an MSI[-X]
interrupt and the hypervisor grants the write despite the fact that it
already requested the interrupt. As a consequence interrupt delivery on the
affected device is not happening ever.

Set pci_msi_ignore_mask to prevent that like it's done for XEN_PV guests
already.

Fixes: 809f9267 ("xen: map MSIs into pirqs")
Reported-by: NJeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Reported-by: NDusty Mabe <dustymabe@redhat.com>
Reported-by: NSalvatore Bonaccorso <carnil@debian.org>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NNoah Meyerhans <noahm@debian.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuaduxj5.ffs@tglxSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

80a92537

29 7月, 2022 3 次提交

x86/bus_lock: Set rate limit for bus lock · 0b814883

由 Fenghua Yu 提交于 4月 19, 2021

mainline inclusion
from mainline-v5.12-rc4
commit ef4ae6e4
category: feature
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5G10C
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?id=ef4ae6e4

Intel-SIG: commit ef4ae6e4 x86/bus_lock: Set rate limit for bus lock_

--------------------------------

A bus lock can be thousands of cycles slower than atomic operation within
one cache line. It also disrupts performance on other cores. Malicious
users can generate multiple bus locks to degrade the whole system
performance.

The current mitigation is to kill the offending process, but for certain
scenarios it's desired to identify and throttle the offending application.

Add a system wide rate limit for bus locks. When the system detects bus
locks at a rate higher than N/sec (where N can be set by the kernel boot
argument in the range [1..1000]) any task triggering a bus lock will be
forced to sleep for at least 20ms until the overall system rate of bus
locks drops below the threshold.
Signed-off-by: NFenghua Yu <fenghua.yu@intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NTony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210419214958.4035512-3-fenghua.yu@intel.com

(cherry picked from commit ef4ae6e4)
Signed-off-by: NEthan Zhao <haifeng.zhao@linux.intel.com>

0b814883

x86/traps: Handle #DB for bus lock · c8db5890

由 Fenghua Yu 提交于 3月 22, 2021

mainline inclusion
from mainline-v5.12-rc4
commit ebb1064e
category: feature
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5G10C
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?id=ebb1064e

Intel-SIG: commit ebb1064e x86/traps: Handle #DB for bus lock

--------------------------------

Bus locks degrade performance for the whole system, not just for the CPU
that requested the bus lock. Two CPU features "#AC for split lock" and
"#DB for bus lock" provide hooks so that the operating system may choose
one of several mitigation strategies.

bus lock feature to cover additional situations with new options to
mitigate.

split_lock_detect=
		#AC for split lock		#DB for bus lock

off		Do nothing			Do nothing

warn		Kernel OOPs			Warn once per task and
		Warn once per task and		and continues to run.
		disable future checking
	 	When both features are
		supported, warn in #AC

fatal		Kernel OOPs			Send SIGBUS to user.
		Send SIGBUS to user
		When both features are
		supported, fatal in #AC

ratelimit:N	Do nothing			Limit bus lock rate to
						N per second in the
						current non-root user.

Default option is "warn".

Hardware only generates #DB for bus lock detect when CPL>0 to avoid
nested #DB from multiple bus locks while the first #DB is being handled.
So no need to handle #DB for bus lock detected in the kernel.

while #AC for split lock is enabled by split lock detection bit 29 in
TEST_CTRL MSR.

Both breakpoint and bus lock in the same instruction can trigger one #DB.
The bus lock is handled before the breakpoint in the #DB handler.

Delivery of #DB for bus lock in userspace clears DR6[11], which is set by
the #DB handler right after reading DR6.
Signed-off-by: NFenghua Yu <fenghua.yu@intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NTony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210322135325.682257-3-fenghua.yu@intel.com

(cherry picked from commit ebb1064e)
Signed-off-by: NEthan Zhao <haifeng.zhao@linux.intel.com>

c8db5890

x86/cpufeatures: Enumerate #DB for bus lock detection · 688b676f

由 Fenghua Yu 提交于 3月 22, 2021

mainline inclusion
from mainline-v5.12-rc4
commit f21d4d3b
category: feature
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5G10C
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?id=f21d4d3b

Intel-SIG: commit f21d4d3b x86/cpufeatures: Enumerate #DB for bus lock detection

--------------------------------

A bus lock is acquired through either a split locked access to writeback
(WB) memory or any locked access to non-WB memory. This is typically >1000
cycles slower than an atomic operation within a cache line. It also
disrupts performance on other cores.

Some CPUs have the ability to notify the kernel by a #DB trap after a user
instruction acquires a bus lock and is executed. This allows the kernel to
enforce user application throttling or mitigation. Both breakpoint and bus
lock can trigger the #DB trap in the same instruction and the ordering of
handling them is the kernel #DB handler's choice.

The CPU feature flag to be shown in /proc/cpuinfo will be "bus_lock_detect".
Signed-off-by: NFenghua Yu <fenghua.yu@intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NTony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210322135325.682257-2-fenghua.yu@intel.com

(cherry picked from commit f21d4d3b)
Signed-off-by: NEthan Zhao <haifeng.zhao@linux.intel.com>

688b676f

28 7月, 2022 2 次提交

mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP* · f1f5face

由 Muchun Song 提交于 7月 28, 2022

mainline inclusion
from mainline-v5.19-rc1
commit 47010c04
category: feature
bugzilla: 187198, https://gitee.com/openeuler/kernel/issues/I5GVFO
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=47010c040dec8af6347ec6259104fc13f7e7e30a

--------------------------------

The word of "free" is not expressive enough to express the feature of
optimizing vmemmap pages associated with each HugeTLB, rename this keywork
to "optimize".  In this patch , cheanup configs to make code more
expressive.

Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Conflicts:
	arch/arm64/configs/openeuler_defconfig
	arch/x86/configs/openeuler_defconfig
	Documentation/admin-guide/kernel-parameters.txt
	include/linux/hugetlb.h
	mm/Makefile
Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

f1f5face

mm: hugetlb_vmemmap: introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP · e457e43c

由 Muchun Song 提交于 7月 28, 2022

mainline inclusion
from mainline-v5.19-rc1
commit 2e4ec02b
category: feature
bugzilla: 187198, https://gitee.com/openeuler/kernel/issues/I5GVFO
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2e4ec02bbcc05b8905d65c763ebde6bc85508e90

--------------------------------

The feature of minimizing overhead of struct page associated with each
HugeTLB page is implemented on x86_64, however, the infrastructure of this
feature is already there, we could easily enable it for other
architectures.  Introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP for other
architectures to be easily enabled.  Just select this config if they want
to enable this feature.

Link: https://lkml.kernel.org/r/20220331065640.5777-1-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NBarry Song <baohua@kernel.org>
Tested-by: NBarry Song <baohua@kernel.org>
Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

e457e43c

26 7月, 2022 1 次提交

stat: fix inconsistency between struct stat and struct compat_stat · 7bec2850

由 Mikulas Patocka 提交于 7月 26, 2022

stable inclusion
from stable-v5.10.113
commit 76101c8e0c31938e59bbcb4beed47c8d8b89d39a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ISAH

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=76101c8e0c31938e59bbcb4beed47c8d8b89d39a

--------------------------------

[ Upstream commit 932aba1e ]

struct stat (defined in arch/x86/include/uapi/asm/stat.h) has 32-bit
st_dev and st_rdev; struct compat_stat (defined in
arch/x86/include/asm/compat.h) has 16-bit st_dev and st_rdev followed by
a 16-bit padding.

This patch fixes struct compat_stat to match struct stat.

[ Historical note: the old x86 'struct stat' did have that 16-bit field
  that the compat layer had kept around, but it was changes back in 2003
  by "struct stat - support larger dev_t":

    https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e95b2065677fe32512a597a79db94b77b90c968d

  and back in those days, the x86_64 port was still new, and separate
  from the i386 code, and had already picked up the old version with a
  16-bit st_dev field ]

Note that we can't change compat_dev_t because it is used by
compat_loop_info.

Also, if the st_dev and st_rdev values are 32-bit, we don't have to use
old_valid_dev to test if the value fits into them.  This fixes
-EOVERFLOW on filesystems that are on NVMe because NVMe uses the major
number 259.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

7bec2850

21 7月, 2022 4 次提交

x86/mce: Reduce number of machine checks taken during recovery · 3ea63cb7

由 Youquan Song 提交于 12月 23, 2021

mainline inclusion
from mainline-v5.17-rc1
commit 33761363
category: bugfix
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5HAC1
CVE: NA

Intel-SIG: commit 33761363 x86/mce: Reduce number of machine checks taken during
recovery.
Backport for MCA recovery enhancing & bug fix.

--------------------------------

When any of the copy functions in arch/x86/lib/copy_user_64.S take a
fault, the fixup code copies the remaining byte count from %ecx to %edx
and unconditionally jumps to .Lcopy_user_handle_tail to continue the
copy in case any more bytes can be copied.

If the fault was #PF this may copy more bytes (because the page fault
handler might have fixed the fault). But when the fault is a machine
check the original copy code will have copied all the way to the poisoned
cache line. So .Lcopy_user_handle_tail will just take another machine
check for no good reason.

Every code path to .Lcopy_user_handle_tail comes from an exception fixup
path, so add a check there to check the trap type (in %eax) and simply
return the count of remaining bytes if the trap was a machine check.

Doing this reduces the number of machine checks taken during synthetic
tests from four to three.

As well as reducing the number of machine checks, this also allows
Skylake generation Xeons to recover some cases that currently fail. The
is because REP; MOVSB is only recoverable when source and destination
are well aligned and the byte count is large. That useless call to
.Lcopy_user_handle_tail may violate one or more of these conditions and
generate a fatal machine check.

[ Tony: Add more details to commit message. ]
[ bp: Fixup comment.
Also, another tip patchset which is adding straight-line speculation
mitigation changes the "ret" instruction to an all-caps macro "RET".
But, since gas is case-insensitive, use "RET" in the newly added asm block
already in order to simplify tip branch merging on its way upstream.
]
Signed-off-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/YcTW5dh8yTGucDd+@agluck-desk2.amr.corp.intel.comSigned-off-by: NYouquan Song <youquan.song@intel.com>

3ea63cb7

x86/mce: Drop copyin special case for #MC · a1728f8d

由 Tony Luck 提交于 8月 17, 2021

mainline inclusion
from mainline-v5.16-rc1
commit 69065847
category: bugfix
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5HAC1
CVE: NA

Intel-SIG: commit 69065847 x86/mce: Drop copyin special case for #MC.
Backport for MCA recovery enhancing & bug fix.

--------------------------------

Fixes to the iterator code to handle faults that are not on page
boundaries mean that the special case for machine check during copy from
user is no longer needed.

For a full list of those fixes, see the output of:

  git log --oneline v5.14 ^v5.13 -- lib/iov_iter.c

Intel-SIG: commit 69065847 x86/mce: Drop copyin special case for #MC.
backport for MCA recovery enhance and bug fix.
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210818002942.1607544-4-tony.luck@intel.comSigned-off-by: NYouquan Song <youquan.song@intel.com>

a1728f8d

x86/mce: Change to not send SIGBUS error during copy from user · d1b9413b

由 Tony Luck 提交于 8月 17, 2021

mainline inclusion
from mainline-v5.16-rc1
commit a6e3cf70
category: bugfix
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5HAC1
CVE: NA

Intel-SIG: commit a6e3cf70 x86/mce: Change to not send SIGBUS
error during copy from user.
Backport for MCA recovery enhance and bug fix.

--------------------------------

Sending a SIGBUS for a copy from user is not the correct semantic.
System calls should return -EFAULT (or a short count for write(2)).
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.comSigned-off-by: NYouquan Song <youquan.song@intel.com>

d1b9413b

mm,hwpoison: send SIGBUS with error virutal address · 75a005c5

由 Naoya Horiguchi 提交于 6月 28, 2021

mainline inclusion
from mainline-v5.14-rc1
commit a3f5d80e
category: bugfix
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5HAC1
CVE: NA

Intel-SIG: commit a3f5d80e mm,hwpoison: send SIGBUS with error virutal address.
Backport for MCA recovery enhancing & bug fix.

--------------------------------

Now an action required MCE in already hwpoisoned address surely sends a
SIGBUS to current process, but the SIGBUS doesn't convey error virtual
address.  That's not optimal for hwpoison-aware applications.

To fix the issue, make memory_failure() call kill_accessing_process(),
that does pagetable walk to find the error virtual address.  It could find
multiple virtual addresses for the same error page, and it seems hard to
tell which virtual address is correct one.  But that's rare and sending
incorrect virtual address could be better than no address.  So let's
report the first found virtual address for now.

[naoya.horiguchi@nec.com: fix walk_page_range() return]
  Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp
Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.comSigned-off-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jue Wang <juew@google.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYouquan Song <youquan.song@intel.com>

75a005c5

19 7月, 2022 2 次提交

openeuler_defconfig: Enable SENSORS_ZHAOXIN_CPUTEMP as module by default · d660fbb1

由 LeoLiuoc 提交于 7月 19, 2022

zhaoxin inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I52DS7
CVE: NA

--------------------------------------------

Set CONFIG_SENSORS_ZHAOXIN_CPUTEMP to 'm' by default in
openeuler_defconfig.
Signed-off-by: NLeoLiuoc <LeoLiu-oc@zhaoxin.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>

d660fbb1

KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 85fc72e3

由 Sean Christopherson 提交于 7月 19, 2022

stable inclusion
from stable-v5.10.112
commit 342454231ee5f2c2782f5510cab2e7a968486fef
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5HL0X

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=342454231ee5f2c2782f5510cab2e7a968486fef

--------------------------------

commit 1d0e8480 upstream.

Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as
-1 is technically undefined behavior when its value is read out by
param_get_bool(), as boolean values are supposed to be '0' or '1'.

Alternatively, KVM could define a custom getter for the param, but the
auto value doesn't depend on the vendor module in any way, and printing
"auto" would be unnecessarily unfriendly to the user.

In addition to fixing the undefined behavior, resolving the auto value
also fixes the scenario where the auto value resolves to N and no vendor
module is loaded.  Previously, -1 would result in Y being printed even
though KVM would ultimately disable the mitigation.

Rename the existing MMU module init/exit helpers to clarify that they're
invoked with respect to the vendor module, and add comments to document
why KVM has two separate "module init" flows.

  =========================================================================
  UBSAN: invalid-load in kernel/params.c:320:33
  load of value 255 is not a valid value for type '_Bool'
  CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   ubsan_epilogue+0x5/0x40
   __ubsan_handle_load_invalid_value.cold+0x43/0x48
   param_get_bool.cold+0xf/0x14
   param_attr_show+0x55/0x80
   module_attr_show+0x1c/0x30
   sysfs_kf_seq_show+0x93/0xc0
   seq_read_iter+0x11c/0x450
   new_sync_read+0x11b/0x1a0
   vfs_read+0xf0/0x190
   ksys_read+0x5f/0xe0
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  =========================================================================

Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Cc: stable@vger.kernel.org
Reported-by: NBruno Goncalves <bgoncalv@redhat.com>
Reported-by: NJan Stancek <jstancek@redhat.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220331221359.3912754-1-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

85fc72e3

18 7月, 2022 6 次提交

x86/speculation: Restore speculation related MSRs during S3 resume · d3627300

由 Pawan Gupta 提交于 7月 14, 2022

stable inclusion
from stable-v5.10.111
commit fc4bdaed4d4ea4209e65115bd3948a1e4ac51cbb
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=fc4bdaed4d4ea4209e65115bd3948a1e4ac51cbb

--------------------------------

commit e2a1256b upstream.

After resuming from suspend-to-RAM, the MSRs that control CPU's
speculative execution behavior are not being restored on the boot CPU.

These MSRs are used to mitigate speculative execution vulnerabilities.
Not restoring them correctly may leave the CPU vulnerable.  Secondary
CPU's MSRs are correctly being restored at S3 resume by
identify_secondary_cpu().

During S3 resume, restore these MSRs for boot CPU when restoring its
processor state.

Fixes: 77243971 ("x86/bugs/intel: Set proper CPU features and setup RDS")
Reported-by: NNeelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: NPawan Gupta <pawan.kumar.gupta@linux.intel.com>
Tested-by: NNeelima Krishnan <neelima.krishnan@intel.com>
Acked-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NDave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: NWei Li <liwei391@huawei.com>

d3627300

x86/pm: Save the MSR validity status at context setup · eb917904

由 Pawan Gupta 提交于 7月 14, 2022

stable inclusion
from stable-v5.10.111
commit 8c9e26c890ba130ad98137770d221dac7853540c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8c9e26c890ba130ad98137770d221dac7853540c

--------------------------------

commit 73924ec4 upstream.

The mechanism to save/restore MSRs during S3 suspend/resume checks for
the MSR validity during suspend, and only restores the MSR if its a
valid MSR.  This is not optimal, as an invalid MSR will unnecessarily
throw an exception for every suspend cycle.  The more invalid MSRs,
higher the impact will be.

Check and save the MSR validity at setup.  This ensures that only valid
MSRs that are guaranteed to not throw an exception will be attempted
during suspend.

Fixes: 7a9c2dd0 ("x86/pm: Introduce quirk framework to save/restore extra MSR registers around suspend/resume")
Suggested-by: NDave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: NPawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: NDave Hansen <dave.hansen@linux.intel.com>
Acked-by: NBorislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: NWei Li <liwei391@huawei.com>

eb917904

x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy · bca15008

由 Nathan Chancellor 提交于 7月 14, 2022

stable inclusion
from stable-v5.10.111
commit 9cb90f9ad5975ddfc90d0906b40e9c71c2eca44e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9cb90f9ad5975ddfc90d0906b40e9c71c2eca44e

--------------------------------

[ Upstream commit aaeed6ec ]

There are two outstanding issues with CONFIG_X86_X32_ABI and
llvm-objcopy, with similar root causes:

1. llvm-objcopy does not properly convert .note.gnu.property when going
   from x86_64 to x86_x32, resulting in a corrupted section when
   linking:

   https://github.com/ClangBuiltLinux/linux/issues/1141

2. llvm-objcopy produces corrupted compressed debug sections when going
   from x86_64 to x86_x32, also resulting in an error when linking:

   https://github.com/ClangBuiltLinux/linux/issues/514

After commit 41c5ef31ad71 ("x86/ibt: Base IBT bits"), the
.note.gnu.property section is always generated when
CONFIG_X86_KERNEL_IBT is enabled, which causes the first issue to become
visible with an allmodconfig build:

  ld.lld: error: arch/x86/entry/vdso/vclock_gettime-x32.o:(.note.gnu.property+0x1c): program property is too short

To avoid this error, do not allow CONFIG_X86_X32_ABI to be selected when
using llvm-objcopy. If the two issues ever get fixed in llvm-objcopy,
this can be turned into a feature check.
Signed-off-by: NNathan Chancellor <nathan@kernel.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220314194842.3452-3-nathan@kernel.orgSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: NWei Li <liwei391@huawei.com>

bca15008

xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32 · 39f75541

由 Dongli Zhang 提交于 7月 14, 2022

stable inclusion
from stable-v5.10.111
commit a2a0e04f6478e8c1038db64717f3fafd55de1420
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a2a0e04f6478e8c1038db64717f3fafd55de1420

--------------------------------

[ Upstream commit eed05744 ]

The sched_clock() can be used very early since commit 857baa87
("sched/clock: Enable sched clock early"). In addition, with commit
38669ba2 ("x86/xen/time: Output xen sched_clock time from 0"), kdump
kernel in Xen HVM guest may panic at very early stage when accessing
&__this_cpu_read(xen_vcpu)->time as in below:

setup_arch()
 -> init_hypervisor_platform()
     -> x86_init.hyper.init_platform = xen_hvm_guest_init()
         -> xen_hvm_init_time_ops()
             -> xen_clocksource_read()
                 -> src = &__this_cpu_read(xen_vcpu)->time;

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.

However, when Xen HVM guest panic on vcpu >= 32, since
xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.

This patch calls xen_hvm_init_time_ops() again later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered when the boot vcpu is >= 32.

This issue can be reproduced on purpose via below command at the guest
side when kdump/kexec is enabled:

"taskset -c 33 echo c > /proc/sysrq-trigger"

The bugfix for PVM is not implemented due to the lack of testing
environment.

[boris: xen_hvm_init_time_ops() returns on errors instead of jumping to end]

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: NDongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20220302164032.14569-3-dongli.zhang@oracle.comSigned-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: NWei Li <liwei391@huawei.com>

39f75541

KVM: x86/emulator: Emulate RDPID only if it is enabled in guest · ab5144b6

由 Hou Wenlong 提交于 7月 14, 2022

stable inclusion
from stable-v5.10.111
commit a24479c5e9f44c2a799ca2178b1ba9fe1290706e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a24479c5e9f44c2a799ca2178b1ba9fe1290706e

--------------------------------

[ Upstream commit a836839c ]

When RDTSCP is supported but RDPID is not supported in host,
RDPID emulation is available. However, __kvm_get_msr() would
only fail when RDTSCP/RDPID both are disabled in guest, so
the emulator wouldn't inject a #UD when RDPID is disabled but
RDTSCP is enabled in guest.

Fixes: fb6d4d34 ("KVM: x86: emulate RDPID")
Signed-off-by: NHou Wenlong <houwenlong.hwl@antgroup.com>
Message-Id: <1dfd46ae5b76d3ed87bde3154d51c64ea64c99c1.1646226788.git.houwenlong.hwl@antgroup.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: NWei Li <liwei391@huawei.com>

ab5144b6

KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs · 162443c3

由 Jim Mattson 提交于 7月 14, 2022

stable inclusion
from stable-v5.10.111
commit 66b0fa6b22187caf4d789bdac8dcca49940628b0
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=66b0fa6b22187caf4d789bdac8dcca49940628b0

--------------------------------

[ Upstream commit 9b026073 ]

AMD EPYC CPUs never raise a #GP for a WRMSR to a PerfEvtSeln MSR. Some
reserved bits are cleared, and some are not. Specifically, on
Zen3/Milan, bits 19 and 42 are not cleared.

When emulating such a WRMSR, KVM should not synthesize a #GP,
regardless of which bits are set. However, undocumented bits should
not be passed through to the hardware MSR. So, rather than checking
for reserved bits and synthesizing a #GP, just clear the reserved
bits.

This may seem pedantic, but since KVM currently does not support the
"Host/Guest Only" bits (41:40), it is necessary to clear these bits
rather than synthesizing #GP, because some popular guests (e.g Linux)
will set the "Host Only" bit even on CPUs that don't support
EFER.SVME, and they don't expect a #GP.

For example,

root@Ubuntu1804:~# perf stat -e r26 -a sleep 1

 Performance counter stats for 'system wide':

                 0      r26

       1.001070977 seconds time elapsed

Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379957] unchecked MSR access error: WRMSR to 0xc0010200 (tried to write 0x0000020000130026) at rIP: 0xffffffff9b276a28 (native_write_msr+0x8/0x30)
Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379958] Call Trace:
Feb 23 03:59:58 Ubuntu1804 kernel: [  405.379963]  amd_pmu_disable_event+0x27/0x90

Fixes: ca724305 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Reported-by: NLotus Fenn <lotusf@google.com>
Signed-off-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NLike Xu <likexu@tencent.com>
Reviewed-by: NDavid Dunn <daviddunn@google.com>
Message-Id: <20220226234131.2167175-1-jmattson@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: NWei Li <liwei391@huawei.com>

162443c3

17 7月, 2022 5 次提交

x86/sgx: Hook arch_memory_failure() into mainline code · 5bb9e11f

由 Tony Luck 提交于 10月 26, 2021

mainline inclusion
from mainline-5.17
commit 03b122da
category: feature
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5EZFM
CVE: NA

Intel-SIG: commit 03b122da x86/sgx: Hook arch_memory_failure()
into mainline code.
Backport for SGX MCA recovery co-existence support

--------------------------------

Add a call inside memory_failure() to call the arch specific code
to check if the address is an SGX EPC page and handle it.

Note the SGX EPC pages do not have a "struct page" entry, so the hook
goes in at the same point as the device mapping hook.

Pull the call to acquire the mutex earlier so the SGX errors are also
protected.

Make set_mce_nospec() skip SGX pages when trying to adjust
the 1:1 map.
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: NJarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: NReinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-6-tony.luck@intel.comSigned-off-by: NZhiquan Li <zhiquan1.li@intel.com>

5bb9e11f

x86/sgx: Add SGX infrastructure to recover from poison · 4fccae8c

由 Tony Luck 提交于 10月 26, 2021

mainline inclusion
from mainline-5.17
commit a495cbdf
category: feature
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5EZFM
CVE: NA

Intel-SIG: commit a495cbdf x86/sgx: Add SGX infrastructure to
recover from poison.
Backport for SGX MCA recovery co-existence support

--------------------------------

Provide a recovery function sgx_memory_failure(). If the poison was
consumed synchronously then send a SIGBUS. Note that the virtual
address of the access is not included with the SIGBUS as is the case
for poison outside of SGX enclaves. This doesn't matter as addresses
of code/data inside an enclave is of little to no use to code executing
outside the (now dead) enclave.

Poison found in a free page results in the page being moved from the
free list to the per-node poison page list.
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: NJarkko Sakkinen <jarkko@kernel.org>
Tested-by: NReinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-5-tony.luck@intel.comSigned-off-by: NZhiquan Li <zhiquan1.li@intel.com>

4fccae8c

x86/sgx: Initial poison handling for dirty and free pages · dc1b46e0

由 Tony Luck 提交于 10月 26, 2021

mainline inclusion
from mainline-5.17
commit 992801ae
category: feature
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5EZFM
CVE: NA

Intel-SIG: commit 992801ae x86/sgx: Initial poison handling for
dirty and free pages.
Backport for SGX MCA recovery co-existence support

--------------------------------

A memory controller patrol scrubber can report poison in a page
that isn't currently being used.

Add "poison" field in the sgx_epc_page that can be set for an
sgx_epc_page. Check for it:
1) When sanitizing dirty pages
2) When freeing epc pages

Poison is a new field separated from flags to avoid having to make all
updates to flags atomic, or integrate poison state changes into some
other locking scheme to protect flags (Currently just sgx_reclaimer_lock
which protects the SGX_EPC_PAGE_RECLAIMER_TRACKED bit in page->flags).

In both cases place the poisoned page on a per-node list of poisoned
epc pages to make sure it will not be reallocated.
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: NJarkko Sakkinen <jarkko@kernel.org>
Tested-by: NReinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-4-tony.luck@intel.comSigned-off-by: NZhiquan Li <zhiquan1.li@intel.com>

dc1b46e0

x86/sgx: Add infrastructure to identify SGX EPC pages · e87c3256

由 Tony Luck 提交于 10月 26, 2021

mainline inclusion
from mainline-5.17
commit 40e0e784
category: feature
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5EZFM
CVE: NA

Intel-SIG: commit 40e0e784 x86/sgx: Add infrastructure to identify
SGX EPC pages.
Backport for SGX MCA recovery co-existence support

--------------------------------

X86 machine check architecture reports a physical address when there
is a memory error. Handling that error requires a method to determine
whether the physical address reported is in any of the areas reserved
for EPC pages by BIOS.

SGX EPC pages do not have Linux "struct page" associated with them.

Keep track of the mapping from ranges of EPC pages to the sections
that contain them using an xarray. N.B. adds CONFIG_XARRAY_MULTI to
the SGX dependecies. So "select" that in arch/x86/Kconfig for X86/SGX.

Create a function arch_is_platform_page() that simply reports whether an
address is an EPC page for use elsewhere in the kernel. The ACPI error
injection code needs this function and is typically built as a module,
so export it.

Note that arch_is_platform_page() will be slower than other similar
"what type is this page" functions that can simply check bits in the
"struct page".  If there is some future performance critical user of
this function it may need to be implemented in a more efficient way.

Note also that the current implementation of xarray allocates a few
hundred kilobytes for this usage on a system with 4GB of SGX EPC memory
configured. This isn't ideal, but worth it for the code simplicity.
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: NJarkko Sakkinen <jarkko@kernel.org>
Tested-by: NReinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-3-tony.luck@intel.comSigned-off-by: NZhiquan Li <zhiquan1.li@intel.com>

e87c3256

x86/sgx: Add new sgx_epc_page flag bit to mark free pages · b7d30662

由 Tony Luck 提交于 10月 26, 2021

mainline inclusion
from mainline-5.17
commit d6d261bd
category: feature
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5EZFM
CVE: NA

Intel-SIG: commit d6d261bd x86/sgx: Add new sgx_epc_page flag
bit to mark free pages.
Backport for SGX MCA recovery co-existence support

--------------------------------

SGX EPC pages go through the following life cycle:

        DIRTY ---> FREE ---> IN-USE --\
                    ^                 |
                    \-----------------/

Recovery action for poison for a DIRTY or FREE page is simple. Just
make sure never to allocate the page. IN-USE pages need some extra
handling.

Add a new flag bit SGX_EPC_PAGE_IS_FREE that is set when a page
is added to a free list and cleared when the page is allocated.

Notes:

1) These transitions are made while holding the node->lock so that
   future code that checks the flags while holding the node->lock
   can be sure that if the SGX_EPC_PAGE_IS_FREE bit is set, then the
   page is on the free list.

2) Initially while the pages are on the dirty list the
   SGX_EPC_PAGE_IS_FREE bit is cleared.
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: NJarkko Sakkinen <jarkko@kernel.org>
Tested-by: NReinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-2-tony.luck@intel.comSigned-off-by: NZhiquan Li <zhiquan1.li@intel.com>

b7d30662

13 7月, 2022 2 次提交

Intel: AVX VNNI: x86: Enumerate AVX Vector Neural Network instructions · de9dc9ca

由 Kyung Min Park 提交于 6月 16, 2022

mainline inclusion
from mainline-5.11
commit b85a0425
category: feature
feature: SPR New instructions
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I596EH
CVE: N/A

Intel-SIG: commit b85a0425 x86: Enumerate AVX Vector Neural Network instructions
Backport for SPR core AVX VNNI support.

----------------------------

Add AVX version of the Vector Neural Network (VNNI) Instructions.
A processor supports AVX VNNI instructions if CPUID.0x07.0x1:EAX[4] is
present. The following instructions are available when this feature is
present.
1. VPDPBUS: Multiply and Add Unsigned and Signed Bytes
2. VPDPBUSDS: Multiply and Add Unsigned and Signed Bytes with Saturation
3. VPDPWSSD: Multiply and Add Signed Word Integers
4. VPDPWSSDS: Multiply and Add Signed Integers with Saturation

The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx_vnni" in /proc/cpuinfo.

This instruction is currently documented in the latest "extensions"
manual (ISE). It will appear in the "main" manual (SDM) in the future.
Signed-off-by: NKyung Min Park <kyung.min.park@intel.com>
Signed-off-by: NYang Zhong <yang.zhong@intel.com>
Reviewed-by: NTony Luck <tony.luck@intel.com>
Message-Id: <20210105004909.42000-2-yang.zhong@intel.com>
Acked-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NLuming Yu <luming.yu@intel.com>

de9dc9ca

Intel: 5G ISA: x86: Enumerate AVX512 FP16 CPUID feature flag · 82b4f867

由 Kyung Min Park 提交于 6月 16, 2022

mainline inclusion
from mainline-5.11
commit e1b35da5
category: feature
feature: SPR New instructions
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I596EH
CVE: N/A

Intel-SIG: commit e1b35da5 x86: Enumerate AVX512 FP16 CPUID feature flag
Backport for SPR core 5G ISA support.

-------------------------------------

Enumerate AVX512 Half-precision floating point (FP16) CPUID feature
flag. Compared with using FP32, using FP16 cut the number of bits
required for storage in half, reducing the exponent from 8 bits to 5,
and the mantissa from 23 bits to 10. Using FP16 also enables developers
to train and run inference on deep learning models fast when all
precision or magnitude (FP32) is not needed.

A processor supports AVX512 FP16 if CPUID.(EAX=7,ECX=0):EDX[bit 23]
is present. The AVX512 FP16 requires AVX512BW feature be implemented
since the instructions for manipulating 32bit masks are associated with
AVX512BW.

The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx512_fp16" in /proc/cpuinfo.

Signed-off-by: Kyung Min Park kyung.min.park@intel.com
Acked-by: Dave Hansen dave.hansen@intel.com
Reviewed-by: Tony Luck tony.luck@intel.com
Message-Id: 20201208033441.28207-2-kyung.min.park@intel.com
Acked-by: Borislav Petkov bp@suse.de
Signed-off-by: Paolo Bonzini pbonzini@redhat.com
Signed-off-by: Luming Yu luming.yu@intel.com

82b4f867

08 7月, 2022 1 次提交

x86/cpu: Add definitions for the Intel Hardware Feedback Interface · 967320d3

由 Ricardo Neri 提交于 1月 27, 2022

mainline inclusion
from mainline-5.18
commit 7b8f40b3
category: feature
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I5DSOL

Intel_SIG: commit 7b8f40b3 x86/cpu: Add definitions for the Intel
Hardware Feedback Interface.
Backport for Intel HFI (Hardware Feedback Interface) support

-------------------------------------

Add the CPUID feature bit and the model-specific registers needed to
identify and configure the Intel Hardware Feedback Interface.
Acked-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NRicardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Nyingbao jia <yingbao.jia@intel.com>
Signed-off-by: NJun Tian <jun.j.tian@intel.com>

967320d3

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功