- 01 June 2023, 1 commit
-
-
Submitted by Hui Tang

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7A718

--------------------------------

Performance regresses with the 'Fixes' commit when running
"./lat_ctx -P $SYNC_MAX -s 64 16". The 'Fixes' commit allocates memory
for p->prefer_cpus even if "prefer_cpus" has not been set. Before the
'Fixes' commit, only "p->prefer_cpus" was tested; afterwards,
"!cpumask_empty(p->prefer_cpus)" is tested as well, which causes the
performance degradation:

    select_task_rq_fair
      ->set_task_select_cpus
        ->prefer_cpus_valid  ---- tests cpumask_empty(p->prefer_cpus)

Fixes: ebeb84ad ("cpuset: Introduce new interface for scheduler ...")
Signed-off-by: Hui Tang <tanghui20@huawei.com>
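
A minimal sketch of the wakeup-path check discussed above, assuming
openEuler's dynamic-affinity fields (the exact openEuler code may
differ):

    /* illustrative sketch of prefer_cpus_valid() */
    static inline bool prefer_cpus_valid(struct task_struct *p)
    {
        return p->prefer_cpus &&
               !cpumask_empty(p->prefer_cpus) && /* the costly extra test */
               !cpumask_equal(p->prefer_cpus, p->cpus_ptr) &&
               cpumask_subset(p->prefer_cpus, p->cpus_ptr);
    }

Testing the cheap p->prefer_cpus pointer first lets the common
"prefer_cpus not set" case bail out before touching any cpumask.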
-
- 18 May 2023, 1 commit
-
-
Submitted by Tom_zc

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6N4EP
CVE: NA

--------------------------------

openEuler enables bcache by default on x86 platforms. Fix the kABI
change caused by enabling bcache.

Signed-off-by: Chao Zhu <tom_toworld@163.com>
-
- 08 April 2023, 3 commits
-
-
Submitted by tanghui

hulk inclusion
category: feature
bugzilla: 186575, https://gitee.com/openeuler/kernel/issues/I526XC

--------------------------------

Signed-off-by: tanghui <tanghui20@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
-
Submitted by tanghui

hulk inclusion
category: feature
bugzilla: 186575, https://gitee.com/openeuler/kernel/issues/I526XC

--------------------------------

Compare the task group's 'util_avg' on the preferred CPUs with the
capacity of the preferred CPUs, and dynamically adjust the CPU range
for the task wakeup process.

Signed-off-by: tanghui <tanghui20@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
-
Submitted by tanghui

hulk inclusion
category: feature
bugzilla: 186575, https://gitee.com/openeuler/kernel/issues/I526XC

--------------------------------

Add a 'prefer_cpus' sysfs file and related interfaces in the cgroup
cpuset.

Signed-off-by: tanghui <tanghui20@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
-
- 28 February 2023, 2 commits
-
-
Submitted by Li Lingfeng

Offering: HULK
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6BTWC

-------------------

Commit 5125c6de8709 ("[Backport] io_uring: import 5.15-stable
io_uring") changes some structs, so we need to fix the kABI breakage.

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Submitted by Jens Axboe

stable inclusion
from stable-v5.10.162
commit 788d0824269bef539fe31a785b1517882eafed93
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6BTWC
CVE: CVE-2023-0240
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.10.167&id=788d0824269bef539fe31a785b1517882eafed93

--------------------------------

No upstream commit exists.

This imports the io_uring codebase from 5.15.85, wholesale. Changes
from that code base:

- Drop IOCB_ALLOC_CACHE, we don't have that in 5.10.
- Drop MKDIRAT/SYMLINKAT/LINKAT. Would require further VFS backports,
  and we don't support these in 5.10 to begin with.
- sock_from_file() old style calling convention.
- Use compat_get_bitmap() only for CONFIG_COMPAT=y

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
- 02 December 2022, 1 commit
-
-
Submitted by Hui Su

stable inclusion
from stable-v5.10.140
commit c30c0f720533c6bcab2e16b90d302afb50a61d2a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I63FTT
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c30c0f720533c6bcab2e16b90d302afb50a61d2a

--------------------------------

commit 0e387249 upstream.

Since commit 2279f540 ("sched/deadline: Fix priority inheritance with
multiple scheduling classes"), we should not keep it here.

Signed-off-by: Hui Su <suhui_kernel@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lore.kernel.org/r/20220107095254.GA49258@localhost.localdomain
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
-
- 29 November 2022, 2 commits
-
-
Submitted by Chen Hui

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add three hooks of sched type in select_task_rq_fair(), as follows:

'cfs_select_rq'
    Replace the original core-selection policy or implement dynamic
    CPU affinity.

'cfs_select_rq_exit'
    Restore the CPU affinity of the task before exiting
    select_task_rq_fair(). To be used with the 'cfs_select_rq' hook
    to implement dynamic CPU affinity.

'cfs_wake_affine'
    Determine on which CPU the task can run soonest. Allows the user
    to implement different policies.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
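
A hedged sketch of where the hooks listed above would sit in the
wakeup path (hook names from the text; the run_*_hook() helpers and
their signatures are assumptions for illustration, not the actual
openEuler definitions):

    /* illustrative only */
    static int
    select_task_rq_fair(struct task_struct *p, int prev_cpu,
                        int sd_flag, int wake_flags)
    {
        /* 'cfs_select_rq': may replace the core-selection policy */
        int new_cpu = run_cfs_select_rq_hook(p, prev_cpu, wake_flags);

        if (new_cpu >= 0)
            goto out;

        /* ... original fast-path/slow-path CPU selection ... */
        new_cpu = prev_cpu;
    out:
        /* 'cfs_select_rq_exit': restore the task's CPU affinity */
        run_cfs_select_rq_exit_hook(p);
        return new_cpu;
    }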
-
Submitted by Chen Hui

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add a collection of cpumask ops, such as cpumask_empty, cpumask_and,
cpumask_andnot, cpumask_subset, cpumask_equal and cpumask_copy.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
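
For reference, a minimal sketch of the in-kernel cpumask operations the
collection covers (this is the standard <linux/cpumask.h> API; how the
ops are exposed to bpf programs is not shown here):

    #include <linux/cpumask.h>

    static void cpumask_ops_demo(const struct cpumask *a,
                                 const struct cpumask *b)
    {
        cpumask_var_t dst;

        if (!alloc_cpumask_var(&dst, GFP_KERNEL))
            return;

        cpumask_and(dst, a, b);      /* dst = a & b */
        cpumask_andnot(dst, a, b);   /* dst = a & ~b */
        cpumask_copy(dst, a);        /* dst = a */

        pr_info("empty=%d subset=%d equal=%d\n",
                cpumask_empty(dst),
                cpumask_subset(a, b), /* a subset of b? */
                cpumask_equal(a, b));

        free_cpumask_var(dst);
    }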
-
- 25 November 2022, 3 commits
-
-
Submitted by Chen Hui

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add helper functions to get CPU statistics, as follows:
1. acquire cfs/rt/irq CPU load statistics;
2. acquire multiple types of nr_running statistics;
3. acquire CPU idle statistics;
4. acquire CPU capacity.

Based on CPU statistics in different dimensions, specific scheduling
policies can be implemented in a bpf program.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
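
A hedged sketch of what such a statistics snapshot might look like
(struct and field names are assumptions for illustration, not the
actual openEuler definitions):

    /* illustrative only */
    struct bpf_sched_cpu_stats {
        /* 1. load statistics */
        unsigned long cfs_load;
        unsigned long rt_load;
        unsigned long irq_load;
        /* 2. nr_running statistics */
        unsigned int  nr_running;
        unsigned int  cfs_nr_running;
        unsigned int  rt_nr_running;
        /* 3. idle statistics */
        unsigned int  idle;
        /* 4. capacity */
        unsigned long capacity;
    };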
-
Submitted by Chen Hui

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add a user interface for the task group tag, bridging the information
gap between user mode and kernel mode.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
-
Submitted by Chen Hui

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add a tag for the task, useful to identify special tasks. Users can
use the filesystem interface to mark different tags for specific
workloads. Kernel subsystems can use the set_* helpers to mark it,
too. A bpf prog obtains the tags to detect different workloads.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
-
- 24 November 2022, 2 commits
-
-
Submitted by Xiaochen Shen

category: bugfix
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I596WO
CVE: NA

Intel-SIG: sched: Fix kABI for task->pasid_activated.

-------------------------------------------------

In commit ("sched: Define and initialize a flag to identify valid
PASID in the task"), a new single-bit field 'pasid_activated' was
added to the middle of 'struct task_struct', which causes kABI
breakage.

Fix it by using the KABI_FILL_HOLE() API for task->pasid_activated.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
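
A minimal sketch of the fix, assuming openEuler's KABI_FILL_HOLE()
helper; where exactly the bit lands inside task_struct is illustrative:

    /* illustrative only */
    struct task_struct {
        /* ... existing bitfields ... */
        KABI_FILL_HOLE(unsigned pasid_activated:1)
        /* ... */
    };

KABI_FILL_HOLE() places the new bitfield into existing structure
padding, so the structure layout, and therefore the kABI, stays
unchanged.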
-
Submitted by Peter Zijlstra

mainline inclusion
from mainline-v5.18
commit a3d29e82
category: feature
bugzilla: https://gitee.com/openeuler/intel-kernel/issues/I596WO
CVE: NA

Intel-SIG: commit a3d29e82 sched: Define and initialize a flag to
identify valid PASID in the task. Incremental backporting patches for
DSA/IAA on Intel Xeon platform.

--------------------------------

Add a new single-bit field to the task structure to track whether this
task has initialized the IA32_PASID MSR to the mm's PASID.

Initialize the field to zero when creating a new task with fork/clone.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220207230254.3342514-8-fenghua.yu@intel.com
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
-
- 21 November 2022, 4 commits
-
-
Submitted by Ma Wupeng

mm: oom_kill: fix KABI broken by "oom_kill.c: futex: delay the OOM
reaper to allow time for proper futex cleanup"

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I61FDP
CVE: NA

-------------------------------

Move oom_reaper_timer from task_struct to task_struct_resvd to fix the
kABI breakage.

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Nico Pache

mainline inclusion
from mainline-v5.18-rc4
commit e4a38402
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I61FDP
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e4a38402c36e42df28eb1a5394be87e6571fb48a

--------------------------------

The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
can be targeted by the oom reaper. This mapping is used to store the
futex robust list head; the kernel does not keep a copy of the robust
list and instead references a userspace address to maintain the
robustness during a process death.

A race can occur between exit_mm and the oom reaper that allows the
oom reaper to free the memory of the futex robust list before the
exit path has handled the futex death:

    CPU1                            CPU2
    ----------------------------------------------------------------
    page_fault
    do_exit "signal"
    wake_oom_reaper
                                    oom_reaper
                                    oom_reap_task_mm (invalidates mm)
    exit_mm
    exit_mm_release
    futex_exit_release
    futex_cleanup
    exit_robust_list
    get_user (EFAULT- can't access memory)

If the get_user EFAULT's, the kernel will be unable to recover the
waiters on the robust_list, leaving userspace mutexes hung
indefinitely.

Delay the OOM reaper, allowing more time for the exit path to perform
the futex cleanup.

Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer

Based on a patch by Michal Hocko.

Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1]
Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
Fixes: 21292580 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Co-developed-by: Joel Savitz <jsavitz@redhat.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Savitz <jsavitz@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
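
A condensed sketch of the delay mechanism, modeled on the upstream fix
(constant and flag names as in mainline; error paths trimmed):

    #define OOM_REAPER_DELAY (2*HZ)

    static void queue_oom_reaper(struct task_struct *tsk)
    {
        /* mm is already queued? */
        if (test_and_set_bit(MMF_OOM_REAP_QUEUED,
                             &tsk->signal->oom_mm->flags))
            return;

        get_task_struct(tsk);
        timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
        tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
        add_timer(&tsk->oom_reaper_timer);
    }

The timer gives exit_mm_release()/futex_cleanup() a two-second head
start before wake_oom_reaper() hands the mm to the reaper thread.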
-
Submitted by Zheng Zucheng

hulk inclusion
category: feature
bugzilla: 187196, https://gitee.com/openeuler/kernel/issues/I612CS

-------------------------------

Allocate a new task_struct_resvd object for the recently cloned task.

Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Nanyong Sun <sunnanyong@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Tong Tiangen

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5GB28
CVE: NA

-------------------------------

In dump_user_range() processing, the data of the user process is
dumped to the corefile. When a hardware memory error is encountered
during the dump, only the relevant processes are affected, so killing
the user process and isolating the user page with hardware memory
errors is a more reasonable choice than a kernel panic.

The typical usage scenario of dump_user_range() is coredump. Writing
the coredump file to a filesystem depends on that filesystem's
write_iter implementation. This patch only supports two typical fs
write functions (_copy_from_iter/iov_iter_copy_from_user_atomic),
which are used by ext4/tmpfs/pipefs.

Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
-
- 18 November 2022, 1 commit
-
-
Submitted by Waiman Long

stable inclusion
from stable-v5.10.137
commit 336626564b58071b8980a4e6a31a8f5d92705d9b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60PLB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=336626564b58071b8980a4e6a31a8f5d92705d9b

--------------------------------

[ Upstream commit b6e8d40d ]

With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
that the cpuset will just use the effective CPUs of its parent. So
cpuset_can_attach() can call task_can_attach() with an empty mask.
This can lead to cpumask_any_and() returning nr_cpu_ids causing the
call to dl_bw_of() to crash due to percpu value access of an out of
bound CPU value. For example:

    [80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
      :
    [80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
      :
    [80468.207946] Call Trace:
    [80468.208947]  cpuset_can_attach+0xa0/0x140
    [80468.209953]  cgroup_migrate_execute+0x8c/0x490
    [80468.210931]  cgroup_update_dfl_csses+0x254/0x270
    [80468.211898]  cgroup_subtree_control_write+0x322/0x400
    [80468.212854]  kernfs_fop_write_iter+0x11c/0x1b0
    [80468.213777]  new_sync_write+0x11f/0x1b0
    [80468.214689]  vfs_write+0x1eb/0x280
    [80468.215592]  ksys_write+0x5f/0xe0
    [80468.216463]  do_syscall_64+0x5c/0x80
    [80468.224287]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix that by using effective_cpus instead. For cgroup v1,
effective_cpus is the same as cpus_allowed. For v2, effective_cpus is
the real cpumask to be used by tasks within the cpuset anyway.

Also update task_can_attach()'s 2nd argument name to cs_effective_cpus
to reflect the change. In addition, a check is added to
task_can_attach() to guard against the possibility that
cpumask_any_and() may return a value >= nr_cpu_ids.

Fixes: 7f51412a ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
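
A condensed sketch of the added guard, modeled on the upstream commit
(surrounding deadline-bandwidth logic trimmed):

    int task_can_attach(struct task_struct *p,
                        const struct cpumask *cs_effective_cpus)
    {
        int ret = 0;
        /* ... */
        if (dl_task(p) && !cpumask_intersects(task_rq(p)->rd->span,
                                              cs_effective_cpus)) {
            int cpu = cpumask_any_and(cpu_active_mask,
                                      cs_effective_cpus);

            if (unlikely(cpu >= nr_cpu_ids))
                return -EINVAL; /* empty mask: nothing to check */

            ret = dl_cpu_busy(cpu, p);
        }
        /* ... */
        return ret;
    }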
-
- 30 September 2022, 6 commits
-
-
Submitted by Lin Shengwang

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5OOWG
CVE: NA

--------------------------------------------------------------------------

Signed-off-by: Lin Shengwang <linshengwang1@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Chris Hyser

mainline inclusion
from mainline-v5.14-rc1
commit 7ac592aa
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5OOWG
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7ac592aa35a684ff1858fb9ec282886b9e3575ac

--------------------------------------------------------------------------

This patch provides support for setting and copying core scheduling
'task cookies' between threads (PID), processes (TGID), and process
groups (PGID).

The value of core scheduling isn't that tasks don't share a core,
'nosmt' can do that. The value lies in exploiting all the sharing
opportunities that exist to recover possible lost performance and
that requires a degree of flexibility in the API.

From a security perspective (and there are others), the thread,
process and process group distinction is an existent hierarchal
categorization of tasks that reflects many of the security concerns
about 'data sharing'. For example, protecting against cache-snooping
by a thread that can just read the memory directly isn't all that
useful.

With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a
mechanism to create and share cookies. CREATE/SHARE_TO specify a
target pid with enum pidtype used to specify the scope of the
targeted tasks. For example, PIDTYPE_TGID will share the cookie with
the process and all of its threads as typically desired in a security
scenario.

API:

    prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
    prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
    prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
    prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.

For return values, EINVAL, ENOMEM are what they say. ESRCH means the
tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission
access to tgtpid/srcpid. ENODEV indicates your machine lacks SMT.

[peterz: complete rewrite]
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123309.039845339@infradead.org
Signed-off-by: Lin Shengwang <linshengwang1@huawei.com>
Reviewed-by: lihua <hucool.lihua@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
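
A short userspace sketch of the API quoted above (the PR_SCHED_CORE_*
values below match the uapi header added by this feature; fallback
defines are provided in case the toolchain headers predate it):

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SCHED_CORE
    #define PR_SCHED_CORE           62
    #define PR_SCHED_CORE_GET       0
    #define PR_SCHED_CORE_CREATE    1
    #define PR_SCHED_CORE_SHARE_TO  2
    #endif

    int main(void)
    {
        unsigned long cookie = 0;

        /* Create a cookie for the whole current process:
         * pidtype 1 == PIDTYPE_TGID, pid 0 == current. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, 1, NULL))
            perror("PR_SCHED_CORE_CREATE");

        /* Read the cookie back for this thread (0 == PIDTYPE_PID). */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0, 0, &cookie))
            perror("PR_SCHED_CORE_GET");

        printf("core cookie: %#lx\n", cookie);
        return 0;
    }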
-
Submitted by Peter Zijlstra

mainline inclusion
from mainline-v5.14-rc1
commit 85dd3f61
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5OOWG
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=85dd3f61203c5cfa72b308ff327b5fbf3fc1ce5e

--------------------------------------------------------------------------

Note that sched_core_fork() is called from under tasklist_lock, and
not from sched_fork() earlier. This avoids a few races later.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.980003687@infradead.org
Signed-off-by: Lin Shengwang <linshengwang1@huawei.com>
Reviewed-by: lihua <hucool.lihua@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Peter Zijlstra

mainline inclusion
from mainline-v5.14-rc1
commit 6e33cad0
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5OOWG
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6e33cad0af49336952e5541464bd02f5b5fd433e

--------------------------------------------------------------------------

In order to not have to use pid_struct, create a new, smaller,
structure to manage task cookies for core scheduling.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.919768100@infradead.org
Signed-off-by: Lin Shengwang <linshengwang1@huawei.com>
Reviewed-by: lihua <hucool.lihua@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
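
A condensed sketch of the refcounted cookie object, modeled on the
upstream kernel/sched/core_sched.c (locking and rq updates trimmed):

    struct sched_core_cookie {
        refcount_t refcnt;
    };

    static unsigned long sched_core_alloc_cookie(void)
    {
        struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);

        if (!ck)
            return 0;

        refcount_set(&ck->refcnt, 1);
        return (unsigned long)ck;   /* the opaque cookie value */
    }

    static void sched_core_put_cookie(unsigned long cookie)
    {
        struct sched_core_cookie *ptr =
            (struct sched_core_cookie *)cookie;

        if (ptr && refcount_dec_and_test(&ptr->refcnt))
            kfree(ptr);
    }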
-
Submitted by Peter Zijlstra

mainline inclusion
from mainline-v5.14-rc1
commit d2dfa17b
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5OOWG
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d2dfa17bc7de67e99685c4d6557837bf801a102c

--------------------------------------------------------------------------

When a sibling is forced-idle to match the core-cookie; search for
matching tasks to fill the core.

rcu_read_unlock() can incur an infrequent deadlock in
sched_core_balance(). Fix this by using the RCU-sched flavor instead.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.800048269@infradead.org
Signed-off-by: Lin Shengwang <linshengwang1@huawei.com>
Reviewed-by: lihua <hucool.lihua@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Peter Zijlstra

mainline inclusion
from mainline-v5.14-rc1
commit 8a311c74
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5OOWG
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8a311c740b53324ec584e0e3bb7077d56b123c28

--------------------------------------------------------------------------

Introduce task_struct::core_cookie as an opaque identifier for core
scheduling. When enabled; core scheduling will only allow matching
tasks to be on the core; where idle matches everything.

When task_struct::core_cookie is set (and core scheduling is enabled)
these tasks are indexed in a second RB-tree, first on cookie value
then on scheduling function, such that matching task selection always
finds the most eligible match.

NOTE: *shudder* at the overhead...

NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
is per class tracking of cookies and that just duplicates a lot of
stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).

[Joel: folded fixes]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.496975854@infradead.org
Signed-off-by: Lin Shengwang <linshengwang1@huawei.com>
Reviewed-by: lihua <hucool.lihua@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 23 May 2022, 1 commit
-
-
Submitted by Igor Pylypiv

stable inclusion
from stable-v5.10.102
commit de55891e162cac0ae058e05c2527fd32cc435ac0
bugzilla: https://gitee.com/openeuler/kernel/issues/I567K6
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=de55891e162cac0ae058e05c2527fd32cc435ac0

--------------------------------

[ Upstream commit 67d6212a ]

This reverts commit 774a1221.

We need to finish all async code before the module init sequence is
done. In the reverted commit the PF_USED_ASYNC flag was added to mark
a thread that called async_schedule(). Then the PF_USED_ASYNC flag
was used to determine whether or not async_synchronize_full() needs
to be invoked. This works when the modprobe thread is calling
async_schedule(), but it does not work if the module dispatches init
code to a worker thread which then calls async_schedule().

For example, PCI driver probing is invoked from a worker thread based
on the node where the device is attached:

    if (cpu < nr_cpu_ids)
        error = work_on_cpu(cpu, local_pci_probe, &ddi);
    else
        error = local_pci_probe(&ddi);

We end up in a situation where a worker thread gets the PF_USED_ASYNC
flag set instead of the modprobe thread. As a result,
async_synchronize_full() is not invoked and modprobe completes without
waiting for the async code to finish.

The issue was discovered while loading the pm80xx driver:
(scsi_mod.scan=async)

    modprobe pm80xx                      worker
    ...
      do_init_module()
      ...
        pci_call_probe()
          work_on_cpu(local_pci_probe)
                                         local_pci_probe()
                                           pm8001_pci_probe()
                                             scsi_scan_host()
                                               async_schedule()
                                               worker->flags |= PF_USED_ASYNC;
                                         ...
          < return from worker >
      ...
      if (current->flags & PF_USED_ASYNC) <--- false
          async_synchronize_full();

Commit 21c3c5d2 ("block: don't request module during elevator init")
fixed the deadlock issue which the reverted commit 774a1221 ("module,
async: async_synchronize_full() on module init iff async is used")
tried to fix.

Since commit 0fdff3ec ("async, kmod: warn on synchronous
request_module() from async workers") synchronous module loading from
async is not allowed.

Given that the original deadlock issue is fixed and it is no longer
allowed to call synchronous request_module() from async, we can remove
the PF_USED_ASYNC flag to make module init consistently invoke
async_synchronize_full() unless async module probe is requested.

Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Reviewed-by: Changyuan Lyu <changyuanl@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 10 May 2022, 2 commits
-
-
Submitted by Guan Jing

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I52611
CVE: NA

--------------------------------

We have added two statistics for the qos smt expeller:
a) nr_qos_smt_send_ipi: the number of IPIs with which online tasks
   expel offline tasks;
b) nr_qos_smt_expelled: the number of times an offline task is not
   picked.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Guan Jing

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I52611
CVE: NA

--------------------------------

We implement the qos smt expeller with the following two points:
a) when online tasks and offline tasks are running on the same
   physical cpu, the online tasks send an IPI to expel the offline
   tasks on the smt sibling cpus;
b) while online tasks are running, the smt sibling cpus do not allow
   offline tasks to be selected.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 23 February 2022, 1 commit
-
-
Submitted by Peng Wu

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
CVE: NA

------------------------------------------

Add a reliable flag for user tasks. A user task with the reliable flag
can only allocate memory from the mirrored region. PF_RELIABLE
represents the task's reliable flag.

- The init task is regarded as a special task which allocates memory
  from the mirrored region.

- For normal user tasks, the reliable flag can be set via the procfs
  interface shown below and is inherited via fork().

A user can change a user task's reliable flag by

    $ echo [0/1] > /proc/<pid>/reliable

and check a user task's reliable flag by

    $ cat /proc/<pid>/reliable

Note: the global init task's reliable file can not be accessed.

Signed-off-by: Peng Wu <wupeng58@huawei.com>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 29 January 2022, 2 commits
-
-
Submitted by Guan Jing

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4KAP1?from=project-issue
CVE: NA

-------------------------------

We reserve some fields beforehand for sched structures prone to
change; therefore, we can hot add/change sched features with this
enhancement. After reserving, cache behaviour normally does not matter
as the reserved fields are not accessed at all.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
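
A minimal sketch of the reservation pattern, assuming openEuler's
KABI_RESERVE() helpers; which sched structure gets how many slots is
illustrative:

    /* illustrative only */
    struct sched_entity {
        /* ... existing fields ... */

        KABI_RESERVE(1)
        KABI_RESERVE(2)
        KABI_RESERVE(3)
        KABI_RESERVE(4)
    };

Each KABI_RESERVE(n) expands to an unused 64-bit placeholder, so a
later feature can claim a slot without changing the structure size or
breaking the kABI.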
-
Submitted by Ard Biesheuvel

mainline inclusion
from mainline-v5.16-rc1
commit bcf9033e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4Q94A
CVE: NA

---------------------------

THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this
causes some issues on architectures that define raw_smp_processor_id()
in terms of this field, due to the fact that #include'ing
linux/sched.h to get at struct task_struct is problematic in terms of
circular dependencies.

Given that thread_info and task_struct are the same data structure
anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having
access to the type definition of struct thread_info is sufficient to
reference the CPU number of the current task.

Note that this requires THREAD_INFO_IN_TASK's definition of the
task_thread_info() helper to be updated, as task_cpu() takes a
pointer-to-const, whereas task_thread_info() (which is used to
generate lvalues as well) needs a non-const pointer. So make it a
macro instead.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Guan Jing <guanjing6@huawei.com>

Conflicts:
    arch/arm64/kernel/head.S

Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
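
A condensed sketch of the helper change, modeled on the upstream
commit (arch-specific context trimmed):

    /* The CPU number lives in thread_info again... */
    struct thread_info {
        /* ... */
        u32 cpu;    /* current CPU, valid with THREAD_INFO_IN_TASK */
    };

    /* ...and the helper becomes a macro: a macro is type-agnostic,
     * so it serves both task_cpu()'s pointer-to-const argument and
     * the callers that need an lvalue. */
    #define task_thread_info(task) (&(task)->thread_info)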
-
- 31 December 2021, 1 commit
-
-
Submitted by Guan Jing

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4KAP1?from=project-issue
CVE: NA

--------

We reserve some fields beforehand for sched structures prone to
change; therefore, we can hot add/change sched features with this
enhancement. After reserving, cache behaviour normally does not matter
as the reserved fields are not accessed at all.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 29 December 2021, 1 commit
-
-
Submitted by Marcelo Tosatti

mainline inclusion
from mainline-5.14-rc1
commit a1dfb631
category: feature
feature: Deep isolation
bugzilla: https://gitee.com/openeuler/kernel/issues/I4N00D
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a1dfb6311c7739e21e160bc4c5575a1b21b48c87

--------------------------------

When the tick dependency of a task is updated, we want it to
acknowledge the new state and restart the tick if needed. If the task
is not running, we don't need to kick it because it will observe the
new dependency upon scheduling in. But if the task is running, we may
need to send an IPI to it so that it gets notified.

Unfortunately we don't have the means to check if a task is running
in a race free way. Checking p->on_cpu in a synchronized way against
p->tick_dep_mask would imply adding a full barrier between
prepare_task_switch() and tick_nohz_task_switch(), which we want to
avoid in this fast-path.

Therefore we blindly fire an IPI to the task's CPU.

Meanwhile we can check if the task is queued on the CPU rq because
p->on_rq is always set to TASK_ON_RQ_QUEUED _before_ schedule() and
its full barrier that precedes tick_nohz_task_switch(). And if the
task is queued on a nohz_full CPU, it also has fair chances to be
running as the isolation constraints prescribe running single tasks
on full dynticks CPUs.

So use this as a trick to check if we can spare an IPI toward a
non-running task.

NOTE: For the ordering to be correct, it is assumed that we never
deactivate a task while it is running, the only exception being the
task deactivating itself while scheduling out.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512232924.150322-9-frederic@kernel.org
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Chao Liu <liuchao173@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
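
A condensed sketch of the resulting kick logic (modeled on the
upstream tick_nohz_kick_task(); exact helper names may differ by
version):

    static void tick_nohz_kick_task(struct task_struct *tsk)
    {
        int cpu;

        /*
         * Not queued means not running: the new dependency will be
         * observed when the task schedules in, so spare the IPI.
         */
        if (!READ_ONCE(tsk->on_rq))
            return;

        /*
         * Queued on a nohz_full CPU: likely running, so blindly kick
         * that CPU; a spurious IPI is harmless.
         */
        cpu = task_cpu(tsk);

        preempt_disable();
        if (cpu_online(cpu))
            tick_nohz_full_kick_cpu(cpu);
        preempt_enable();
    }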
-
- 10 December 2021, 1 commit
-
-
Submitted by Zheng Zucheng

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4LH1X
CVE: NA

--------------------------------

When online tasks occupy the cpu for a long time, offline tasks get no
cpu to run on, and a priority inversion issue may be triggered in this
case. If that occurs, we unthrottle offline tasks and give them a
chance to run. When online tasks have occupied the cpu for over 5s
(default value), we unthrottle offline tasks and enter an msleep loop
before exiting to usermode, until the cpu goes idle.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 15 November 2021, 2 commits
-
-
Submitted by Andrii Nakryiko

mainline inclusion
from mainline-v5.15-rc1
commit c7603cfa
category: bugfix
bugzilla: 182996 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c7603cfa04e7c3a435b31d065f7cbdc829428f6e

--------------------------------

b910eaaa ("bpf: Fix NULL pointer dereference in
bpf_get_local_storage() helper") fixed the problem with cgroup-local
storage use in BPF by pre-allocating a per-CPU array of 8 cgroup
storage pointers to accommodate possible BPF program preemptions and
nested executions.

While this seems to work well in practice, it introduces a new and
unnecessary failure mode in which not all BPF programs might be
executed if we fail to find an unused slot for cgroup storage, however
unlikely that is. It might also not be so unlikely when/if we allow
sleepable cgroup BPF programs in the future.

Further, the way that cgroup storage is implemented as an
ambiently-available property during entire BPF program execution is a
convenient way to pass extra information to BPF programs and helpers
without requiring user code to pass around extra arguments explicitly.
So it would be good to have a generic solution that can allow
implementing this without arbitrary restrictions. Ideally, such a
solution would work for both preemptable and sleepable BPF programs in
exactly the same way.

This patch introduces such a solution, bpf_run_ctx. It adds one
pointer field (bpf_ctx) to task_struct. This field is maintained by
the BPF_PROG_RUN family of macros in such a way that it always stays
valid throughout BPF program execution. BPF program preemption is
handled by remembering the previous current->bpf_ctx value locally
while executing a nested BPF program and restoring the old value
after the nested BPF program finishes. This is handled by two helper
functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
supposed to be used before and after BPF program runs, respectively.

Restoring the old value of the pointer handles preemption, while the
bpf_run_ctx pointer being a property of current task_struct naturally
solves this problem for sleepable BPF programs by "following" BPF
program execution as it is scheduled in and out of the CPU. It would
even allow CPU migration of BPF programs, even though that's not
currently allowed by BPF infra.

This patch cleans up cgroup-local storage handling as a first
application. The design itself is generic, though, with bpf_run_ctx
being an empty struct that is supposed to be embedded into a specific
struct for a given BPF program type (bpf_cg_run_ctx in this case).
Follow-up patches are planned that will expand this mechanism to
other uses within tracing BPF programs.

To verify that this change doesn't revert the fix to the original
cgroup storage issue, I ran the same repro as in the original report
([0]) and didn't get any problems. Replacing
bpf_reset_run_ctx(old_run_ctx) with bpf_reset_run_ctx(NULL) triggers
the issue pretty quickly (so the repro does work).

[0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/

Fixes: b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org

Conflicts:
    include/linux/bpf.h
    include/linux/sched.h
    kernel/fork.c
    net/bpf/test_run.c

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
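
A condensed sketch of the helper pair, modeled on the upstream
include/linux/bpf.h (the CONFIG_BPF_SYSCALL guards and BPF_PROG_RUN
plumbing are trimmed):

    struct bpf_run_ctx {};

    static inline struct bpf_run_ctx *
    bpf_set_run_ctx(struct bpf_run_ctx *new_ctx)
    {
        struct bpf_run_ctx *old_ctx = current->bpf_ctx;

        current->bpf_ctx = new_ctx;   /* nested program's context */
        return old_ctx;               /* remembered by the caller */
    }

    static inline void bpf_reset_run_ctx(struct bpf_run_ctx *old_ctx)
    {
        current->bpf_ctx = old_ctx;   /* restore on exit */
    }

A runner brackets program execution with the pair, so preemption by a
nested program and sleepable rescheduling both keep current->bpf_ctx
valid:

    old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
    ret = BPF_PROG_RUN(prog, ctx);
    bpf_reset_run_ctx(old_run_ctx);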
-
Submitted by Peter Zijlstra

stable inclusion
from stable-5.10.74
commit bb893f075431e97dd72cde0721957a73d26578a8
bugzilla: 182986 https://gitee.com/openeuler/kernel/issues/I4I3MG
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=bb893f075431e97dd72cde0721957a73d26578a8

--------------------------------

[ Upstream commit 83d40a61 ]

vmlinux.o: warning: objtool: check_preemption_disabled()+0x81: call to is_percpu_thread() leaves .noinstr.text section

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928084218.063371959@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 21 October 2021, 1 commit
-
-
Submitted by Tony Luck

stable inclusion
from stable-5.10.68
commit 619d747c1850bab61625ca9d8b4730f470a5947b
bugzilla: 182671 https://gitee.com/openeuler/kernel/issues/I4EWUH
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=619d747c1850bab61625ca9d8b4730f470a5947b

--------------------------------

commit 81065b35 upstream.

There are two cases for machine check recovery:

1) The machine check was triggered by ring3 (application) code. This
   is the simpler case. The machine check handler simply queues work
   to be executed on return to user. That code unmaps the page from
   all users and arranges to send a SIGBUS to the task that triggered
   the poison.

2) The machine check was triggered in kernel code that is covered by
   an exception table entry. In this case the machine check handler
   still queues a work entry to unmap the page, etc. but this will
   not be called right away because the #MC handler returns to the
   fix up code address in the exception table entry.

Problems occur if the kernel triggers another machine check before
the return to user processes the first queued work item.

Specifically, the work is queued using the ->mce_kill_me callback
structure in the task struct for the current thread. Attempting to
queue a second work item using this same callback results in a loop
in the linked list of work functions to call. So when the kernel does
return to user, it enters an infinite loop processing the same entry
for ever.

There are some legitimate scenarios where the kernel may take a
second machine check before returning to the user.

1) Some code (e.g. futex) first tries a get_user() with page faults
   disabled. If this fails, the code retries with page faults enabled
   expecting that this will resolve the page fault.

2) Copy from user code retries a copy in byte-at-time mode to check
   whether any additional bytes can be copied.

On the other side of the fence are some bad drivers that do not check
the return value from individual get_user() calls and may access
multiple user addresses without noticing that some/all calls have
failed.

Fix by adding a counter (current->mce_count) to keep track of repeated
machine checks before task_work() is called. First machine check saves
the address information and calls task_work_add(). Subsequent machine
checks before that task_work call back is executed check that the
address is in the same page as the first machine check (since the
callback will offline exactly one page).

Expected worst case is four machine checks before moving on (e.g. one
user access with page faults disabled, then a repeat to the same
address with page faults enabled ... repeat in copy tail bytes). Just
in case there is some code that loops forever enforce a limit of 10.

[ bp: Massage commit message, drop noinstr, fix typo, extend panic
  messages. ]

Fixes: 5567d11c ("x86/mce: Send #MC singal from task work")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
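
A condensed sketch of the counting logic, modeled on the upstream
queue_task_work() (setup of the callback details trimmed):

    static void queue_task_work(struct mce *m, char *msg,
                                int kill_current_task)
    {
        int count = ++current->mce_count;

        /* First call: save the details and queue the task_work. */
        if (count == 1) {
            current->mce_addr = m->addr;
            /* ... save status/vaddr, set mce_kill_me.func ... */
        }

        /* Ten is likely overkill; don't expect more than two faults
         * before task_work() runs. */
        if (count > 10)
            mce_panic("Too many consecutive machine checks while accessing user data",
                      m, msg);

        /* Later calls must hit the same page as the first one, since
         * the callback will offline exactly one page. */
        if (count > 1 && (current->mce_addr >> PAGE_SHIFT) !=
                         (m->addr >> PAGE_SHIFT))
            mce_panic("Consecutive machine checks to different user pages",
                      m, msg);

        /* Do not call task_work_add() more than once. */
        if (count > 1)
            return;

        task_work_add(current, &current->mce_kill_me, TWA_RESUME);
    }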
-
- 28 July 2021, 1 commit
-
-
Submitted by Peter Zijlstra (Intel)

mainline inclusion
from mainline-5.12-rc1
commit b965f1dd
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I410UT
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b965f1ddb47daa5b8b2e2bc9c921431236830367

---------------------------

Provide static calls to control cond_resched() (called in
!CONFIG_PREEMPT) and might_resched() (called in
CONFIG_PREEMPT_VOLUNTARY) so that we can override their behaviour when
preempt= is overridden.

Since the default behaviour is full preemption, both their calls are
ignored when preempt= isn't passed.

[fweisbec: branch might_resched() directly to __cond_resched(), only
 define static calls when PREEMPT_DYNAMIC]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-6-frederic@kernel.org
Signed-off-by: Ma Junhai <majunhai2@huawei.com>

Conflicts:
    include/linux/kernel.h

Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
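
A condensed sketch of the mechanism, modeled on the upstream
PREEMPT_DYNAMIC code (the exact macros vary between kernel versions):

    /* kernel/sched/core.c: default the call to a no-op returning 0,
     * since full preemption needs no voluntary resched points. */
    DEFINE_STATIC_CALL_RET0(cond_resched, __cond_resched);
    EXPORT_STATIC_CALL(cond_resched);

    /* include/linux/sched.h: callers go through the static call,
     * which "preempt=voluntary/none" can later repoint at
     * __cond_resched(). */
    DECLARE_STATIC_CALL(cond_resched, __cond_resched);

    static __always_inline int _cond_resched(void)
    {
        return static_call(cond_resched)();
    }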
-
- 14 July 2021, 1 commit
-
-
Submitted by Zheng Zucheng

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZX4D
CVE: NA

--------------------------------

If online tasks occupy 100% of the CPU resources, offline tasks can't
be scheduled since they are throttled; as a result, an offline task
can't respond in time after receiving a SIGKILL signal.

Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-