- 31 October 2022, 2 commits
-
-
Committed by briansun
openeuler inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4L9RU
CVE: NA
Reference: https://lore.kernel.org/lkml/20200722234538.166697-2-posk@posk.io/

-------------------

In some scenarios, we need to run several low-thrashing threads together which act as logical operations, such as PV operations. Such a thread repeatedly falls asleep and wakes another thread up, and each switch requires the kernel to pay several scheduling-related costs (selecting the proper core, waking the task up, enqueuing the task, marking the task's scheduling flag, picking the task at the proper time, dequeuing the task and doing the context switch). These overheads are not acceptable for the low-thrashing threads. Therefore, we need a mechanism that avoids the unnecessary overhead and swaps threads directly without affecting the fairness of CFS tasks. To achieve this goal, we implemented the direct-thread-switch (DTS) mechanism based on the futex_swap patch referenced above, which switches to the DTS task directly using a shared scheduling entity. We also ensured that the kernel basically stays secure and consistent.

Signed-off-by: Zhi Song <hizhisong@gmail.com>
-
Committed by Peter Oskolkov
openeuler inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4L9RU
CVE: NA

-------------------

As described in the previous patch in this patchset ("futex: introduce FUTEX_SWAP operation"), it is often beneficial to wake a task and run it on the same CPU where the current, going-to-sleep task is running.

Internally at Google, the switchto_switch syscall not only migrates the wakee to the current CPU, but also moves the waker's load stats to the wakee, thus ensuring that the migration to the current CPU does not interfere with load balancing. switchto_switch also does the context switch into the wakee, bypassing schedule().

This patchset does not go that far yet; it simply migrates the wakee to the current CPU and calls schedule(). In follow-up patches I will try to fine-tune the behavior by adjusting load stats and schedule(): our internal switchto_switch is still about 2x faster than FUTEX_SWAP (see numbers below).

And now about performance: the futex_swap benchmark from the last patch in this patchset produces this typical output:

$ ./futex_swap -i 100000

------- running SWAP_WAKE_WAIT -----------
completed 100000 swap and back iterations in 820683263 ns: 4103 ns per swap
PASS

------- running SWAP_SWAP -----------
completed 100000 swap and back iterations in 124034476 ns: 620 ns per swap
PASS

In the above, the first benchmark (SWAP_WAKE_WAIT) calls FUTEX_WAKE, then FUTEX_WAIT; the second benchmark (SWAP_SWAP) calls FUTEX_SWAP. If the benchmark is restricted to a single cpu:

$ taskset -c 1 ./futex_swap -i 1000000

the numbers are very similar, as expected (with wake+wait being a bit slower than swap due to two vs one syscalls).

Please also note that switchto_switch is about 2x faster than FUTEX_SWAP because it does a context switch to the wakee immediately, bypassing schedule(), so this is one of the options I'll explore in further patches (if/when this initial patchset is accepted).

Tested: see the last patch in this patchset.

Signed-off-by: Peter Oskolkov <posk@google.com>
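For orientation, below is a minimal user-space sketch of the two-syscall SWAP_WAKE_WAIT baseline mentioned above, using only the mainline FUTEX_WAKE and FUTEX_WAIT operations; the proposed FUTEX_SWAP (not in mainline) collapses the wake and the wait into one call and migrates the wakee. The wrapper and futex-word layout are illustrative assumptions, not code from the patchset.

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdint.h>
	#include <time.h>

	/* Raw futex(2) wrapper; glibc does not provide one. */
	static long futex(uint32_t *uaddr, int op, uint32_t val,
			  const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
	{
		return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
	}

	/*
	 * Baseline "swap" between two cooperating threads: wake the partner
	 * blocked on *wake_word, then go to sleep on *wait_word. This is the
	 * FUTEX_WAKE + FUTEX_WAIT pair that the proposed FUTEX_SWAP operation
	 * would replace with a single syscall that also migrates the wakee
	 * to the waker's CPU.
	 */
	static void swap_wake_wait(uint32_t *wake_word, uint32_t *wait_word, uint32_t expected)
	{
		futex(wake_word, FUTEX_WAKE, 1, NULL, NULL, 0);
		futex(wait_word, FUTEX_WAIT, expected, NULL, NULL, 0);
	}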
-
- 28 October 2022, 1 commit
-
-
Committed by Yu Zhao
mainline inclusion
from mainline-v6.1-rc1
commit ec1c86b2
category: feature
bugzilla: https://gitee.com/openeuler/open-source-summer/issues/I55Z0L
CVE: NA
Reference: https://android-review.googlesource.com/c/kernel/common/+/2050910/10

----------------------------------------------------------------------

Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in page->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0.

There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These features are required to optimize job scheduling (bin packing) in data centers. The variable size of the sliding window is designed for such use cases [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking the rendering threads.

There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults".

Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lore.kernel.org/r/20220309021230.721028-6-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512
Signed-off-by: YuLinjia <3110442349@qq.com>
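As a rough illustration of the generation bookkeeping described above, here is a minimal user-space model of how a monotonically increasing sequence number is truncated into an index into lrugen->lists[]; the constants mirror MIN_NR_GENS/MAX_NR_GENS from the patchset, but this is a simplified sketch, not the actual mm code.

	#include <stdio.h>

	/* Simplified model of the multi-gen LRU sequence bookkeeping. */
	#define MIN_NR_GENS 2UL	/* "second chance" needs at least two generations */
	#define MAX_NR_GENS 4UL	/* the sliding window tracks at most four generations */

	/* Truncate a monotonically increasing seq into a list index. */
	static unsigned long lru_gen_from_seq(unsigned long seq)
	{
		return seq % MAX_NR_GENS;
	}

	int main(void)
	{
		unsigned long max_seq = 9;	/* youngest generation */
		unsigned long min_seq = 7;	/* oldest generation for one type */

		/* Aging produces a new youngest generation; eviction consumes the oldest. */
		printf("youngest list index: %lu\n", lru_gen_from_seq(max_seq)); /* 1 */
		printf("oldest list index:   %lu\n", lru_gen_from_seq(min_seq)); /* 3 */
		printf("nr generations:      %lu\n", max_seq - min_seq + 1);     /* 3 */
		return 0;
	}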
-
- 08 October 2022, 1 commit
-
-
Committed by weiqingv
ECNU inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5R8DS

--------------------------------

Construct a new tool for lightweight lock tracing. In this commit, the basic data structures and hook points are similar to Lockdep's. Various lock instances are mapped to lite lock classes. The initialization, acquisition and release of lite lock classes are hooked to obtain lock information. The held locks of each task_struct are dynamically recorded. When running into abnormal cases such as hung tasks, the recorded lock state can be dumped. Unlike Lockdep, locks are only recorded, without context coupling or circular dependency checks, which leads to lower overhead. For now, mutexes, spinlocks and rwsems are supported.

Signed-off-by: weiqingv <709088312@qq.com>
-
- 22 August 2022, 5 commits
-
-
Committed by Chen Hui
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add three hooks of sched type in select_task_rq_fair(), as follows:

'cfs_select_rq'
    Replaces the original core-selection policy or implements dynamic CPU affinity.

'cfs_select_rq_exit'
    Restores the CPU affinity of the task before 'select_task_rq_fair' exits. To be used together with the 'cfs_select_rq' hook to implement dynamic CPU affinity.

'cfs_wake_affine'
    Determines on which CPU the task can run soonest. Allows users to implement different policies.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
-
Committed by Chen Hui
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add a collection of cpumask ops, such as cpumask_empty, cpumask_and, cpumask_andnot, cpumask_subset, cpumask_equal and cpumask_copy.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
-
Committed by Chen Hui
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add four helper functions to get CPU statistics, as follows:
1. Acquire cfs/rt/irq CPU load statistics.
2. Acquire multiple types of nr_running statistics.
3. Acquire CPU idle statistics.
4. Acquire CPU capacity.

Based on CPU statistics in different dimensions, specific scheduling policies can be implemented in a bpf program.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
-
Committed by Chen Hui
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add a tag for the task, useful to identify a special task. Users can use the file system interface to mark different tags for specific workloads. Kernel subsystems can use the set_* helpers to mark it too. A bpf prog obtains the tags to detect different workloads.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
-
Committed by Chen Hui
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add a user interface for the task group tag, which bridges the information gap between user mode and kernel mode.

Signed-off-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
-
- 18 August 2022, 1 commit
-
-
Committed by Tong Tiangen
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5GB28
CVE: NA

-------------------------------

In dump_user_range() processing, the data of the user process is dumped to the corefile. When a hardware memory error is encountered during the dump, only the relevant processes are affected, so killing the user process and isolating the user page with hardware memory errors is a more reasonable choice than a kernel panic.

The typical usage scenario of dump_user_range() is coredump. Writing the coredump file to a filesystem depends on the specific implementation of the filesystem's write_iter operation. This patch only supports two typical filesystem write functions (_copy_from_iter/iov_iter_copy_from_user_atomic), which are used by ext4/tmpfs/pipefs.

Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
-
- 23 May 2022, 1 commit
-
-
Committed by Igor Pylypiv
stable inclusion
from stable-v5.10.102
commit de55891e162cac0ae058e05c2527fd32cc435ac0
bugzilla: https://gitee.com/openeuler/kernel/issues/I567K6
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=de55891e162cac0ae058e05c2527fd32cc435ac0

--------------------------------

[ Upstream commit 67d6212a ]

This reverts commit 774a1221.

We need to finish all async code before the module init sequence is done. In the reverted commit the PF_USED_ASYNC flag was added to mark a thread that called async_schedule(). Then the PF_USED_ASYNC flag was used to determine whether or not async_synchronize_full() needs to be invoked. This works when the modprobe thread is calling async_schedule(), but it does not work if the module dispatches init code to a worker thread which then calls async_schedule().

For example, PCI driver probing is invoked from a worker thread based on a node where device is attached:

	if (cpu < nr_cpu_ids)
		error = work_on_cpu(cpu, local_pci_probe, &ddi);
	else
		error = local_pci_probe(&ddi);

We end up in a situation where a worker thread gets the PF_USED_ASYNC flag set instead of the modprobe thread. As a result, async_synchronize_full() is not invoked and modprobe completes without waiting for the async code to finish.

The issue was discovered while loading the pm80xx driver (scsi_mod.scan=async):

modprobe pm80xx                      worker
...
  do_init_module()
  ...
    pci_call_probe()
      work_on_cpu(local_pci_probe)
                                     local_pci_probe()
                                       pm8001_pci_probe()
                                         scsi_scan_host()
                                           async_schedule()
                                           worker->flags |= PF_USED_ASYNC;
                                     ...
      < return from worker >
...
if (current->flags & PF_USED_ASYNC) <--- false
  async_synchronize_full();

Commit 21c3c5d2 ("block: don't request module during elevator init") fixed the deadlock issue which the reverted commit 774a1221 ("module, async: async_synchronize_full() on module init iff async is used") tried to fix.

Since commit 0fdff3ec ("async, kmod: warn on synchronous request_module() from async workers") synchronous module loading from async is not allowed.

Given that the original deadlock issue is fixed and it is no longer allowed to call synchronous request_module() from async we can remove the PF_USED_ASYNC flag to make module init consistently invoke async_synchronize_full() unless async module probe is requested.

Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Reviewed-by: Changyuan Lyu <changyuanl@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 10 May 2022, 2 commits
-
-
Committed by Guan Jing
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I52611
CVE: NA

--------------------------------

We have added two statistics for the qos smt expeller:
a) nr_qos_smt_send_ipi: the number of times online tasks send an IPI to expel offline tasks;
b) nr_qos_smt_expelled: the number of times an offline task is not picked.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Guan Jing
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I52611
CVE: NA

--------------------------------

We implement the qos smt expeller function with the following two points:
a) when online tasks and offline tasks are running on the same physical cpu, the online tasks send an IPI to expel offline tasks on the smt sibling cpus;
b) when online tasks are running, the smt sibling cpus do not allow offline tasks to be selected.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 23 February 2022, 1 commit
-
-
Committed by Peng Wu
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
CVE: NA

------------------------------------------

Add a reliable flag for user tasks. A user task with the reliable flag can only allocate memory from the mirrored region. PF_RELIABLE is added to represent the task's reliable flag.

- The init task is regarded as a special task which allocates memory from the mirrored region.

- For normal user tasks, the reliable flag can be set via the procfs interface shown below and is inherited via fork().

Users can change a user task's reliable flag with

  $ echo [0/1] > /proc/<pid>/reliable

and check a user task's reliable flag with

  $ cat /proc/<pid>/reliable

Note: the global init task's reliable file cannot be accessed.

Signed-off-by: Peng Wu <wupeng58@huawei.com>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
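A minimal user-space sketch of driving the procfs interface described above; it simply writes "1"/"0" to /proc/<pid>/reliable and assumes an openEuler kernel with the memory-reliable feature enabled (plain mainline kernels do not have this file).

	#include <stdio.h>
	#include <stdbool.h>
	#include <unistd.h>
	#include <sys/types.h>

	/* Sketch: mark a task as "reliable" (mirrored-memory only) via procfs. */
	static bool set_task_reliable(pid_t pid, bool reliable)
	{
		char path[64];
		FILE *f;
		int ret;

		snprintf(path, sizeof(path), "/proc/%d/reliable", pid);
		f = fopen(path, "w");
		if (!f)
			return false;
		ret = fputs(reliable ? "1" : "0", f);
		fclose(f);
		return ret >= 0;
	}

	int main(void)
	{
		/* Example: make the current process allocate from the mirrored region only. */
		return set_task_reliable(getpid(), true) ? 0 : 1;
	}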
-
- 29 January 2022, 2 commits
-
-
Committed by Guan Jing
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4KAP1?from=project-issue
CVE: NA

-------------------------------

We reserve some fields beforehand for sched structures prone to change, so that we can hot add/change sched features with this enhancement. After reserving, the cache normally does not matter, as the reserved fields are not accessed at all.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Ard Biesheuvel
mainline inclusion
from mainline-v5.16-rc1
commit bcf9033e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4Q94A
CVE: NA

---------------------------

THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this causes some issues on architectures that define raw_smp_processor_id() in terms of this field, due to the fact that #include'ing linux/sched.h to get at struct task_struct is problematic in terms of circular dependencies.

Given that thread_info and task_struct are the same data structure anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having access to the type definition of struct thread_info is sufficient to reference the CPU number of the current task.

Note that this requires THREAD_INFO_IN_TASK's definition of the task_thread_info() helper to be updated, as task_cpu() takes a pointer-to-const, whereas task_thread_info() (which is used to generate lvalues as well) needs a non-const pointer. So make it a macro instead.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Guan Jing <guanjing6@huawei.com>
Conflicts:
	arch/arm64/kernel/head.S
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 31 December 2021, 1 commit
-
-
Committed by Guan Jing
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4KAP1?from=project-issue
CVE: NA

--------

We reserve some fields beforehand for sched structures prone to change, so that we can hot add/change sched features with this enhancement. After reserving, the cache normally does not matter, as the reserved fields are not accessed at all.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 29 December 2021, 1 commit
-
-
Committed by Marcelo Tosatti
mainline inclusion
from mainline-5.14-rc1
commit a1dfb631
category: feature
feature: Deep isolation
bugzilla: https://gitee.com/openeuler/kernel/issues/I4N00D
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a1dfb6311c7739e21e160bc4c5575a1b21b48c87

--------------------------------

When the tick dependency of a task is updated, we want it to acknowledge the new state and restart the tick if needed. If the task is not running, we don't need to kick it because it will observe the new dependency upon scheduling in. But if the task is running, we may need to send an IPI to it so that it gets notified.

Unfortunately we don't have the means to check if a task is running in a race-free way. Checking p->on_cpu in a synchronized way against p->tick_dep_mask would imply adding a full barrier between prepare_task_switch() and tick_nohz_task_switch(), which we want to avoid in this fast path.

Therefore we blindly fire an IPI to the task's CPU.

Meanwhile we can check if the task is queued on the CPU rq because p->on_rq is always set to TASK_ON_RQ_QUEUED _before_ schedule() and its full barrier that precedes tick_nohz_task_switch(). And if the task is queued on a nohz_full CPU, it also has fair chances to be running, as the isolation constraints prescribe running single tasks on full dynticks CPUs.

So use this as a trick to check if we can spare an IPI toward a non-running task.

NOTE: For the ordering to be correct, it is assumed that we never deactivate a task while it is running, the only exception being the task deactivating itself while scheduling out.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512232924.150322-9-frederic@kernel.org
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Chao Liu <liuchao173@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 10 December 2021, 1 commit
-
-
Committed by Zheng Zucheng
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4LH1X
CVE: NA

--------------------------------

When online tasks occupy the cpu for a long time, offline tasks get no cpu time to run, and a priority inversion issue may be triggered in this case. If that happens, we unthrottle the offline tasks and let them get a chance to run. When online tasks have occupied the cpu for over 5s (default value), we unthrottle the offline tasks, which then enter an msleep loop before exiting to user mode until the cpu goes idle.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 15 November 2021, 2 commits
-
-
Committed by Andrii Nakryiko
mainline inclusion
from mainline-v5.15-rc1
commit c7603cfa
category: bugfix
bugzilla: 182996 https://gitee.com/openeuler/kernel/issues/I4DDEL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c7603cfa04e7c3a435b31d065f7cbdc829428f6e

--------------------------------

b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") fixed the problem with cgroup-local storage use in BPF by pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate possible BPF program preemptions and nested executions.

While this seems to work good in practice, it introduces new and unnecessary failure mode in which not all BPF programs might be executed if we fail to find an unused slot for cgroup storage, however unlikely it is. It might also not be so unlikely when/if we allow sleepable cgroup BPF programs in the future.

Further, the way that cgroup storage is implemented as ambiently-available property during entire BPF program execution is a convenient way to pass extra information to BPF program and helpers without requiring user code to pass around extra arguments explicitly. So it would be good to have a generic solution that can allow implementing this without arbitrary restrictions. Ideally, such solution would work for both preemptable and sleepable BPF programs in exactly the same way.

This patch introduces such solution, bpf_run_ctx. It adds one pointer field (bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of macros in such a way that it always stays valid throughout BPF program execution. BPF program preemption is handled by remembering previous current->bpf_ctx value locally while executing nested BPF program and restoring old value after nested BPF program finishes. This is handled by two helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are supposed to be used before and after BPF program runs, respectively.

Restoring old value of the pointer handles preemption, while bpf_run_ctx pointer being a property of current task_struct naturally solves this problem for sleepable BPF programs by "following" BPF program execution as it is scheduled in and out of CPU. It would even allow CPU migration of BPF programs, even though it's not currently allowed by BPF infra.

This patch cleans up cgroup local storage handling as a first application. The design itself is generic, though, with bpf_run_ctx being an empty struct that is supposed to be embedded into a specific struct for a given BPF program type (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand this mechanism for other uses within tracing BPF programs.

To verify that this change doesn't revert the fix to the original cgroup storage issue, I ran the same repro as in the original report ([0]) and didn't get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).

[0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/

Fixes: b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
Conflicts:
	include/linux/bpf.h
	include/linux/sched.h
	kernel/fork.c
	net/bpf/test_run.c
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
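A rough kernel-context sketch of the set/reset pattern described above; the enclosing function and the cgroup-storage details are simplified assumptions, not the literal BPF_PROG_RUN_ARRAY machinery.

	/* Sketch: publish a per-run context, run the program, restore the outer one. */
	static void run_one_cgroup_prog_sketch(const struct bpf_prog *prog, void *ctx)
	{
		struct bpf_cg_run_ctx run_ctx = {};
		struct bpf_run_ctx *old_run_ctx;

		/* Remember the previous context so a nested or preempting
		 * BPF program run can be unwound correctly. */
		old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);

		bpf_prog_run(prog, ctx);	/* helpers reach run_ctx via current->bpf_ctx */

		/* Restore: current->bpf_ctx points at the outer context again. */
		bpf_reset_run_ctx(old_run_ctx);
	}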
-
Committed by Peter Zijlstra
stable inclusion
from stable-5.10.74
commit bb893f075431e97dd72cde0721957a73d26578a8
bugzilla: 182986 https://gitee.com/openeuler/kernel/issues/I4I3MG
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=bb893f075431e97dd72cde0721957a73d26578a8

--------------------------------

[ Upstream commit 83d40a61 ]

vmlinux.o: warning: objtool: check_preemption_disabled()+0x81: call to is_percpu_thread() leaves .noinstr.text section

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928084218.063371959@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 21 October 2021, 1 commit
-
-
Committed by Tony Luck
stable inclusion
from stable-5.10.68
commit 619d747c1850bab61625ca9d8b4730f470a5947b
bugzilla: 182671 https://gitee.com/openeuler/kernel/issues/I4EWUH
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=619d747c1850bab61625ca9d8b4730f470a5947b

--------------------------------

commit 81065b35 upstream.

There are two cases for machine check recovery:

1) The machine check was triggered by ring3 (application) code. This is the simpler case. The machine check handler simply queues work to be executed on return to user. That code unmaps the page from all users and arranges to send a SIGBUS to the task that triggered the poison.

2) The machine check was triggered in kernel code that is covered by an exception table entry. In this case the machine check handler still queues a work entry to unmap the page, etc. but this will not be called right away because the #MC handler returns to the fix up code address in the exception table entry.

Problems occur if the kernel triggers another machine check before the return to user processes the first queued work item.

Specifically, the work is queued using the ->mce_kill_me callback structure in the task struct for the current thread. Attempting to queue a second work item using this same callback results in a loop in the linked list of work functions to call. So when the kernel does return to user, it enters an infinite loop processing the same entry for ever.

There are some legitimate scenarios where the kernel may take a second machine check before returning to the user.

1) Some code (e.g. futex) first tries a get_user() with page faults disabled. If this fails, the code retries with page faults enabled expecting that this will resolve the page fault.

2) Copy from user code retries a copy in byte-at-time mode to check whether any additional bytes can be copied.

On the other side of the fence are some bad drivers that do not check the return value from individual get_user() calls and may access multiple user addresses without noticing that some/all calls have failed.

Fix by adding a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. First machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work call back is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page).

Expected worst case is four machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same address with page faults enabled ... repeat in copy tail bytes). Just in case there is some code that loops forever enforce a limit of 10.

[ bp: Massage commit message, drop noinstr, fix typo, extend panic messages. ]

Fixes: 5567d11c ("x86/mce: Send #MC singal from task work")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
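A compressed kernel-context sketch of the bounded-retry bookkeeping described above (save state on the first #MC, tolerate repeats to the same page, panic past a limit); this is a paraphrase of the described logic, not the literal arch/x86 mce code.

	/* Sketch: queue the recovery task_work once; allow repeats to the same page. */
	static void queue_task_work_sketch(struct mce *m, void (*func)(struct callback_head *))
	{
		int count = ++current->mce_count;

		if (count == 1) {
			/* First machine check: remember what to clean up on return to user. */
			current->mce_addr = m->addr;
			current->mce_kflags = m->kflags;
			current->mce_kill_me.func = func;
			task_work_add(current, &current->mce_kill_me, TWA_RESUME);
			return;
		}

		/* Repeats must stay within the one page the callback will offline. */
		if ((current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
			panic("Machine checks to different user pages");

		/* Guard against code that loops on the poisoned access forever. */
		if (count > 10)
			panic("Too many consecutive machine checks");
	}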
-
- 28 July 2021, 1 commit
-
-
Committed by Peter Zijlstra (Intel)
mainline inclusion
from mainline-5.12-rc1
commit b965f1dd
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I410UT
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b965f1ddb47daa5b8b2e2bc9c921431236830367

---------------------------

Provide static calls to control cond_resched() (called in !CONFIG_PREEMPT) and might_resched() (called in CONFIG_PREEMPT_VOLUNTARY) so that we can override their behaviour when preempt= is overridden.

Since the default behaviour is full preemption, both their calls are ignored when preempt= isn't passed.

[fweisbec: branch might_resched() directly to __cond_resched(), only define static calls when PREEMPT_DYNAMIC]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210118141223.123667-6-frederic@kernel.org
Signed-off-by: Ma Junhai <majunhai2@huawei.com>
Conflicts:
	include/linux/kernel.h
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 14 July 2021, 1 commit
-
-
Committed by Zheng Zucheng
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZX4D
CVE: NA

--------------------------------

If online tasks occupy 100% of the CPU resources, offline tasks can't be scheduled since offline tasks are throttled; as a result, an offline task can't respond in time after receiving a SIGKILL signal.

Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 03 July 2021, 1 commit
-
-
Committed by Dietmar Eggemann
stable inclusion
from stable-5.10.44
commit 190a7f908993cc7f5e8bbe67aa88eeb0bdbbbaa5
bugzilla: 109295
CVE: NA

--------------------------------

commit 68d7a190 upstream.

The util_est internal UTIL_AVG_UNCHANGED flag, which is used to prevent unnecessary util_est updates, uses the LSB of util_est.enqueued. It is exposed via _task_util_est() (and task_util_est()).

Commit 92a801e5 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages") mentions that the LSB is lost for util_est resolution, but find_energy_efficient_cpu() checks if task_util_est() returns 0 to return prev_cpu early.

_task_util_est() returns the max value of util_est.ewma and util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED. So task_util_est() returning the max of task_util() and _task_util_est() will never return 0 under the default SCHED_FEAT(UTIL_EST, true).

To fix this use the MSB of util_est.enqueued instead and keep the flag util_est internal, i.e. don't export it via _task_util_est(). The maximal possible util_avg value for a task is 1024, so the MSB of 'unsigned int util_est.enqueued' isn't used to store a util value.

As a caveat, the code behind the util_est_se trace point has to filter UTIL_AVG_UNCHANGED to see the real util_est.enqueued value, which should be easy to do.

This also fixes an issue reported by Xuewen Yan that util_est_update() only used UTIL_AVG_UNCHANGED for the subtrahend of the equation:

	last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED)

Fixes: b89997aa ("sched/pelt: Fix task util_est update filtering")
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
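To make the LSB-to-MSB move concrete, here is a small illustrative sketch (not the actual kernel definitions): a task utilization never exceeds 1024, so the top bit of a 32-bit enqueued value is free to carry the "unchanged" flag, and readers simply mask it off.

	#include <stdio.h>

	/* Illustrative: the flag lives in the MSB, which a util value (<= 1024) never uses. */
	#define UTIL_AVG_UNCHANGED 0x80000000u

	static unsigned int read_util(unsigned int enqueued)
	{
		return enqueued & ~UTIL_AVG_UNCHANGED;	/* strip the internal flag */
	}

	int main(void)
	{
		unsigned int enqueued = 0u | UTIL_AVG_UNCHANGED;	/* util 0, flagged unchanged */

		/* With the old LSB scheme, util 0 would have read back as 1, so the
		 * "task_util_est() == 0" check could never hit. */
		printf("stored: %#x, util seen by readers: %u\n", enqueued, read_util(enqueued));
		return 0;
	}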
-
- 09 April 2021, 1 commit
-
-
Committed by Yang Yingliang
hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

Build the basic framework for MPAM.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 07 January 2021, 1 commit
-
-
Committed by Yury Norov
maillist inclusion
category: feature
bugzilla: 46790
CVE: NA
Reference: https://github.com/norov/linux/commits/ilp32-5.2

--------------------------------

Thread bits may be accessed from low-level code, so isolating them is a measure to avoid circular dependencies in header files. The exact reason for the circular dependency is the WARN_ON() macro added in patch edd63a27 "set_restore_sigmask() is never called without SIGPENDING (and never should be)".

Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
Signed-off-by: Yury Norov <ynorov@marvell.com>
Conflicts:
	include/linux/sched.h
[wangxiongfeng: small context conflicts because of inclusion of header file]
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
-
- 17 November 2020, 2 commits
-
-
Committed by Juri Lelli
Glenn reported that "an application [he developed produces] a BUG in deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS task that was boosted by a SCHED_DEADLINE task boosts another CFS task (nested priority inheritance).

 ------------[ cut here ]------------
 kernel BUG at kernel/sched/deadline.c:1462!
 invalid opcode: 0000 [#1] PREEMPT SMP
 CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
 Hardware name: ...
 RIP: 0010:enqueue_task_dl+0x335/0x910
 Code: ...
 RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
 FS:  00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  ? intel_pstate_update_util_hwp+0x13/0x170
  rt_mutex_setprio+0x1cc/0x4b0
  task_blocks_on_rt_mutex+0x225/0x260
  rt_spin_lock_slowlock_locked+0xab/0x2d0
  rt_spin_lock_slowlock+0x50/0x80
  hrtimer_grab_expiry_lock+0x20/0x30
  hrtimer_cancel+0x13/0x30
  do_nanosleep+0xa0/0x150
  hrtimer_nanosleep+0xe1/0x230
  ? __hrtimer_init_sleeper+0x60/0x60
  __x64_sys_nanosleep+0x8d/0xa0
  do_syscall_64+0x4a/0x100
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fa58b52330d
 ...
 ---[ end trace 0000000000000002 ]---

He also provided a simple reproducer creating the situation below:

 So the execution order of locking steps are the following
 (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
 are mutexes that are enabled * with priority inheritance.)

 Time moves forward as this timeline goes down:

 N1              N2               D1
 |               |                |
 |               |                |
 Lock(M1)        |                |
 |               |                |
 |               Lock(M2)         |
 |               |                |
 |               |                Lock(M2)
 |               |                |
 |               Lock(M1)         |
 |               | (!!bug triggered!)

Daniel reported a similar situation as well, by just letting ksoftirqd run with DEADLINE (and eventually block on a mutex).

Problem is that boosted entities (Priority Inheritance) use static DEADLINE parameters of the top priority waiter. However, there might be cases where top waiter could be a non-DEADLINE entity that is currently boosted by a DEADLINE entity from a different lock chain (i.e., nested priority chains involving entities of non-DEADLINE classes). In this case, top waiter static DEADLINE parameters could be null (initialized to 0 at fork()) and replenish_dl_entity() would hit a BUG().

Fix this by keeping track of the original donor and using its parameters when a task is boosted.

Reported-by: Glenn Elliott <glenn@aurora.tech>
Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
-
Committed by Peter Zijlstra
Mel reported that on some ARM64 platforms loadavg goes bananas and Will tracked it down to the following race:

	CPU0				CPU1

	schedule()
	  prev->sched_contributes_to_load = X;
	  deactivate_task(prev);

					try_to_wake_up()
					  if (p->on_rq &&) // false
					  if (smp_load_acquire(&p->on_cpu) && // true
					      ttwu_queue_wakelist())
					        p->sched_remote_wakeup = Y;

	  smp_store_release(prev->on_cpu, 0);

where both p->sched_contributes_to_load and p->sched_remote_wakeup are in the same word, and thus the stores X and Y race (and can clobber one another's data).

Whereas prior to commit c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu") the p->on_cpu handoff serialized access to p->sched_remote_wakeup (just as it still does with p->sched_contributes_to_load), that commit broke that by calling ttwu_queue_wakelist() with p->on_cpu != 0.

However, due to

	p->XXX = X			ttwu()
	schedule()			  if (p->on_rq && ...) // false
	  smp_mb__after_spinlock()	  if (smp_load_acquire(&p->on_cpu) &&
	  deactivate_task()		      ttwu_queue_wakelist())
	    p->on_rq = 0;		        p->sched_remote_wakeup = Y;

we can be sure any 'current' store is complete and 'current' is guaranteed asleep. Therefore we can move p->sched_remote_wakeup into the current flags word.

Note: while the observed failure was loadavg accounting gone wrong due to ttwu() clobbering p->sched_contributes_to_load, the reverse problem is also possible where schedule() clobbers p->sched_remote_wakeup; this could result in enqueue_entity() wrecking ->vruntime and causing scheduling artifacts.

Fixes: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Debugged-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117083016.GK3121392@hirez.programming.kicks-ass.net
-
- 17 October 2020, 1 commit
-
-
Committed by Elena Petrova
The in_ubsan field of task_struct is only used in lib/ubsan.c, which in its turn is used only `ifneq ($(CONFIG_UBSAN_TRAP),y)`.

Removing the unnecessary field from task_struct will help preserve the ABI between vanilla and CONFIG_UBSAN_TRAP'ed kernels. In particular, this will help enabling the bounds sanitizer transparently for Android's GKI.

Signed-off-by: Elena Petrova <lenaptr@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Link: https://lkml.kernel.org/r/20200910134802.3160311-1-lenaptr@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 14 October 2020, 1 commit
-
-
Committed by Patricia Alfonso
Patch series "KASAN-KUnit Integration", v14.

This patchset contains everything needed to integrate KASAN and KUnit.

KUnit will be able to:
(1) Fail tests when an unexpected KASAN error occurs
(2) Pass tests when an expected KASAN error occurs

Convert KASAN tests to KUnit with the exception of copy_user_test because KUnit is unable to test those.

Add documentation on how to run the KASAN tests with KUnit and what to expect when running these tests.

This patch (of 5):

In order to integrate debugging tools like KASAN into the KUnit framework, add a KUnit struct to the current task to keep track of the current KUnit test.

Signed-off-by: Patricia Alfonso <trishalfonso@google.com>
Signed-off-by: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Shuah Khan <shuah@kernel.org>
Link: https://lkml.kernel.org/r/20200915035828.570483-1-davidgow@google.com
Link: https://lkml.kernel.org/r/20200915035828.570483-2-davidgow@google.com
Link: https://lkml.kernel.org/r/20200910070331.3358048-1-davidgow@google.com
Link: https://lkml.kernel.org/r/20200910070331.3358048-2-davidgow@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 07 October 2020, 1 commit
-
-
Committed by Tony Luck
Existing kernel code can only recover from a machine check on code that is tagged in the exception table with a fault handling recovery path.

Add two new fields in the task structure to pass information from the machine check handler to the "task_work" that is queued to run before the task returns to user mode:

+ mce_vaddr: will be initialized to the user virtual address of the fault in the case where the fault occurred in the kernel copying data from a user address. This is so that kill_me_maybe() can provide that information to the user SIGBUS handler.

+ mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe() to determine if mce_vaddr is applicable to this error.

Add code to recover from a machine check while copying data from user space to the kernel. Action for this case is the same as if the user touched the poison directly; unmap the page and send a SIGBUS to the task.

Use a new helper function to share common code between the "fault in user mode" case and the "fault while copying from user" case.

New code paths will be activated by the next patch which sets MCE_IN_KERNEL_COPYIN.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
-
- 03 October 2020, 1 commit
-
-
Committed by Vincent Donnefort
rq->cpu_capacity is a key element in several scheduler parts, such as EAS task placement and load balancing. Tracking this value enables testing and/or debugging by a toolkit.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com
-
- 01 October 2020, 1 commit
-
-
Committed by Jens Axboe
Grab actual references to the files_struct. To avoid circular reference issues due to this, we add a per-task note that keeps track of what io_uring contexts a task has used. When the task execs or exits its assigned files, we cancel requests based on this tracking.

With that, we can grab proper references to the files table, and no longer need to rely on stashing away ring_fd and ring_file to check if the ring_fd may have been closed.

Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 26 August 2020, 2 commits
-
-
The bits PF_IO_WORKER and PF_WQ_WORKER are tested together in sched_submit_work(), which is considered to be a hot path. If the two bits cross the 8 or 16 bit boundary then most architectures require multiple load instructions in order to create the constant value. Also, such a value can not be encoded within the compare opcode.

By moving the bit definitions within the same block, the compiler can create/use one immediate value.

For some reason gcc-10 on ARM64 requires both bits to be next to each other in order to issue "tst reg, val; bne label". Otherwise the result is "mov reg1, val; tst reg, reg1; bne label".

Move PF_VCPU out of the way so that PF_IO_WORKER can be next to PF_WQ_WORKER.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200819195505.y3fxk72sotnrkczi@linutronix.de
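A small sketch of the point being made: when the two flags live in the same byte, the combined test below can compile down to a single immediate-operand test (ideally "tst reg, val; bne label" on arm64). The flag values here are illustrative, not the exact task_struct flag layout.

	#include <stdbool.h>

	/* Illustrative flag values: both bits inside the same byte. */
	#define PF_IO_WORKER	0x00000010u
	#define PF_WQ_WORKER	0x00000020u

	/* Sketch of the hot-path test in sched_submit_work(). */
	static bool is_worker(unsigned int task_flags)
	{
		return task_flags & (PF_IO_WORKER | PF_WQ_WORKER);
	}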
-
Committed by Marco Elver
is_idle_task() may be used from noinstr functions such as irqentry_enter(). Since the compiler is free to not inline regular inline functions, switch to using __always_inline.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200820172046.GA177701@elver.google.com
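For reference, the resulting helper is essentially a one-liner; the sketch below reflects how it looks after the change (the PF_IDLE test is the existing implementation, only the inlining annotation changes).

	/* Forced inlining keeps calls from .noinstr.text safe: no out-of-line
	 * (instrumentable) copy of the helper can be emitted. */
	static __always_inline bool is_idle_task(const struct task_struct *p)
	{
		return !!(p->flags & PF_IDLE);
	}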
-
- 06 August 2020, 2 commits
-
-
Committed by Thomas Gleixner
Running posix CPU timers in hard interrupt context has a few downsides:

- For PREEMPT_RT it cannot work as the expiry code needs to take sighand lock, which is a 'sleeping spinlock' in RT. The original RT approach of offloading the posix CPU timer handling into a high priority thread was clumsy and provided no real benefit in general.

- For fine grained accounting it's just wrong to run this in context of the timer interrupt because that way a process specific CPU time is accounted to the timer interrupt.

- Long running timer interrupts caused by a large amount of expiring timers which can be created and armed by unprivileged user space.

There is no hard requirement to expire them in interrupt context.

If the signal is targeted at the task itself then it won't be delivered before the task returns to user space anyway. If the signal is targeted at a supervisor process then it might be slightly delayed, but posix CPU timers are inaccurate anyway due to the fact that they are tied to the tick.

Provide infrastructure to schedule task work which allows splitting the posix CPU timer code into a quick check in interrupt context and a thread context expiry and signal delivery function. This has to be enabled by architectures as it requires that the architecture specific KVM implementation handles pending task work before exiting to guest mode.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200730102337.783470146@linutronix.de
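A schematic sketch of the split described above; the function bodies and the expiry_pending() quick-check helper are placeholders (the real code lives in kernel/time/posix-cpu-timers.c), shown only to illustrate the IRQ-context/task-context division.

	/* Task context: safe to take sighand lock, expire timers and deliver signals. */
	static void posix_cpu_timers_work(struct callback_head *work)
	{
		handle_posix_cpu_timers(current);
	}

	/* Called from the tick, in hard IRQ context. */
	void run_posix_cpu_timers(void)
	{
		struct task_struct *tsk = current;

		if (!expiry_pending(tsk))	/* placeholder for the cheap check; no expiry here */
			return;

		/* Defer the heavy lifting until the task is about to return to user mode. */
		init_task_work(&tsk->posix_cputimers_work.work, posix_cpu_timers_work);
		task_work_add(tsk, &tsk->posix_cputimers_work.work, TWA_RESUME);
	}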
-
Committed by Peter Zijlstra
By using lockdep_assert_*() from seqlock.h, the spaghetti monster attacked.

Attack back by reducing seqlock.h dependencies from two key high level headers:

 - <linux/seqlock.h>:               -Remove <linux/ww_mutex.h>
 - <linux/time.h>:                  -Remove <linux/seqlock.h>
 - <linux/sched.h>:                 +Add <linux/seqlock.h>

The price was to add it to sched.h ...

Core header fallout, we add direct header dependencies instead of gaining them parasitically from higher level headers:

 - <linux/dynamic_queue_limits.h>:  +Add <asm/bug.h>
 - <linux/hrtimer.h>:               +Add <linux/seqlock.h>
 - <linux/ktime.h>:                 +Add <asm/bug.h>
 - <linux/lockdep.h>:               +Add <linux/smp.h>
 - <linux/sched.h>:                 +Add <linux/seqlock.h>
 - <linux/videodev2.h>:             +Add <linux/kernel.h>

Arch headers fallout:

 - PARISC: <asm/timex.h>:           +Add <asm/special_insns.h>
 - SH: <asm/io.h>:                  +Add <asm/page.h>
 - SPARC: <asm/timer_64.h>:         +Add <uapi/asm/asi.h>
 - SPARC: <asm/vvar.h>:             +Add <asm/processor.h>, <asm/barrier.h>
                                    -Remove <linux/seqlock.h>
 - X86: <asm/fixmap.h>:             +Add <asm/pgtable_types.h>
                                    -Remove <asm/acpi.h>

There's also a bunch of parasitic header dependency fallout in .c files, not listed separately.

[ mingo: Extended the changelog, split up & fixed the original patch. ]

Co-developed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200804133438.GK2674@hirez.programming.kicks-ass.net
-
- 31 July 2020, 2 commits
-
-
Committed by Marco Elver
To improve the general usefulness of the IRQ state trace events with KCSAN enabled, save and restore the trace information when entering and exiting the KCSAN runtime as well as when generating a KCSAN report.

Without this, reporting the IRQ trace events (whether via a KCSAN report or outside of KCSAN via a lockdep report) is rather useless due to continuously being touched by KCSAN. This is because if KCSAN is enabled, every instrumented memory access causes changes to IRQ trace events (either by KCSAN disabling/enabling interrupts or taking report_lock when generating a report).

Before "lockdep: Prepare for NMI IRQ state tracking", KCSAN avoided touching the IRQ trace events via raw_local_irq_save/restore() and lockdep_off/on().

Fixes: 248591f5 ("kcsan: Make KCSAN compatible with new IRQ state tracking")
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200729110916.3920464-2-elver@google.com
-
Committed by Marco Elver
Refactor the IRQ trace events fields, used for printing information about the IRQ trace events, into a separate struct 'irqtrace_events'. This improves readability by separating the information only used in reporting, as well as enabling (simplified) storing/restoring of irqtrace_events snapshots.

No functional change intended.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200729110916.3920464-1-elver@google.com
-