Commits · e7b88a8afaa422663613dc9dcf939f6889809362 · openanolis / cloud-kernel

09 Jun, 2020 7 commits

psi: Move PF_MEMSTALL out of task->flags · e7b88a8a

Authored by Yafang Shao on Mar 16, 2020

task #28327019

commit 1066d1b6974e095d5a6c472ad9180a957b496cd6 upstream

The task->flags is a 32-bit flag word, in which 31 bits have already been
consumed, so it is hard to introduce another new per-process flag.
Currently there is still enough space in the bit-field section of
task_struct, so we can define the memstall state as a single bit in
task_struct instead.
This patch also removes an out-of-date comment pointed out by Matthew.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

e7b88a8a
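
The idea in e7b88a8a can be sketched in userspace as follows. This is only an illustration: `struct task_sketch` and its field names are hypothetical and do not reproduce the kernel's task_struct layout. It shows a one-bit bit-field carrying the memstall state, so that no bit of the 32-bit flags word is consumed.

```c
#include <stdint.h>

/* Hypothetical stand-in for task_struct; field names are illustrative only. */
struct task_sketch {
    uint32_t flags;                   /* 32-bit flag word, nearly full */
    unsigned in_memstall_sketch : 1;  /* memstall state as one bit-field bit */
};

/* Mark the task as stalled on memory without touching the flags word. */
void memstall_enter_sketch(struct task_sketch *t)
{
    t->in_memstall_sketch = 1;
}

/* Clear the memstall state again. */
void memstall_leave_sketch(struct task_sketch *t)
{
    t->in_memstall_sketch = 0;
}
```

Setting or clearing the bit leaves `flags` untouched, which is the point of moving the state out of the flag word.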

psi: Optimize switching tasks inside shared cgroups · 0e5c5cd8

Authored by Johannes Weiner on Mar 16, 2020

task #28327019

commit 36b238d5717279163859fb6ba0f4360abcafab83 upstream

When switching tasks running on a CPU, the psi state of a cgroup
containing both of these tasks does not change. Right now, we don't
exploit that, and can perform many unnecessary state changes in nested
hierarchies, especially when most activity comes from one leaf cgroup.

This patch implements an optimization where we only update cgroups
whose state actually changes during a task switch. These are all
cgroups that contain one task but not the other, up to the first
shared ancestor. When both tasks are in the same group, we don't need
to update anything at all.

We can identify the first shared ancestor by walking the groups of the
incoming task until we see TSK_ONCPU set on the local CPU; that's the
first group that also contains the outgoing task.

The new psi_task_switch() is similar to psi_task_change(). To allow
code reuse, move the task flag maintenance code into a new function
and the poll/avg worker wakeups into the shared psi_group_change().

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

0e5c5cd8
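
The ancestor walk described in 0e5c5cd8 can be sketched roughly as below. This is a simplified userspace model, not the kernel code: `struct group_sketch` and `task_switch_sketch()` are hypothetical names, and a single counter stands in for the per-CPU TSK_ONCPU state of a group.

```c
#include <stddef.h>

/* Hypothetical per-CPU view of a cgroup: parent link plus on-CPU count. */
struct group_sketch {
    struct group_sketch *parent;
    int nr_oncpu;
};

/*
 * Switch from a task in leaf group `prev` to a task in leaf group `next`.
 * Returns the number of groups whose state was updated.
 */
int task_switch_sketch(struct group_sketch *prev, struct group_sketch *next)
{
    struct group_sketch *g, *common = NULL;
    int updates = 0;

    /* Walk up from the incoming task until a group already marked on-CPU. */
    for (g = next; g != NULL; g = g->parent) {
        if (g->nr_oncpu) {
            common = g;   /* first group that also contains the outgoing task */
            break;
        }
        g->nr_oncpu++;
        updates++;
    }

    /* Clear the outgoing task's groups below the shared ancestor only. */
    for (g = prev; g != NULL && g != common; g = g->parent) {
        g->nr_oncpu--;
        updates++;
    }
    return updates;
}
```

With two sibling leaf groups under one parent, only the two leaves are updated; when both tasks sit in the same group, nothing is updated at all.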

sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime · 837b1ac1

Authored by Johannes Weiner on Dec 03, 2019

task #28327019

commit 3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985 upstream

Jingfeng reports rare div0 crashes in psi on systems with some uptime:

[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330

The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.

We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.

The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.

Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.

Reported-by: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

837b1ac1
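
The arithmetic behind the crash in 837b1ac1 can be checked with a small userspace sketch. The helper names are hypothetical; the only value taken from the report is the R8 register content 0x0000359500000000. Treating that value as nanoseconds gives about 58914 seconds, which is close to the timestamp in the oops, consistent with a period measured from a last-update time of 0.

```c
#include <stdint.h>

/* Sampling period as the distance from the last update to now, in ns. */
uint64_t sampling_period_sketch(uint64_t now_ns, uint64_t last_update_ns)
{
    return now_ns - last_update_ns;
}

/* What a 32-bit divisor sees: only the low 32 bits of the period. */
uint32_t low32_sketch(uint64_t period_ns)
{
    return (uint32_t)period_ns;
}
```

For the reported value, `low32_sketch(sampling_period_sketch(0x0000359500000000ULL, 0))` is 0, while a regular period of about 2 s (2000000000 ns) fits into 32 bits unchanged.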

psi: Fix cpu.pressure for cpu.max and competing cgroups · 9bf3d89c

Authored by Johannes Weiner on Mar 16, 2020

task #28327019

commit b05e75d611380881e73edc58a20fd8c6bb71720b upstream

For simplicity, cpu pressure is defined as having more than one
runnable task on a given CPU. This works on the system-level, but it
has limitations in a cgrouped reality: When cpu.max is in use, it
doesn't capture the time in which a task is not executing on the CPU
due to throttling. Likewise, it doesn't capture the time in which a
competing cgroup is occupying the CPU - meaning it only reflects
cgroup-internal competitive pressure, not outside pressure.

Enable tracking of currently executing tasks, and then change the
definition of cpu pressure in a cgroup from

	NR_RUNNING > 1

to

	NR_RUNNING > ON_CPU

which will capture the effects of cpu.max as well as competition from
outside the cgroup.

After this patch, a cgroup running `stress -c 1` with a cpu.max
setting of 5000 10000 shows ~50% continuous CPU pressure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

9bf3d89c
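
The two definitions of cpu pressure from 9bf3d89c can be compared side by side in a small sketch. The function names are hypothetical and the counters are plain integers rather than the kernel's per-group task counts.

```c
#include <stdbool.h>

/* Old definition: pressure when more than one task is runnable. */
bool cpu_some_old_sketch(unsigned nr_running)
{
    return nr_running > 1;
}

/* New definition: pressure when some runnable task is not executing. */
bool cpu_some_new_sketch(unsigned nr_running, unsigned nr_oncpu)
{
    return nr_running > nr_oncpu;
}
```

A single runnable task that is throttled (one running, zero on CPU) shows no pressure under the old definition but does under the new one. With a cpu.max of 5000 10000 such a task is off the CPU for roughly half of each period, which matches the ~50% figure quoted above.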

sched/psi: Fix OOB write when writing 0 bytes to PSI files · d76b7846

Authored by Suren Baghdasaryan on Feb 03, 2020

task #28327019

commit 6fcca0fa48118e6d63733eb4644c6cd880c15b8f upstream

Issuing write() with count parameter set to 0 on any file under
/proc/pressure/ will cause an OOB write because of the access to
buf[buf_size-1] when NUL-termination is performed. Fix this by checking
for buf_size to be non-zero.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

d76b7846
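
The failure mode in d76b7846 comes from unsigned arithmetic: with a size of 0, `buf_size - 1` wraps around instead of becoming negative. A userspace sketch of the guarded pattern, using a hypothetical helper and memcpy() in place of the kernel's user-copy routine, could look like this:

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy at most dst_size bytes from src and NUL-terminate the result.
 * Returns the number of bytes kept, or -1 when there is nothing to copy.
 */
long copy_and_terminate_sketch(char *dst, size_t dst_size,
                               const char *src, size_t nbytes)
{
    size_t buf_size;

    /* Without this check, buf_size - 1 below would wrap to SIZE_MAX. */
    if (nbytes == 0 || dst_size == 0)
        return -1;

    buf_size = nbytes < dst_size ? nbytes : dst_size;
    memcpy(dst, src, buf_size);
    dst[buf_size - 1] = '\0';   /* last copied byte becomes the terminator */
    return (long)buf_size;
}
```

A zero-length write is rejected up front, so the terminator is only ever written inside the buffer.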

sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled · f408f751

Authored by Wang Long on Dec 18, 2019

task #28327019

commit 3d817689a62cf71bbb290af18cd26cf9764f38fe upstream

When psi is disabled, either by CONFIG_PSI_DEFAULT_DISABLED or by psi=0 on
the command line, I think we should not create /proc/pressure and
/proc/pressure/{io|memory|cpu}.

In the future, users may determine whether the psi feature is enabled by
checking for the existence of the /proc/pressure dir or the
/proc/pressure/{io|memory|cpu} files.

Signed-off-by: Wang Long <w@laoqinren.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1576672698-32504-1-git-send-email-w@laoqinren.net
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

f408f751
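
The existence check that f408f751 makes possible might be written in userspace along these lines. The helper names are hypothetical; access() with F_OK is the standard POSIX call for testing whether a path exists.

```c
#include <stdbool.h>
#include <unistd.h>

/* Report whether a path exists; F_OK tests for existence only. */
bool path_exists_sketch(const char *path)
{
    return access(path, F_OK) == 0;
}

/* On a kernel with this change, the directory exists only with psi enabled. */
bool psi_enabled_sketch(void)
{
    return path_exists_sketch("/proc/pressure");
}
```

On kernels without this change the directory may exist even when psi is off, so the check is only meaningful where the commit is present.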

psi: Fix a division error in psi poll() · 01a6f356

Authored by Johannes Weiner on Dec 03, 2019

task #28327019

commit c3466952ca1514158d7c16c9cfc48c27d5c5dc0f upstream

The psi window size is a u64 and can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.

Use div64_u64.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

01a6f356
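
To see why a 32-bit divisor is a problem in 01a6f356, consider a 10 second window expressed in nanoseconds. The sketch below uses hypothetical helpers that only mimic the parameter widths of a 32-bit-divisor and a 64-bit-divisor division helper; it is not the kernel implementation.

```c
#include <stdint.h>

/* Mimics a helper taking a 32-bit divisor: a u64 argument gets truncated. */
uint64_t div_by_u32_sketch(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}

/* Mimics a helper taking a full 64-bit divisor. */
uint64_t div_by_u64_sketch(uint64_t dividend, uint64_t divisor)
{
    return dividend / divisor;
}
```

A window of 10000000000 ns truncates to 1410065408 when squeezed into 32 bits, so dividing the window by itself yields 7 instead of 1. A window whose low 32 bits are all zero would truncate to 0 and fault.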

08 Jun, 2020 1 commit

sched/fair: Don't NUMA balance for kthreads · 3b541a29

Authored by Jens Axboe on May 26, 2020

fix #28384496

commit 18f855e574d9799a0e7489f8ae6fd8447d0dd74a upstream

Stefano reported a crash with using SQPOLL with io_uring:

  BUG: kernel NULL pointer dereference, address: 00000000000003b0
  CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
  RIP: 0010:task_numa_work+0x4f/0x2c0
  Call Trace:
   task_work_run+0x68/0xa0
   io_sq_thread+0x252/0x3d0
   kthread+0xf9/0x130
   ret_from_fork+0x35/0x40

which is task_numa_work() oopsing on current->mm being NULL.

The task work is queued by task_tick_numa(), which checks if current->mm is
NULL at the time of the call. But this state isn't necessarily persistent,
if the kthread is using use_mm() to temporarily adopt the mm of a task.

Change the task_tick_numa() check to exclude kernel threads in general,
as it doesn't make sense to attempt to balance for kthreads anyway.

Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

3b541a29
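
The difference between the two checks in 3b541a29 can be modelled in userspace. Everything here is a simplified assumption: the struct, the function names and the flag values are illustrative, and the real check in task_tick_numa() has further conditions that are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits; the real values live in the kernel headers. */
#define PF_EXITING_SKETCH 0x00000004u
#define PF_KTHREAD_SKETCH 0x00200000u

struct numa_task_sketch {
    void *mm;            /* NULL for a kthread, unless it borrowed an mm */
    unsigned int flags;
};

/* Old check: skip only when there is no mm at the time of the tick. */
bool should_queue_old_sketch(const struct numa_task_sketch *t)
{
    return t->mm != NULL && !(t->flags & PF_EXITING_SKETCH);
}

/* New check: skip kernel threads even if they temporarily adopted an mm. */
bool should_queue_new_sketch(const struct numa_task_sketch *t)
{
    return !(t->flags & (PF_EXITING_SKETCH | PF_KTHREAD_SKETCH));
}
```

A kthread that has borrowed an mm passes the old check, so work gets queued and may later run after the mm is dropped. The new check rejects it regardless of the mm pointer.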

28 May, 2020 1 commit

nsfs: clean-up ns_get_path() signature to return int · 39f702ab

Authored by Aleksa Sarai on Dec 07, 2019

to #26323588

commit ce623f89872df4253719be71531116751eeab85f upstream.

ns_get_path() and ns_get_path_cb() only ever return either NULL or an
ERR_PTR. It is far more idiomatic to simply return an integer, and it
makes all of the callers of ns_get_path() more straightforward to read.

Fixes: e149ed2b ("take the targets of /proc/*/ns/* symlinks to separate fs")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

39f702ab

26 May, 2020 1 commit

alinux: sched: Fix regression caused by nr_uninterruptible · 1607a485

Authored by Yihao Wu on May 26, 2020

fix #27788368

Per-cgroup nr_uninterruptible tracking leads to a huge performance regression
in hackbench. This patch deletes the nr_uninterruptible-related code for now
to address the performance regression.

Fixes: 9410d314 ("alinux: cpuacct: Export nr_running & nr_uninterruptible")
Fixes: 36da4fe9 ("alinux: sched: Maintain "nr_uninterruptible" in runqueue")
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>

1607a485

07 May, 2020 1 commit

alinux: sched: Fix p->cpu build error on aarch64 · 5be663e3

Authored by Yihao Wu on May 01, 2020

to #27372989

CONFIG_THREAD_INFO_IN_TASK is not set for aarch64, so task_struct has no
cpu member. We should use the helper function task_cpu() instead.

Fixes: 9e7b35d6 ("alinux: sched: Introduce per-cgroup iowait accounting")
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

5be663e3

06 May, 2020 2 commits

blktrace: fix dereference after null check · d9dbee73

Authored by Cengiz Can on Mar 04, 2020

to #24913189

commit 153031a301bb07194e9c37466cfce8eacb977621 upstream.

There was a recent change in blktrace.c that added RCU protection to
`q->blk_trace` in order to fix a use-after-free issue during access.

However the change missed an edge case that can lead to dereferencing of
`bt` pointer even when it's NULL:

Coverity static analyzer marked this as a FORWARD_NULL issue with CID
1460458.

```
/kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store()
1898            ret = 0;
1899            if (bt == NULL)
1900                    ret = blk_trace_setup_queue(q, bdev);
1901
1902            if (ret == 0) {
1903                    if (attr == &dev_attr_act_mask)
>>>     CID 1460458:  Null pointer dereferences  (FORWARD_NULL)
>>>     Dereferencing null pointer "bt".
1904                            bt->act_mask = value;
1905                    else if (attr == &dev_attr_pid)
1906                            bt->pid = value;
1907                    else if (attr == &dev_attr_start_lba)
1908                            bt->start_lba = value;
1909                    else if (attr == &dev_attr_end_lba)
```

Added a reassignment with RCU annotation to fix the issue.

Fixes: c780e86dd48 ("blktrace: Protect q->blk_trace with RCU")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Cengiz Can <cengiz@kernel.wtf>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
References: CVE-2019-19768
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

d9dbee73
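
The pattern behind d9dbee73 is a stale local pointer: `bt` is read once, the setup call installs a new structure, and the old NULL copy is then dereferenced. A userspace sketch of the corrected flow follows. The types and function names are hypothetical, and the RCU annotation and locking of the real fix are deliberately left out.

```c
#include <stdlib.h>

struct trace_sketch { unsigned int act_mask; };
struct queue_sketch { struct trace_sketch *trace; };

/* Allocate the trace structure on demand; returns 0 on success. */
int setup_queue_sketch(struct queue_sketch *q)
{
    q->trace = calloc(1, sizeof(*q->trace));
    return q->trace ? 0 : -1;
}

/* Store a value, setting up the trace first when none exists yet. */
int store_act_mask_sketch(struct queue_sketch *q, unsigned int value)
{
    struct trace_sketch *bt = q->trace;
    int ret = 0;

    if (bt == NULL) {
        ret = setup_queue_sketch(q);
        bt = q->trace;   /* re-read: the local copy is stale after setup */
    }
    if (ret == 0)
        bt->act_mask = value;
    return ret;
}
```

Without the re-read inside the `if`, a successful setup would still leave `bt` as NULL at the point of the store.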

blktrace: Protect q->blk_trace with RCU · 62b6bd7c

Authored by Jan Kara on Feb 06, 2020

to #24913189

commit c780e86dd48ef6467a1146cf7d0fe1e05a635039 upstream.

KASAN is reporting that __blk_add_trace() has a use-after-free issue
when accessing q->blk_trace. Indeed the switching of block tracing (and
thus eventual freeing of q->blk_trace) is completely unsynchronized with
the currently running tracing and thus it can happen that the blk_trace
structure is being freed just while __blk_add_trace() works on it.
Protect accesses to q->blk_trace by RCU during tracing and make sure we
wait for the end of the RCU grace period when shutting down tracing. Luckily
that is a rare enough event that we can afford that. Note that postponing
the freeing of blk_trace to an RCU callback should better be avoided, as
it could have unexpected user-visible side effects: debugfs files
would still exist for a short while after block tracing has been shut
down.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
CC: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Tristan Madani <tristmd@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[bwh: Backported to 4.19: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
References: CVE-2019-19768
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

62b6bd7c

01 May, 2020 1 commit

alinux: sched: Fix nr_migrations compile errors · 435d7069

Authored by Yihao Wu on Apr 30, 2020

to #27363370

When CONFIG_SCHED_SLI is not set, the compiler reports errors about
redefinition of task_ca_increase_nr_migrations. It should have been
an empty implementation in this case.

Fixes: 965d75d3 ("alinux: cpuacct: make cpuacct record nr_migrations")
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

435d7069

26 Apr, 2020 2 commits

alinux: kernel: Reduce tasklist_lock contention at fork and exit · f14304cd

Authored by Xunlei Pang on Nov 16, 2018

to #16966300
to #16966377

We observed heavy tasklist_lock contention when offline tasks
start and end, which caused long scheduling latency and long
periods with interrupts off, since write_lock_irq() is called.

In extreme cases with tons of concurrent fork and exit events,
it can cause the system to hang.

This patch changes them to use trylock, so that operations within
the lock consume little time, which naturally addresses the issue.

After this patch, when I launched and killed thousands of tasks,
the latency dropped from tens of milliseconds to around 2ms on
my box.

The patch passes the Unixbench tests with no regression introduced.

Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Acked-by: Shile Zhang <shile.zhang@linux.alibaba.com>

f14304cd

alinux: mm: restrict the print message frequency further when memcg oom triggers · 8962f125

Authored by zhongjiang-ali on Feb 24, 2020

to #24843736

Too many memcg oom messages can trigger a softlockup.
In general, the system and memcg share the same ratelimit, oom_rc,
to limit the printed messages. But a memcg exceeds its limit more
frequently, so it runs into oom more easily, and a lot of
information is then printed. That is likely to trigger a softlockup.

This patch uses separate ratelimits for memcg oom and system oom. We
tested the patch using the default value for the memcg, and the issue
went away.

[xuyu: adjust corresponding sysctl indexes]
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>

8962f125
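
The effect of giving memcg oom and system oom separate limiters, as 8962f125 describes, can be illustrated with a simple fixed-window limiter. This is a hypothetical sketch and not the kernel's ratelimit implementation; the struct, the function and the parameter values are assumptions chosen for the example, and time is passed in explicitly to keep it deterministic.

```c
#include <stdbool.h>

/* Hypothetical fixed-window limiter: at most `burst` messages per window. */
struct ratelimit_sketch {
    long interval;   /* window length, in caller-defined ticks */
    int burst;       /* messages allowed per window */
    long begin;      /* start of the current window */
    int printed;     /* messages already allowed in this window */
};

/* Return true when a message may be printed at time `now`. */
bool ratelimit_allow_sketch(struct ratelimit_sketch *rl, long now)
{
    if (now - rl->begin >= rl->interval) {
        rl->begin = now;     /* open a new window */
        rl->printed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }
    return false;
}
```

With one limiter instance per source, a burst of memcg messages exhausts only the memcg budget, and system-wide oom messages are still printed.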

24 Apr, 2020 9 commits

alinux: sched: Introduce per-cgroup iowait accounting · 9e7b35d6

Authored by Yihao Wu on Apr 21, 2020

to #26424323

We account iowait when the cgroup's se is idle and it has blocked
tasks in the hierarchy of se->my_q.

To achieve this, we also add cg_nr_running to track the hierarchical
number of blocked tasks. We update it when a blocked task wakes up or
a task is blocked.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>

9e7b35d6

alinux: sched: Introduce per-cgroup steal accounting · c7552980

Authored by Yihao Wu on Mar 10, 2020

to #26424323

From the previous patch, we know there are 4 possible states.
Since the steal state's transitions are complex, we choose to account
for its complement.

        steal = elapse - idle - sum_exec_raw - ineffective

where elapse is the time since the cgroup was created, sum_exec_raw is
the running time including IRQ time, and ineffective is the total time that
the cpuacct-bound cpuset doesn't allow this cpu for the cgroup.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>

c7552980
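
The steal formula from c7552980 is plain subtraction and can be written as a short helper. The function name is hypothetical, and the clamp to zero is an assumption added for the sketch rather than something stated in the patch.

```c
#include <stdint.h>

/* steal = elapse - idle - sum_exec_raw - ineffective, all in the same unit. */
uint64_t steal_sketch(uint64_t elapse, uint64_t idle,
                      uint64_t sum_exec_raw, uint64_t ineffective)
{
    uint64_t accounted = idle + sum_exec_raw + ineffective;

    /* Clamp to zero; this guard is an assumption, not taken from the patch. */
    return elapse > accounted ? elapse - accounted : 0;
}
```

For example, with 1000 units elapsed, 300 idle, 500 running and 100 ineffective, the remaining 100 units are attributed to steal.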

alinux: sched: Introduce per-cgroup idle accounting · 61e58859

Authored by Yihao Wu on Mar 10, 2020

to #26424323

Since we are concerned with idle, let's take idle as the central state and
omit transitions between other states. Below is the state transition graph:

                                sleep->deque
+-----------+ cpumask +-------+ exit->deque +-------+
|ineffective|-------- | idle  | <-----------|running|
+-----------+         +-------+             +-------+
                        ^ |
 unthrtl child -> deque | |
          wake -> deque | |thrtl child -> enque
       migrate -> deque | |migrate -> enque
                        | v
                      +-------+
                      | steal |
                      +-------+

We summarize the idle state condition as:

!se->on_rq && !my_q->throttled && cpu allowed.

From this graph and condition, we can hook (de|en)queue_task_fair,
update_cpumasks_hier and (un|)throttle_cfs_rq to account idle state.

In the hooked functions, we also check the conditions to avoid
accounting unwanted cpu clocks.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>

61e58859
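
The idle condition stated in 61e58859 translates directly into a predicate. The struct below is a hypothetical flattening of the three inputs into booleans; in the kernel they come from different structures.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the three inputs for one cgroup on one CPU. */
struct cg_cpu_state_sketch {
    bool on_rq;        /* se->on_rq */
    bool throttled;    /* my_q->throttled */
    bool cpu_allowed;  /* the cpuset allows this CPU */
};

/* Idle: not queued, not throttled, and the CPU is allowed. */
bool cgroup_is_idle_sketch(const struct cg_cpu_state_sketch *s)
{
    return !s->on_rq && !s->throttled && s->cpu_allowed;
}
```

Each of the other three states in the graph fails exactly one clause: running is on the runqueue, a throttled group counts as steal, and a disallowed CPU is ineffective.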

alinux: cpuacct: make cpuacct record nr_migrations · 965d75d3

Authored by Yihao Wu on Jan 17, 2020

to #26424323

This patch makes cpuacct able to monitor the number of
cross-CPU migrations. Sample output:

   [root@caspar /sys/fs/cgroup/cpuacct]
   # cat cpuacct.proc_stat
   user 7727
   nice 4
   <snip>
   nr_migrations 48432

Signed-off-by: Zhu Yanhai <zhu.yanhai@linux.alibaba.com>
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>

965d75d3
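
A consumer of the output shown in 965d75d3 could extract a counter with a small parser. This is a sketch that assumes the file consists of "name value" lines as in the sample; the function name is hypothetical and the text is passed in as a string rather than read from the cgroup file.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find `key` among "name value" lines of `text` and store its value.
 * Returns 0 on success, -1 when the key is absent.
 */
int proc_stat_lookup_sketch(const char *text, const char *key,
                            unsigned long long *value)
{
    const char *line = text;

    while (line != NULL && *line != '\0') {
        char name[64];
        unsigned long long v;

        /* Accept the line only when a name and a number both parse. */
        if (sscanf(line, "%63s %llu", name, &v) == 2 &&
            strcmp(name, key) == 0) {
            *value = v;
            return 0;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;   /* step past the newline to the next line */
    }
    return -1;
}
```

Called with the sample text and the key "nr_migrations", it stores 48432.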

alinux: cpuacct: Export nr_running & nr_uninterruptible · 9410d314

Authored by Yihao Wu on Jan 17, 2020

to #26424323

cpu cgroup's nr_running and nr_uninterruptible are useful
for troubleshooting. Export them in cpuacct.proc_stat.

Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>

9410d314

alinux: sched/cputime: Fix guest cputime of cpuacct.proc_stat · 004b3ba9

Authored by Shanpei Chen on Nov 06, 2019

to #26424323

For container-only cases, guest cputime is always 0, so previously we
did not calculate it and returned 0 directly.

However, when running a VM inside a cgroup, we expect the cgroup to
maintain guest cputime correctly.

Tested-by: Yihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>

004b3ba9

alinux: cpuacct/proc_stat: Consider isolcpus · 8ab45c97

Authored by Xunlei Pang on Nov 06, 2019

to #26424323

When "isolcpus=" is passed, skip all its accountings.

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Tested-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>

8ab45c97

alinux: cpuacct: export cpuacct.proc_stat interface · 9be0ac2b

Authored by Xunlei Pang on Jul 23, 2019

to #26424323

Add the cgroup file "cpuacct.proc_stat". We'll export per-cgroup
cpu usage and some other scheduler statistics through this interface.

Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>

9be0ac2b

alinux: sched: Maintain "nr_uninterruptible" in runqueue · 36da4fe9

Authored by Xunlei Pang on Jul 24, 2019

to #26424323

It's relatively easy to maintain nr_uninterruptible in scheduler
compared to doing it in cpuacct, we assume that "cpu,cpuacct" are
bound together, so that it can be used for per-cgroup load.

This will be needed to calculate per-cgroup load average later.

Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>

36da4fe9

23 Apr, 2020 1 commit

mm, compaction: capture a page under direct compaction · 35d915be

Authored by Mel Gorman on Apr 04, 2020

to #26255339

commit 5e1f0f098b4649fad53011246bcaeff011ffdf5d upstream

Compaction is inherently race-prone as a suitable page freed during
compaction can be allocated by any parallel task. This patch uses a
capture_control structure to isolate a page immediately when it is freed
by a direct compactor in the slow path of the page allocator. The
intent is to avoid redundant scanning.

                                     5.0.0-rc1              5.0.0-rc1
                               selective-v3r17          capture-v3r19
Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)

Latency is only moderately affected but the devil is in the details. A
closer examination indicates that base page fault latency is reduced but
latency of huge pages is increased as it takes greater care to succeed.
Part of the "problem" is that allocation success rates are close to 100%
even when under pressure and compaction gets harder.

                                5.0.0-rc1              5.0.0-rc1
                          selective-v3r17          capture-v3r19
Percentage huge-3        96.70 (   0.00%)       98.23 (   1.58%)
Percentage huge-5        96.99 (   0.00%)       95.30 (  -1.75%)
Percentage huge-7        94.19 (   0.00%)       97.24 (   3.24%)
Percentage huge-12       94.95 (   0.00%)       97.35 (   2.53%)
Percentage huge-18       96.74 (   0.00%)       97.30 (   0.58%)
Percentage huge-24       97.07 (   0.00%)       97.55 (   0.50%)
Percentage huge-30       95.69 (   0.00%)       98.50 (   2.95%)
Percentage huge-32       96.70 (   0.00%)       99.27 (   2.65%)

And scan rates are reduced as expected by 6% for the migration scanner
and 29% for the free scanner indicating that there is less redundant
work.

Compaction migrate scanned    20815362    19573286
Compaction free scanned       16352612    11510663

[mgorman@techsingularity.net: remove redundant check]
Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

35d915be
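
The scan-rate reductions quoted in 35d915be can be reproduced from the two "Compaction ... scanned" rows with a one-line helper. The function name is hypothetical; the inputs are the counts from the commit message.

```c
/* Percentage by which `after` is lower than `before`. */
double percent_reduction_sketch(double before, double after)
{
    return (before - after) * 100.0 / before;
}
```

For the migration scanner (20815362 to 19573286) this gives just under 6%, and for the free scanner (16352612 to 11510663) about 29.6%, in line with the rounded figures in the text.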

22 Apr, 2020 4 commits

sysctl: handle overflow in proc_get_long · 662ef34f

Authored by Christian Brauner on Mar 07, 2019

fix #27124689

commit 7f2923c4f73f21cfd714d12a2d48de8c21f11cfe upstream.

proc_get_long() is a funny function.  It uses simple_strtoul() and for a
good reason.  proc_get_long() wants to always succeed the parse and
return the maybe incorrect value and the trailing characters to check
against a pre-defined list of acceptable trailing values.  However,
simple_strtoul() explicitly ignores overflows which can cause funny
things like the following to happen:

  echo 18446744073709551616 > /proc/sys/fs/file-max
  cat /proc/sys/fs/file-max
  0

(Which will cause your system to silently die behind your back.)

On the other hand kstrtoul() does do overflow detection but does not
return the trailing characters, and also fails the parse when anything
other than '\n' is a trailing character whereas proc_get_long() wants to
be more lenient.

Now, before adding another kstrtoul() function let's simply add a static
parse strtoul_lenient() which:
 - fails on overflow with -ERANGE
 - returns the trailing characters to the caller

The reason why we should fail on ERANGE is that we already do a partial
fail on overflow right now.  Namely, when the TMPBUFLEN is exceeded.  So
we already reject values such as 184467440737095516160 (21 chars) but
accept values such as 18446744073709551616 (20 chars) but both are
overflows.  So we should just always reject 64bit overflows and not
special-case this based on the number of chars.

Link: http://lkml.kernel.org/r/20190107222700.15954-2-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

662ef34f
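
The two properties 662ef34f asks of strtoul_lenient() (fail on overflow with -ERANGE, hand the trailing characters back to the caller) have a close userspace analogue in the C library's strtoull(). The wrapper below is a sketch under that assumption and is not the kernel function; its name and signature are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse an unsigned decimal number, tolerating trailing characters.
 * Returns 0 on success or -ERANGE on overflow; *endp points past the digits.
 */
int strtoul_lenient_sketch(const char *s, char **endp,
                           unsigned long long *res)
{
    unsigned long long v;

    errno = 0;
    v = strtoull(s, endp, 10);
    if (errno == ERANGE)
        return -ERANGE;   /* reject instead of silently wrapping */
    *res = v;
    return 0;
}
```

The 20-character value 18446744073709551616 is rejected, while an input such as "1024k" parses as 1024 and leaves `*endp` pointing at the 'k' for the caller to validate.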

sched: Avoid scale real weight down to zero · 9b83fd88

Authored by Michael Wang on Mar 27, 2020

fix #26198889

commit 26cf52229efc87e2effa9d788f9b33c40fb3358a linux-next

During our testing, we found a case where shares no longer
work correctly. The cgroup topology is like:

  /sys/fs/cgroup/cpu/A		(shares=102400)
  /sys/fs/cgroup/cpu/A/B	(shares=2)
  /sys/fs/cgroup/cpu/A/B/C	(shares=1024)

  /sys/fs/cgroup/cpu/D		(shares=1024)
  /sys/fs/cgroup/cpu/D/E	(shares=1024)
  /sys/fs/cgroup/cpu/D/E/F	(shares=1024)

The same benchmark is running in groups C & F, no other tasks are
running, and the benchmark is capable of consuming all the CPUs.

We expected group C to win more CPU resources since it could
enjoy all the shares of group A, but it's F who wins much more.

The reason is that we have group B with shares of 2. Since
A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus,
A->cfs_rq.load.weight becomes very small.

And in calc_group_shares() we calculate shares as:

  load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
  shares = (tg_shares * load) / tg_weight;

Since 'cfs_rq->load.weight' is too small, the load becomes 0
after scaling down. Although 'tg_shares' is 102400, the shares of the se
which stands for group A on the root cfs_rq become 2.

Meanwhile the se of D on the root cfs_rq is far bigger than 2, so it
wins the battle.

Thus when scale_load_down() scales the real weight down to 0, it's no
longer telling the real story: the caller will have the wrong
information and the calculation will be buggy.

This patch adds a check in scale_load_down(), so the real weight will
be >= MIN_SHARES after scaling. After applying it, group C wins as
expected.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>

9b83fd88
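
The clamping that 9b83fd88 describes can be sketched as two small functions. The shift of 10 and the minimum of 2 are assumptions made for the illustration, and the function names are hypothetical; the kernel expresses this as a macro.

```c
#define FIXEDPOINT_SHIFT_SKETCH 10   /* assumed fixed-point shift */
#define MIN_SHARES_SKETCH 2UL        /* assumed minimum weight after scaling */

/* Old behaviour: small non-zero weights collapse to 0. */
unsigned long scale_load_down_old_sketch(unsigned long w)
{
    return w >> FIXEDPOINT_SHIFT_SKETCH;
}

/* New behaviour: a non-zero weight never scales below the minimum. */
unsigned long scale_load_down_new_sketch(unsigned long w)
{
    unsigned long scaled;

    if (w == 0)
        return 0;
    scaled = w >> FIXEDPOINT_SHIFT_SKETCH;
    return scaled > MIN_SHARES_SKETCH ? scaled : MIN_SHARES_SKETCH;
}
```

Under these assumptions a weight of 512 scales to 0 in the old version and to 2 in the new one, while a true weight of 0 stays 0 and large weights are unaffected.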

sched/fair: Fix race between runtime distribution and assignment · 70a23044

Committed by Huaixin Chang on Mar 24, 2020

fix #25892693

commit 26a8b12747c975b33b4a82d62e4a307e1c07f31b upstream

Currently, there is a potential race between distribute_cfs_runtime()
and assign_cfs_rq_runtime(). The race happens when cfs_b->runtime is
read and runtime is distributed without holding the lock, and after
distribution there turns out to be not enough runtime to charge
against. This is because assign_cfs_rq_runtime() might be called during
distribution and use cfs_b->runtime at the same time.

Fibtest is the tool used to test this race. Assume all gcfs_rqs are
throttled and the cfs period timer runs. Slow threads might run and
sleep, returning unused cfs_rq runtime and keeping min_cfs_rq_runtime
in their local pool. If all this happens sufficiently quickly,
cfs_b->runtime will drop a lot. If the runtime distributed is large
too, over-use of runtime happens.

Runtime over-use of about 70 percent of quota is seen when we run
fibtest on a 96-core machine. We run fibtest with 1 fast thread and
95 slow threads in the test group, configure a 10ms quota for this
group, and see that the CPU usage of fibtest is 17.0%, which is far
more than the expected 10%.

On a smaller machine with 32 cores, we also run fibtest with 96
threads. CPU usage is more than 12%, which is also more than the
expected 10%. This shows that on similar workloads, this race does
affect CPU bandwidth control.

Solve this by holding the lock inside distribute_cfs_runtime().

Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>

70a23044
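
The over-use described in the commit above can be modelled in a few
lines of single-threaded C. This is only a sketch of the reasoning, not
the kernel implementation: struct runtime_pool and all function names
are hypothetical, there is no real lock, and the concurrent
assign_cfs_rq_runtime() call is modelled as a step that runs between
the snapshot and the distribution.

```c
/* Hypothetical stand-in for the shared bandwidth pool (cfs_b->runtime). */
struct runtime_pool {
	long long runtime;
};

/* Take up to `want` from the live pool and return what was granted. */
static inline long long take_runtime(struct runtime_pool *p, long long want)
{
	long long got = want < p->runtime ? want : p->runtime;

	p->runtime -= got;
	return got;
}

/*
 * Racy model: the pool is read once, a concurrent taker runs in the
 * unlocked window, and distribution still trusts the stale snapshot.
 */
static inline long long distribute_from_snapshot(struct runtime_pool *p,
						 long long need,
						 long long concurrent_want,
						 long long *assigned)
{
	long long snapshot = p->runtime;
	long long handed;

	*assigned = take_runtime(p, concurrent_want);
	handed = need < snapshot ? need : snapshot;
	/* Settle up afterwards; the pool cannot go below zero. */
	p->runtime -= handed < p->runtime ? handed : p->runtime;
	return handed;
}

/* Fixed model: every grant is charged against the live pool. */
static inline long long distribute_charging_live_pool(struct runtime_pool *p,
						      long long need,
						      long long concurrent_want,
						      long long *assigned)
{
	*assigned = take_runtime(p, concurrent_want);
	return take_runtime(p, need);
}
```

With a pool of 10 units, a distribution need of 10 and a concurrent
request of 5, the snapshot model hands out 15 units in total, while
charging against the live pool hands out exactly 10. The numbers are
illustrative and are not meant to reproduce the fibtest measurements.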

alinux: cgroup: Fix task_css_check rcu warnings · 798cfa76

Committed by Xunlei Pang on Mar 23, 2020

to #26424323

task_css() should be called under RCU protection; fix several callers.

Fixes: 1f49a738 ("alinux: psi: Support PSI under cgroup v1")
Acked-by: Michael Wang <yun.wany@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>

798cfa76

17 Apr, 2020 1 commit

alinux: kernel: reap zombie process by specified pid · ac2b5c94

Committed by zhongjiang-ali on Feb 26, 2020

to #26788859

We've hit several real-world issues where the child reaper
(i.e. systemd) gets stuck in some aborted state and cannot
reap its zombie children, so we provide an interface to reap
a zombie process by a specified pid.

Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>

ac2b5c94

18 Mar, 2020 9 commits

bpf/sockmap: Read psock ingress_msg before sk_receive_queue · 350f8ab8

Committed by Lingpeng Chen on Mar 06, 2020

commit e7a5f1f1cd0008e5ad379270a8657e121eedb669 upstream

Right now in tcp_bpf_recvmsg, the sock reads data from
sk_receive_queue first if it is not empty, and from psock->ingress_msg
otherwise. If a FIN packet arrives and there is also some data in
psock->ingress_msg, the data in psock->ingress_msg will be purged. This
always happens when sending a request to an HTTP/1.0 server such as
Python's SimpleHTTPServer, since the server sends a FIN packet after
the data is sent out.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: Arika Chen <eaglesora@gmail.com>
Suggested-by: Arika Chen <eaglesora@gmail.com>
Signed-off-by: Lingpeng Chen <forrest0579@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200109014833.18951-1-forrest0579@gmail.com
[tonylu: patch modified to match BIG rework between v4.19 and upstream]
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

350f8ab8
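
The change in read order described in the commit above can be pictured
with a toy decision function. This is a sketch under assumptions, not
the code of tcp_bpf_recvmsg: the enum, the function names, and the
boolean inputs are hypothetical stand-ins for "sk_receive_queue is not
empty" and "psock->ingress_msg is not empty".

```c
/* Hypothetical labels for where the next read is served from. */
enum recv_source {
	RECV_NONE,
	RECV_INGRESS_MSG,
	RECV_SK_RECEIVE_QUEUE,
};

/* Order before the fix: sk_receive_queue wins whenever it has data. */
static inline enum recv_source pick_receive_queue_first(int rq_has_data,
							int ingress_has_data)
{
	if (rq_has_data)
		return RECV_SK_RECEIVE_QUEUE;
	if (ingress_has_data)
		return RECV_INGRESS_MSG;
	return RECV_NONE;
}

/* Order after the fix: pending psock->ingress_msg data is read first. */
static inline enum recv_source pick_ingress_msg_first(int rq_has_data,
						      int ingress_has_data)
{
	if (ingress_has_data)
		return RECV_INGRESS_MSG;
	if (rq_has_data)
		return RECV_SK_RECEIVE_QUEUE;
	return RECV_NONE;
}
```

The two functions differ only when both queues hold data, which is the
FIN-after-data case from the commit message: the first order skips the
pending ingress_msg data, the second one reads it first.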

alinux: mm, memcg: account number of processes in the css · 2061acd6

Committed by Xu Yu on Mar 13, 2020

Since commit e0205ae40f12 ("mm: memcontrol: use CSS_TASK_ITER_PROCS at
mem_cgroup_scan_tasks()") made mem_cgroup_scan_tasks() check only one
thread from each thread group, we can make cgroup_subsys_state::nr_tasks
record only the thread group leader, i.e., the process, instead of
thread(s). Furthermore, this renames cgroup_subsys_state::nr_tasks to
cgroup_subsys_state::nr_procs.

Fixes: f061cd88 ("alinux: kernel: cgroup: account number of tasks in
the css and its descendants")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

2061acd6

alinux: mm: add proc interface to control context readahead · 2e38a0f2

Committed by Xiaoguang Wang on Mar 09, 2020

For some workloads whose I/O activity is mostly random, the context
readahead feature can introduce unnecessary read operations, which
hurt application performance. Context readahead's algorithm is
straightforward and not that smart.

This patch adds "/proc/sys/vm/enable_context_readahead" to control
whether this feature is enabled. Context readahead is currently
enabled by default; users can echo 0 to
/proc/sys/vm/enable_context_readahead to disable it.

We also tested mongodb's performance in the 'random point select' case.
With context readahead enabled:
  mongodb eps 12409
With context readahead disabled:
  mongodb eps 14443
This is about a 16% performance improvement.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

2e38a0f2

alinux: kernel: cgroup: account number of tasks in the css and its descendants · 1e91d392

Committed by Wenwei Tao on Aug 26, 2019

Account the number of tasks in the css and its descendants. This
prepares for the upcoming memcg priority patch.

In memcg priority oom, we select a victim cgroup that has victim
tasks in it. We need to know whether the memcg and its descendants
have tasks before the selection can move on.

Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

1e91d392

io-wq: small threadpool implementation for io_uring · 8a308e54

Committed by Jens Axboe on Oct 22, 2019

commit 771b53d033e8663abdf59704806aa856b236dcdb upstream.

This adds support for io-wq, a smaller and specialized thread pool
implementation. This is meant to replace workqueues for io_uring. Among
the reasons for this addition are:

- We can assign memory context smarter and more persistently if we
  manage the life time of threads.

- We can drop various work-arounds we have in io_uring, like the
  async_list.

- We can implement hashed work insertion, to manage concurrency of
  buffered writes without needing a) an extra workqueue, or b)
  needlessly making the concurrency of said workqueue very low
  which hurts performance of multiple buffered file writers.

- We can implement cancel through signals, for cancelling
  interruptible work like read/write (or send/recv) to/from sockets.

- We need the above cancel for being able to assign and use file tables
  from a process.

- We can implement a more thorough cancel operation in general.

- We need it to move towards a syslet/threadlet model for even faster
  async execution. For that we need to take ownership of the used
  threads.

This list is just off the top of my head. Performance should be the
same, or better, at least that's what I've seen in my testing. io-wq
supports basic NUMA functionality, setting up a pool per node.

io-wq hooks up to the scheduler schedule in/out just like workqueue
and uses that to drive the need for more/less workers.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Joseph: Cherry-pick allow_kernel_signal() from upstream commit 33da8e7c814f]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

8a308e54

sched/core, workqueues: Distangle worker accounting from rq lock · 143495ca

Committed by Thomas Gleixner on Mar 13, 2019

commit 6d25be5782e482eb93e3de0c94d0a517879377d0 upstream.

The worker accounting for CPU bound workers is plugged into the core
scheduler code and the wakeup code. This is not a hard requirement and
can be avoided by keeping track of the state in the workqueue code
itself.

Keep track of the sleeping state in the worker itself and call the
notifier before entering the core scheduler. There might be false
positives when the task is woken between that call and actually
scheduling, but that's not really different from scheduling and being
woken immediately after switching away. When nr_running is updated when
the task is returning from schedule() then it is later compared when it
is done from ttwu().

[ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

143495ca

signal: simplify set_user_sigmask/restore_user_sigmask · f12f9562

Committed by Oleg Nesterov on Jul 16, 2019

commit b772434be0891ed1081a08ae7cfd4666728f8e82 upstream.

task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
syscall paths.  This means that set_user_sigmask() can save ->blocked in
->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
was modified.

This way the callers do not need 2 sigset_t's passed to set/restore and
restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
into the trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

f12f9562

signal: remove the wrong signal_pending() check in restore_user_sigmask() · a48e4674

Committed by Oleg Nesterov on Jun 28, 2019

commit 97abc889ee296faf95ca0e978340fb7b942a3e32 upstream.

This is the minimal fix for stable, I'll send cleanups later.

Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
the visible change which breaks user-space: a signal temporarily unblocked
by set_user_sigmask() can be delivered even if the caller returns
success or timeout.

Change restore_user_sigmask() to accept the additional "interrupted"
argument which should be used instead of signal_pending() check, and
update the callers.

Eric said:

: For clarity.  I don't think this is required by posix, or fundamentally to
: remove the races in select.  It is what linux has always done and we have
: applications who care so I agree this fix is needed.
:
: Further in any case where the semantic change that this patch rolls back
: (aka where allowing a signal to be delivered and the select like call to
: complete) would be advantage we can do as well if not better by using
: signalfd.
:
: Michael is there any chance we can get this guarantee of the linux
: implementation of pselect and friends clearly documented.  The guarantee
: that if the system call completes successfully we are guaranteed that no
: signal that is unblocked by using sigmask will be delivered?

Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Eric Wong <e@80x24.org>
Tested-by: Eric Wong <e@80x24.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org>	[5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

a48e4674

module/ftrace: handle patchable-function-entry · 0f596d7d

Committed by Mark Rutland on Oct 16, 2019

backport from a1326b17ac03a9012cb3d01e434aacb4d67a416c upstream

When using patchable-function-entry, the compiler will record the
callsites into a section named "__patchable_function_entries" rather
than "__mcount_loc". Let's abstract this difference behind a new
FTRACE_CALLSITE_SECTION, so that architectures don't have to handle this
explicitly (e.g. with custom module linker scripts).

As parisc currently handles this explicitly, it is fixed up accordingly,
with its custom linker script removed. Since FTRACE_CALLSITE_SECTION is
only defined when DYNAMIC_FTRACE is selected, the parisc module loading
code is updated to only use the definition in that case. When
DYNAMIC_FTRACE is not selected, modules shouldn't have this section, so
this removes some redundant work in that case.

To make sure that this is kept up-to-date for modules and the main
kernel, a comment is added to vmlinux.lds.h, with the existing ifdeffery
simplified for legibility.

I built parisc generic-{32,64}bit_defconfig with DYNAMIC_FTRACE enabled,
and verified that the section made it into the .ko files for modules.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Helge Deller <deller@gmx.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Torsten Duwe <duwe@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>

0f596d7d
