提交 · 6df59de040e14a055f1275de66a76eaeb1e95c2d · openeuler / Kernel

24 9月, 2022 2 次提交

mm: reliable: Fix ret errno to EACCES · 6df59de0

由 Ma Wupeng 提交于 9月 24, 2022

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S
CVE: NA

--------------------------------

Commit bc25b308 ("mm: reliable: Use EINVAL in reliable_check") update
EPERM to EINVAL try to pass LTP's proc01 test, however LTP only treat
EACCESS as whitelist in the scenario.

To solve this problem, update EINVAL to EACCESS.

Fixes: bc25b308 ("mm: reliable: Use EINVAL in reliable_check")
Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
Reviewed-by: NChen Wandun <chenwandun@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

6df59de0

mm: reliable: Use EINVAL in reliable_check · bc25b308

由 Ma Wupeng 提交于 9月 24, 2022

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S
CVE: NA

--------------------------------

LTP's proc01 test was failing be because this ret code (1):

	proc01      1  TFAIL  :  proc01.c:400: read failed: /proc/self/task/1406366/reliable: errno=EPERM(1): Operation not permitted

To slove this problem, replace EPERM with EINVAL in reliable_check().
Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

bc25b308

22 8月, 2022 1 次提交

sched: programmable: Add user interface of task tag · d5aa2997

由 Chen Hui 提交于 8月 22, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5KUFB
CVE: NA

--------------------------------

Add user interface of task tag, bridges the information
gap between user-mode and kernel mode.

To do:
  /proc/${pid}/task/${pid}/tag
Signed-off-by: NChen Hui <judy.chenhui@huawei.com>
Signed-off-by: NRen Zhijie <renzhijie2@huawei.com>

d5aa2997

07 6月, 2022 1 次提交

proc: Fix a dentry lock race between release_task and lookup · ad8b5995

由 Zhihao Cheng 提交于 6月 07, 2022

hulk inclusion
category: bugfix
bugzilla: 186846, https://gitee.com/openeuler/kernel/issues/I5A80Q

--------------------------------

Commit 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
moved proc_flush_task() behind __exit_signal(). Then, process systemd
can take long period high cpu usage during releasing task in following
concurrent processes:

  systemd                                 ps
kernel_waitid                 stat(/proc/tgid)
  do_wait                       filename_lookup
    wait_consider_task            lookup_fast
      release_task
        __exit_signal
          __unhash_process
            detach_pid
              __change_pid // remove task->pid_links
                                     d_revalidate -> pid_revalidate  // 0
                                     d_invalidate(/proc/tgid)
                                       shrink_dcache_parent(/proc/tgid)
                                         d_walk(/proc/tgid)
                                           spin_lock_nested(/proc/tgid/fd)
                                           // iterating opened fd
        proc_flush_pid                                    |
           d_invalidate (/proc/tgid/fd)                   |
              shrink_dcache_parent(/proc/tgid/fd)         |
                shrink_dentry_list(subdirs)               ↓
                  shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock

Function d_invalidate() will remove dentry from hash firstly, but why does
proc_flush_pid() process dentry '/proc/tgid/fd' before dentry '/proc/tgid'?
That's because proc_pid_make_inode() adds proc inode in reverse order by
invoking hlist_add_head_rcu(). But proc should not add any inodes under
'/proc/tgid' except '/proc/tgid/task/pid', fix it by adding inode into
'pid->inodes' only if the inode is /proc/tgid or /proc/tgid/task/pid.

Performance regression:
Create 200 tasks, each task open one file for 50,000 times. Kill all
tasks when opened files exceed 10,000,000 (cat /proc/sys/fs/file-nr).

Before fix:
$ time killall -wq aa
  real    4m40.946s   # During this period, we can see 'ps' and 'systemd'
			taking high cpu usage.

After fix:
$ time killall -wq aa
  real    1m20.732s   # During this period, we can see 'systemd' taking
			high cpu usage.

Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: NZhang Yi <yi.zhang@huawei.com>
Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

ad8b5995

23 2月, 2022 1 次提交

mm: Introduce reliable flag for user task · 8ee6e050

由 Peng Wu 提交于 2月 23, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
CVE: NA

------------------------------------------

Adding reliable flag for user task. User task with reliable flag can
only alloc memory from mirrored region. PF_RELIABLE is added to represent
the task's reliable flag.

- For init task, which is regarded as special task which alloc memory
  from mirrored region.

- For normal user tasks, The reliable flag can be set via procfs interface
  shown as below and can be inherited via fork().

User can change a user task's reliable flag by

	$ echo [0/1] > /proc/<pid>/reliable

and check a user task's reliable flag by

	$ cat /proc/<pid>/reliable

Note, global init task's reliable file can not be accessed.
Signed-off-by: NPeng Wu <wupeng58@huawei.com>
Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

8ee6e050

30 12月, 2021 1 次提交

share_pool: Add proc interfaces to show sp info · d83dcc99

由 Wang Wensheng 提交于 12月 30, 2021

ascend inclusion
category: Feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4NDAW
CVE: NA

-------------------

1. /proc/sharepool/* those interfaces show the system-wide processes
   that are in the sharepool group and all the groups.
2. /proc/<pid>/sp_group expose the per-task sp_group state value.
Signed-off-by: NWang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: NTang Yizhou <tangyizhou@huawei.com>
Reviewed-by: Kefeng Wang<wangkefeng.wang@huawei.com>
Reviewed-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

d83dcc99

15 10月, 2021 1 次提交

proc: Avoid mixing integer types in mem_rw() · 905e07a8

由 Marcelo Henrique Cerri 提交于 10月 14, 2021

stable inclusion
from stable-5.10.54
commit fc6ac92cfcab3269a4398f185f8ac460dd359b86
bugzilla: 175586 https://gitee.com/openeuler/kernel/issues/I4DVDU

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=fc6ac92cfcab3269a4398f185f8ac460dd359b86

--------------------------------

[ Upstream commit d238692b ]

Use size_t when capping the count argument received by mem_rw(). Since
count is size_t, using min_t(int, ...) can lead to a negative value
that will later be passed to access_remote_vm(), which can cause
unexpected behavior.

Since we are capping the value to at maximum PAGE_SIZE, the conversion
from size_t to int when passing it to access_remote_vm() as "len"
shouldn't be a problem.

Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.comReviewed-by: NDavid Disseldorp <ddiss@suse.de>
Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: NMarcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Souza Cascardo <cascardo@canonical.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

905e07a8

06 7月, 2021 1 次提交

seccomp/cache: Report cache data through /proc/pid/seccomp_cache · 39be1ac0

由 YiFei Zhu 提交于 6月 30, 2021

stable inclusion
from stable-5.11-rc1
commit 0d8315dd
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0d8315dddd2899f519fe1ca3d4d5cdaf44ea421e

-------------------------------------------------

Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and filter means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]

This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.
Suggested-by: NJann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/Signed-off-by: NYiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/94e663fa53136f5a11f432c661794d1ee7060779.1605101222.git.yifeifz2@illinois.eduSigned-off-by: NGONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

39be1ac0

03 7月, 2021 2 次提交

proc: only require mm_struct for writing · 2b61fd82

由 Linus Torvalds 提交于 6月 18, 2021

stable inclusion
from stable-5.10.44
commit ef9a0d224bafc0f4f8f85d0eb69fc59a6fbd1318
bugzilla: 109295
CVE: NA

--------------------------------

commit 94f0b2d4 upstream.

Commit 591a22c1 ("proc: Track /proc/$pid/attr/ opener mm_struct") we
started using __mem_open() to track the mm_struct at open-time, so that
we could then check it for writes.

But that also ended up making the permission checks at open time much
stricter - and not just for writes, but for reads too.  And that in turn
caused a regression for at least Fedora 29, where NIC interfaces fail to
start when using NetworkManager.

Since only the write side wanted the mm_struct test, ignore any failures
by __mem_open() at open time, leaving reads unaffected.  The write()
time verification of the mm_struct pointer will then catch the failure
case because a NULL pointer will not match a valid 'current->mm'.

Link: https://lore.kernel.org/netdev/YMjTlp2FSJYvoyFa@unreal/
Fixes: 591a22c1 ("proc: Track /proc/$pid/attr/ opener mm_struct")
Reported-and-tested-by: NLeon Romanovsky <leon@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

2b61fd82

proc: Track /proc/$pid/attr/ opener mm_struct · ce16dc8c

由 Kees Cook 提交于 6月 18, 2021

stable inclusion
from stable-5.10.44
commit f70102cb369cde6ab7551ca58152d00fd3478fec
bugzilla: 109295
CVE: NA

--------------------------------

commit 591a22c1 upstream.

Commit bfb819ea ("proc: Check /proc/$pid/attr/ writes against file opener")
tried to make sure that there could not be a confusion between the opener of
a /proc/$pid/attr/ file and the writer. It used struct cred to make sure
the privileges didn't change. However, there were existing cases where a more
privileged thread was passing the opened fd to a differently privileged thread
(during container setup). Instead, use mm_struct to track whether the opener
and writer are still the same process. (This is what several other proc files
already do, though for different reasons.)
Reported-by: NChristian Brauner <christian.brauner@ubuntu.com>
Reported-by: NAndrea Righi <andrea.righi@canonical.com>
Tested-by: NAndrea Righi <andrea.righi@canonical.com>
Fixes: bfb819ea ("proc: Check /proc/$pid/attr/ writes against file opener")
Cc: stable@vger.kernel.org
Signed-off-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

ce16dc8c

15 6月, 2021 1 次提交

proc: Check /proc/$pid/attr/ writes against file opener · 30d6fe82

由 Kees Cook 提交于 6月 07, 2021

stable inclusion
from stable-5.10.42
commit fb003a1bd60358c0ccee0145079de258a6cf0ba8
bugzilla: 55093
CVE: NA

--------------------------------

commit bfb819ea upstream.

Fix another "confused deputy" weakness[1]. Writes to /proc/$pid/attr/
files need to check the opener credentials, since these fds do not
transition state across execve(). Without this, it is possible to
trick another process (which may have different credentials) to write
to its own /proc/$pid/attr/ files, leading to unexpected and possibly
exploitable behaviors.

[1] https://www.kernel.org/doc/html/latest/security/credentials.html?highlight=confused#open-file-credentials

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NWeilong Chen <chenweilong@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

30d6fe82

14 4月, 2021 2 次提交

memig: add memig-swap feature to openEuler · 8a655676

由 liubo 提交于 2月 24, 2021

euleros inclusion
category: feature
feature: add memig swap feature patch to openEuler kernel
bugzilla: 48246

-------------------------------------------------

reason:This patch is used to add memig swap feature to openEuler system.
memig_swap.ko is used to transfer the address
passed in the user state for page migration
Signed-off-by: Nyanxiaodan <yanxiaodan@huawei.com>
Signed-off-by: Nlinmiaohe <linmiaohe@huawei.com>
Signed-off-by: Nlouhongxiang <louhongxiang@huawei.com>
Signed-off-by: Nliubo <liubo254@huawei.com>
Signed-off-by: Ngeruijun <geruijun@huawei.com>
Signed-off-by: Nliangchenshu <liangchenshu@huawei.com>
Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

8a655676

memig: add memig-scan feature to openEuler · c13e5b6a

由 liubo 提交于 2月 24, 2021

euleros inclusion
category: feature
feature: add memig scan feature patch to openEuler kernel
bugzilla: 48246

-------------------------------------------------

reason:This patch is used to add memig scan feature to openEuler system.
memig_scan.ko is used to scan the virtual address of the target process
and return the address access information to
the user mode for grading cold and hot pages.
Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Nyanxiaodan <yanxiaodan@huawei.com>
Signed-off-by: NFeilong Lin <linfeilong@huawei.com>
Signed-off-by: Ngeruijun <geruijun@huawei.com>
Signed-off-by: Nliubo <liubo254@huawei.com>
Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

c13e5b6a

29 1月, 2021 1 次提交

proc: fix ubsan warning in mem_lseek · 1bb26e86

由 yangerkun 提交于 1月 20, 2021

hulk inclusion
category: bugfix
bugzilla: 47438
CVE: NA
---------------------------

UBSAN has reported a overflow with mem_lseek. And it's fine with
mem_open set file mode with FMODE_UNSIGNED_OFFSET(memory_lseek).
However, another file use mem_lseek do lseek can have not
FMODE_UNSIGNED_OFFSET(proc_kpagecount_operations/proc_pagemap_operations),
fix it by checking overflow and FMODE_UNSIGNED_OFFSET.
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>

==================================================================
UBSAN: Undefined behaviour in ../fs/proc/base.c:941:15
signed integer overflow:
4611686018427387904 + 4611686018427387904 cannot be represented in type 'long long int'
CPU: 4 PID: 4762 Comm: syz-executor.1 Not tainted 4.4.189 #3
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
[<ffffff90080a5f28>] dump_backtrace+0x0/0x590 arch/arm64/kernel/traps.c:91
[<ffffff90080a64f0>] show_stack+0x38/0x60 arch/arm64/kernel/traps.c:234
[<ffffff9008986a34>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffff9008986a34>] dump_stack+0x128/0x184 lib/dump_stack.c:51
[<ffffff9008a2d120>] ubsan_epilogue+0x34/0x9c lib/ubsan.c:166
[<ffffff9008a2d8b8>] handle_overflow+0x228/0x280 lib/ubsan.c:197
[<ffffff9008a2da2c>] __ubsan_handle_add_overflow+0x4c/0x68 lib/ubsan.c:204
[<ffffff900862b9f4>] mem_lseek+0x12c/0x130 fs/proc/base.c:941
[<ffffff90084ef78c>] vfs_llseek fs/read_write.c:260 [inline]
[<ffffff90084ef78c>] SYSC_lseek fs/read_write.c:285 [inline]
[<ffffff90084ef78c>] SyS_lseek+0x164/0x1f0 fs/read_write.c:276
[<ffffff9008093c80>] el0_svc_naked+0x30/0x34
==================================================================
Signed-off-by: Nyangerkun <yangerkun@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
(cherry picked from commit a422358aa04c53a08b215b8dcd6814d916ef5cf1)

Conflicts:
	fs/read_write.c
Signed-off-by: NLi Ming <limingming.li@huawei.com>
Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>

1bb26e86

27 1月, 2021 1 次提交

exec: Transform exec_update_mutex into a rw_semaphore · ab84d659

由 Eric W. Biederman 提交于 1月 18, 2021

stable inclusion
from stable-5.10.6
commit ab7709b551de24e7bebf44946120e6740b1e28db
bugzilla: 47418

--------------------------------

[ Upstream commit f7cfd871 ]

Recently syzbot reported[0] that there is a deadlock amongst the users
of exec_update_mutex.  The problematic lock ordering found by lockdep
was:

   perf_event_open  (exec_update_mutex -> ovl_i_mutex)
   chown            (ovl_i_mutex       -> sb_writes)
   sendfile         (sb_writes         -> p->lock)
     by reading from a proc file and writing to overlayfs
   proc_pid_syscall (p->lock           -> exec_update_mutex)

While looking at possible solutions it occured to me that all of the
users and possible users involved only wanted to state of the given
process to remain the same.  They are all readers.  The only writer is
exec.

There is no reason for readers to block on each other.  So fix
this deadlock by transforming exec_update_mutex into a rw_semaphore
named exec_update_lock that only exec takes for writing.

Cc: Jann Horn <jannh@google.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christopher Yeoh <cyeoh@au1.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Fixes: eea96732 ("exec: Add exec_update_mutex to replace cred_guard_mutex")
[0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.orgSigned-off-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NChen Jun <chenjun102@huawei.com>
Acked-by: NXie XiuQi <xiexiuqi@huawei.com>

ab84d659

03 11月, 2020 1 次提交

mm, oom: keep oom_adj under or at upper limit when printing · 66606567

由 Charles Haithcock 提交于 11月 01, 2020

For oom_score_adj values in the range [942,999], the current
calculations will print 16 for oom_adj. This patch simply limits the
output so output is inline with docs.
Signed-off-by: NCharles Haithcock <chaithco@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Link: https://lkml.kernel.org/r/20201020165130.33927-1-chaithco@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

66606567

17 10月, 2020 1 次提交

io-wq: inherit audit loginuid and sessionid · 4ea33a97

由 Jens Axboe 提交于 10月 15, 2020

Make sure the async io-wq workers inherit the loginuid and sessionid from
the original task, and restore them to unset once we're done with the
async work item.

While at it, disable the ability for kernel threads to write to their own
loginuid.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4ea33a97

14 10月, 2020 1 次提交

mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary · 67197a4f

由 Suren Baghdasaryan 提交于 10月 13, 2020

Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm.  This is done for any task with more that one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.

Android updates oom_score_adj whenever a tasks changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important.  Such operation can happen frequently.  We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression.  Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.

Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM && !CLONE_THREAD && !CLONE_VFORK).  Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes.  To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex.  Its scope is changed to global.

The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
the case of vfork().  To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare.  Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.

With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.

[surenb@google.com: v3]
  Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com

Fixes: 44a70ade ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: NTim Murray <timmurray@google.com>
Suggested-by: NMichal Hocko <mhocko@kernel.org>
Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NOleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: Adrian Reber <areber@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.comDebugged-by: NMinchan Kim <minchan@kernel.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

67197a4f

13 8月, 2020 1 次提交

mm, oom: make the calculation of oom badness more accurate · 9066e5cf

由 Yafang Shao 提交于 8月 11, 2020

Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process.  Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page.  That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1.  Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value.  As a result, the first scanned
process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.

To fix this issue, we should make the calculation of oom point more
accurate.  We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
  Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.comSigned-off-by: NYafang Shao <laoar.shao@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Tested-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9066e5cf

20 7月, 2020 1 次提交

proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE · 12886f8a

由 Adrian Reber 提交于 7月 19, 2020

Opening files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
checkpointing and restoring to recover files that are unreachable via
the file system such as deleted files, or memfd files.
Signed-off-by: NAdrian Reber <areber@redhat.com>
Signed-off-by: NNicolas Viennot <Nicolas.Viennot@twosigma.com>
Reviewed-by: NCyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: NSerge Hallyn <serge@hallyn.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-5-areber@redhat.comSigned-off-by: NChristian Brauner <christian.brauner@ubuntu.com>

12886f8a

10 6月, 2020 3 次提交

mmap locking API: convert mmap_sem comments · c1e8d7c6

由 Michel Lespinasse 提交于 6月 08, 2020

Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]
Signed-off-by: NMichel Lespinasse <walken@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c1e8d7c6

mmap locking API: convert mmap_sem call sites missed by coccinelle · 89154dd5

由 Michel Lespinasse 提交于 6月 08, 2020

Convert the last few remaining mmap_sem rwsem calls to use the new mmap
locking API.  These were missed by coccinelle for some reason (I think
coccinelle does not support some of the preprocessor constructs in these
files ?)

[akpm@linux-foundation.org: convert linux-next leftovers]
[akpm@linux-foundation.org: more linux-next leftovers]
[akpm@linux-foundation.org: more linux-next leftovers]
Signed-off-by: NMichel Lespinasse <walken@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: NLaurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-6-walken@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

89154dd5

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites · d8ed45c5

由 Michel Lespinasse 提交于 6月 08, 2020

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)
Signed-off-by: NMichel Lespinasse <walken@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: NLaurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d8ed45c5

19 5月, 2020 1 次提交

proc: proc_pid_ns takes super_block as an argument · 9d78edea

由 Alexey Gladkov 提交于 5月 18, 2020

syzbot found that

  touch /proc/testfile

causes NULL pointer dereference at tomoyo_get_local_path()
because inode of the dentry is NULL.

Before c59f415a, Tomoyo received pid_ns from proc's s_fs_info
directly. Since proc_pid_ns() can only work with inode, using it in
the tomoyo_get_local_path() was wrong.

To avoid creating more functions for getting proc_ns, change the
argument type of the proc_pid_ns() function. Then, Tomoyo can use
the existing super_block to get pid_ns.

Link: https://lkml.kernel.org/r/0000000000002f0c7505a5b0e04c@google.com
Link: https://lkml.kernel.org/r/20200518180738.2939611-1-gladkov.alexey@gmail.com
Reported-by: syzbot+c1af344512918c61362c@syzkaller.appspotmail.com
Fixes: c59f415a ("Use proc_pid_ns() to get pid_namespace from the proc superblock")
Signed-off-by: NAlexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>

9d78edea

25 4月, 2020 2 次提交

proc: Use PIDTYPE_TGID in next_tgid · 3147d8aa

由 Eric W. Biederman 提交于 2月 24, 2020

Combine the pid_task and thes test has_group_leader_pid into a single
dereference by using pid_task(PIDTYPE_TGID).

This makes the code simpler and proof against needing to even think
about any shenanigans that de_thread might get up to.
Acked-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

3147d8aa

proc: Put thread_pid in release_task not proc_flush_pid · 6ade99ec

由 Eric W. Biederman 提交于 4月 24, 2020

Oleg pointed out that in the unlikely event the kernel is compiled
with CONFIG_PROC_FS unset that release_task will now leak the pid.

Move the put_pid out of proc_flush_pid into release_task to fix this
and to guarantee I don't make that mistake again.

When possible it makes sense to keep get and put in the same function
so it can easily been seen how they pair up.

Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
Reported-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

6ade99ec

22 4月, 2020 3 次提交

proc: use named enums for better readability · e61bb8b3

由 Alexey Gladkov 提交于 4月 19, 2020

Signed-off-by: NAlexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: NAlexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>

e61bb8b3

proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option · 24a71ce5

由 Alexey Gladkov 提交于 4月 19, 2020

If "hidepid=4" mount option is set then do not instantiate pids that
we can not ptrace. "hidepid=4" means that procfs should only contain
pids that the caller can ptrace.
Signed-off-by: NDjalal Harouni <tixxdz@gmail.com>
Signed-off-by: NAlexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: NAlexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>

24a71ce5

proc: allow to mount many instances of proc in one pid namespace · fa10fed3

由 Alexey Gladkov 提交于 4月 19, 2020

This patch allows to have multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.

1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespacess we have to manages all the
processes inside too, where our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.

2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.

This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specifiy on user sessions if we should populate procfs with
pids that the user can ptrace or not.

By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptracable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usuable way to specifiy which users should see all processes
and which users can not.

Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...

In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.

Selftest has been added to verify new behavior.
Signed-off-by: NAlexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: NAlexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>

fa10fed3

16 4月, 2020 1 次提交

proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets · 94d440d6

由 Andrei Vagin 提交于 4月 11, 2020

Michael Kerrisk suggested to replace numeric clock IDs with symbolic names.

Now the content of these files looks like this:
$ cat /proc/774/timens_offsets
monotonic      864000         0
boottime      1728000         0

For setting offsets, both representations of clocks (numeric and symbolic)
can be used.

As for compatibility, it is acceptable to change things as long as
userspace doesn't care. The format of timens_offsets files is very new and
there are no userspace tools yet which rely on this format.

But three projects crun, util-linux and criu rely on the interface of
setting time offsets and this is why it's required to continue supporting
the numeric clock IDs on write.

Fixes: 04a8682a ("fs/proc: Introduce /proc/pid/timens_offsets")
Suggested-by: NMichael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: NAndrei Vagin <avagin@gmail.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NMichael Kerrisk <mtk.manpages@gmail.com>
Acked-by: NMichael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200411154031.642557-1-avagin@gmail.com

94d440d6

10 4月, 2020 1 次提交

proc: Use a dedicated lock in struct pid · 63f818f4

由 Eric W. Biederman 提交于 4月 07, 2020

syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not tainted
> --------------------------------------------------------
> swapper/1/0 just changed the state of lock:
> ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
> but this lock took another, SOFTIRQ-unsafe lock in the past:
>  (&pid->wait_pidfd){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
>  Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&pid->wait_pidfd);
>                                local_irq_disable();
>                                lock(tasklist_lock);
>                                lock(&pid->wait_pidfd);
>   <Interrupt>
>     lock(tasklist_lock);
>
>  *** DEADLOCK ***
>
> 4 locks held by swapper/1/0:

The problem is that because wait_pidfd.lock is taken under the tasklist
lock.  It must always be taken with irqs disabled as tasklist_lock can be
taken from interrupt context and if wait_pidfd.lock was already taken this
would create a lock order inversion.

Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock.  That should be safe and I think the code will eventually
do that.  It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.

It is also true that my pre-merge window testing was insufficient.  So
remove the premature optimization and give struct pid a dedicated lock of
it's own for struct pid things.  I have verified that lockdep sees all 3
paths where we take the new pid->lock and lockdep does not complain.

It is my current day dream that one day pid->lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals.  That will require taking pid->lock with irqs disabled.
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

63f818f4

25 3月, 2020 2 次提交

proc: io_accounting: Use new infrastructure to fix deadlocks in execve · 76518d37

由 Bernd Edlinger 提交于 3月 20, 2020

This changes do_io_accounting to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the trace is accessing
/proc/$pid/io for instance.

This should be safe, as the credentials are only used for reading.
Signed-off-by: NBernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>

76518d37

proc: Use new infrastructure to fix deadlocks in execve · 2db9dbf7

由 Bernd Edlinger 提交于 3月 20, 2020

This changes lock_trace to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the trace is accessing
/proc/$pid/stack for instance.

This should be safe, as the credentials are only used for reading,
and task->mm is updated on execve under the new exec_update_mutex.
Signed-off-by: NBernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>

2db9dbf7

25 2月, 2020 1 次提交

proc: Use a list of inodes to flush from proc · 7bc3e6e5

由 Eric W. Biederman 提交于 2月 19, 2020

Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid.  Which winds up being
safe (if suboptimal) as this is just an optiimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid.  Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place.  This removes a small race
where a dentry could recreated.

As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

v2: Initialize pid->inodes.  I somehow failed to get that
    initialization into the initial version of the patch.  A boot
    failure was reported by "kernel test robot <lkp@intel.com>", and
    failure to initialize that pid->inodes matches all of the reported
    symptoms.
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>

7bc3e6e5

20 1月, 2020 1 次提交

x86/resctrl: Add task resctrl information display · e79f15a4

由 Chen Yu 提交于 1月 15, 2020

Monitoring tools that want to find out which resctrl control and monitor
groups a task belongs to must currently read the "tasks" file in every
group until they locate the process ID.

Add an additional file /proc/{pid}/cpu_resctrl_groups to provide this
information:

1)   res:
     mon:

resctrl is not available.

2)   res:/
     mon:

Task is part of the root resctrl control group, and it is not associated
to any monitor group.

3)  res:/
    mon:mon0

Task is part of the root resctrl control group and monitor group mon0.

4)  res:group0
    mon:

Task is part of resctrl control group group0, and it is not associated
to any monitor group.

5) res:group0
   mon:mon1

Task is part of resctrl control group group0 and monitor group mon1.
Signed-off-by: NChen Yu <yu.c.chen@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Tested-by: NJinshi Chen <jinshi.chen@intel.com>
Link: https://lkml.kernel.org/r/20200115092851.14761-1-yu.c.chen@intel.com

e79f15a4

19 1月, 2020 1 次提交

apparmor: add proc subdir to attrs · 6413f852

由 John Johansen 提交于 2月 04, 2019

This patch provides a /proc/<pid>/attr/apparmor/
subdirectory. Enabling userspace to use the apparmor attributes
without having to worry about collisions with selinux or smack on
interface files in /proc/<pid>/attr.
Signed-off-by: NJohn Johansen <john.johansen@canonical.com>

6413f852

14 1月, 2020 1 次提交

fs/proc: Introduce /proc/pid/timens_offsets · 04a8682a

由 Andrei Vagin 提交于 11月 12, 2019

API to set time namespace offsets for children processes, i.e.:
echo "$clockid $offset_sec $offset_nsec" > /proc/self/timens_offsets
Co-developed-by: NDmitry Safonov <dima@arista.com>
Signed-off-by: NAndrei Vagin <avagin@gmail.com>
Signed-off-by: NDmitry Safonov <dima@arista.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-28-dima@arista.com

04a8682a

09 12月, 2019 1 次提交

namei: allow nd_jump_link() to produce errors · 1bc82070

由 Aleksa Sarai 提交于 12月 07, 2019

In preparation for LOOKUP_NO_MAGICLINKS, it's necessary to add the
ability for nd_jump_link() to return an error which the corresponding
get_link() caller must propogate back up to the VFS.
Suggested-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAleksa Sarai <cyphar@cyphar.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

1bc82070

17 7月, 2019 2 次提交

/proc/<pid>/cmdline: add back the setproctitle() special case · d26d0cd9

由 Linus Torvalds 提交于 7月 13, 2019

This makes the setproctitle() special case very explicit indeed, and
handles it with a separate helper function entirely.  In the process, it
re-instates the original semantics of simply stopping at the first NUL
character when the original last NUL character is no longer there.

[ The original semantics can still be seen in mm/util.c: get_cmdline()
  that is limited to a fixed-size buffer ]

This makes the logic about when we use the string lengths etc much more
obvious, and makes it easier to see what we do and what the two very
different cases are.

Note that even when we allow walking past the end of the argument array
(because the setproctitle() might have overwritten and overflowed the
original argv[] strings), we only allow it when it overflows into the
environment region if it is immediately adjacent.

[ Fixed for missing 'count' checks noted by Alexey Izbyshev ]

Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/
Fixes: 5ab82718 ("fs/proc: simplify and clarify get_mm_cmdline() function")
Cc: Jakub Jankowski <shasta@toxcorp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Izbyshev <izbyshev@ispras.ru>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d26d0cd9

/proc/<pid>/cmdline: remove all the special cases · 3d712546

由 Linus Torvalds 提交于 7月 13, 2019

Start off with a clean slate that only reads exactly from arg_start to
arg_end, without any oddities. This simplifies the code and in the
process removes the case that caused us to potentially leak an
uninitialized byte from the temporary kernel buffer.

Note that in order to start from scratch with an understandable base,
this simplifies things _too_ much, and removes all the legacy logic to
handle setproctitle() having changed the argument strings.

We'll add back those special cases very differently in the next commit.

Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/
Fixes: f5b65348 ("proc: fix missing final NUL in get_mm_cmdline() rewrite")
Cc: Alexey Izbyshev <izbyshev@ispras.ru>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3d712546

openeuler / Kernel 大约 2 年 前同步成功

openeuler / Kernel
大约 2 年前同步成功