1. 12 Oct 2022, 1 commit
  2. 27 Sep 2022, 1 commit
  3. 09 Sep 2022, 1 commit
  4. 02 Sep 2022, 1 commit
  5. 27 Aug 2022, 1 commit
    • cgroup: Homogenize cgroup_get_from_id() return value · fa7e439c
      Committed by Michal Koutný
      The cgroup ID is a user-provided datum, hence extend the function's
      return domain to include a possible error reason (similar to
      cgroup_get_from_fd()).
      
      This change also fixes commit d4ccaf58 ("bpf: Introduce cgroup
      iter"), which would otherwise use a NULL return where proper error
      handling is needed (see the sketch after this entry).
      
      Additionally, none of the callers of cgroup_get_from_id()
      (fc_appid_store, bpf_iter_attach_cgroup, mem_cgroup_get_from_ino) is
      built without CONFIG_CGROUPS, on which they depend via
      CONFIG_BLK_CGROUP, directly, and via CONFIG_MEMCG respectively, so
      drop the stub definition that is only needed with !CONFIG_CGROUPS.
      
      Fixes: d4ccaf58 ("bpf: Introduce cgroup iter")
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      fa7e439c
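      A minimal, hedged sketch of the new calling convention described
      above (assuming cgroup_get_from_id() now returns ERR_PTR() values
      instead of NULL, as the subject line implies; the concrete call sites
      upstream differ):

        struct cgroup *cgrp;

        cgrp = cgroup_get_from_id(id);
        if (IS_ERR(cgrp))
                return PTR_ERR(cgrp);   /* e.g. -ENOENT for a stale id */
        /* ... use cgrp ... */
        cgroup_put(cgrp);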
  6. 16 Aug 2022, 2 commits
  7. 08 Jun 2022, 1 commit
  8. 12 Mar 2022, 1 commit
  9. 01 Mar 2022, 2 commits
  10. 14 Sep 2021, 1 commit
    • bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode · 8520e224
      Committed by Daniel Borkmann
      Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
      Back in the day, commit bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      embedded per-socket cgroup information into sock->sk_cgrp_data and in order
      to save 8 bytes in struct sock made both mutually exclusive, that is, when
      cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
      falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
      
      The assumption made was "there is no reason to mix the two and this is in line
      with how legacy and v2 compatibility is handled" as stated in bd1060a1.
      However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
      this assumption no longer holds, and the possibility of the v1/v2 mixed mode
      with the v2 root fallback being hit becomes a real security issue.
      
      Many of the cgroup v2 BPF programs are also used for policy
      enforcement, for example to programmatically deny socket-related
      system calls like connect(2) or bind(2). A v2 root fallback would
      implicitly cause a policy bypass for the affected Pods.
      
      In production environments, we have recently seen this case due to
      various circumstances: i) a different third-party agent and/or ii) a
      container runtime such as [0] in the user's environment configuring
      legacy cgroup v1 net_cls tags, which triggered the root fallback
      mentioned above. Another case is Kubernetes projects like kind [1],
      which create Kubernetes nodes in a container and also add cgroup
      namespaces to the mix, meaning programs attached to the cgroup v2 root
      of the cgroup namespace get attached to a non-root cgroup v2 path from
      the init namespace's point of view. That root is out of reach for
      agents on a kind Kubernetes node to configure, so any entity on the
      node that sets a cgroup v1 net_cls tag will trigger the bypass despite
      cgroup v2 BPF programs attached to the namespace root.
      
      Generally, this mutual exclusiveness does not hold anymore in today's user
      environments and makes cgroup v2 usage from BPF side fragile and unreliable.
      This fix adds a proper struct cgroup pointer for the cgroup v2 case to
      struct sock_cgroup_data in order to address these issues (a hedged
      sketch of the layout follows this entry); this implicitly also fixes
      the tradeoffs made back then with regard to races and refcount leaks
      as stated in bd1060a1, and removes the fallback, so that cgroup v2
      BPF programs always operate as expected.
      
        [0] https://github.com/nestybox/sysbox/
        [1] https://kind.sigs.k8s.io/
      
      Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
      8520e224
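      A hedged sketch of the reworked layout, close in spirit to the change
      above (exact field layout and config guards upstream may differ): the
      v2 pointer now lives alongside the v1 tagging data instead of being
      folded into one word, so sock_cgroup_ptr() needs no root fallback:

        struct sock_cgroup_data {
                struct cgroup   *cgroup;  /* cgroup v2, always valid */
        #ifdef CONFIG_CGROUP_NET_CLASSID
                u32             classid;  /* cgroup v1 net_cls tag */
        #endif
        #ifdef CONFIG_CGROUP_NET_PRIO
                u16             prioidx;  /* cgroup v1 net_prio index */
        #endif
        };

        static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
        {
                return skcd->cgroup;    /* no &cgrp_dfl_root.cgrp fallback */
        }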
  11. 10 Jun 2021, 1 commit
  12. 09 Jun 2021, 1 commit
  13. 25 May 2021, 1 commit
  14. 11 May 2021, 1 commit
  15. 17 Feb 2021, 1 commit
  16. 19 Aug 2020, 1 commit
    • cgroup: Use generic ns_common::count · f387882d
      Committed by Kirill Tkhai
      Switch over cgroup namespaces to use the newly introduced common lifetime
      counter.
      
      Currently every namespace type has its own lifetime counter which is stored
      in the specific namespace struct. The lifetime counters are used
      identically for all namespaces types. Namespaces may of course have
      additional unrelated counters and these are not altered.
      
      This introduces a common lifetime counter into struct ns_common. The
      ns_common struct encompasses information that all namespaces share,
      and that should include the lifetime counter since it's common to all
      of them (a sketch follows this entry).
      
      It also allows us to unify the type of the counters across all namespaces.
      Most of them use refcount_t, but one uses atomic_t and at least one
      uses kref. The last one in particular doesn't make much sense, since
      it has been just a wrapper around refcount_t since 2016, and it
      actually complicates cleanup operations by requiring container_of()
      to cast the correct namespace struct out of struct ns_common.
      
      Having the lifetime counter for the namespaces in one place reduces
      maintenance cost. Not just because after switching all namespaces over we
      will have removed more code than we added but also because the logic is
      more easily understandable and we indicate to the user that the basic
      lifetime requirements for all namespaces are currently identical.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/159644980994.604812.383801057081594972.stgit@localhost.localdomain
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      f387882d
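      A hedged sketch of the scheme described above (field layout per my
      reading of the series; details upstream may differ): the counter moves
      into the shared struct, and per-type helpers operate on it through the
      embedded ns_common:

        struct ns_common {
                atomic_long_t stashed;
                const struct proc_ns_operations *ops;
                unsigned int inum;
                refcount_t count;       /* the new shared lifetime counter */
        };

        static inline void get_cgroup_ns(struct cgroup_namespace *ns)
        {
                if (ns)
                        refcount_inc(&ns->ns.count);    /* shared counter */
        }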
  17. 08 Jul 2020, 1 commit
  18. 13 Feb 2020, 3 commits
    • clone3: allow spawning processes into cgroups · ef2c41cf
      Committed by Christian Brauner
      This adds support for creating a process in a different cgroup than its
      parent. Callers can limit and account processes and threads right from
      the moment they are spawned:
      - A service manager can directly spawn new services into dedicated
        cgroups.
      - A process can be directly created in a frozen cgroup and will be
        frozen as well.
      - The initial accounting jitter experienced by process supervisors and
        daemons is eliminated with this.
      - Threaded applications or even thread implementations can choose to
        create a specific cgroup layout where each thread is spawned
        directly into a dedicated cgroup.
      
      This feature is limited to the unified hierarchy. Callers need to pass
      a directory file descriptor for the target cgroup. The caller can
      choose to pass an O_PATH file descriptor. All usual migration
      restrictions apply, i.e. there can be no processes in inner nodes. In
      general, creating a process directly in a target cgroup adheres to all
      migration restrictions.
      
      One of the biggest advantages of this feature is that CLONE_INTO_CGROUP
      does not need to grab the write side of the global
      cgroup_threadgroup_rwsem, which makes moving tasks/threads around very
      expensive. With clone3() this lock is avoided entirely (a userspace
      usage sketch follows this entry).
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ef2c41cf
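      A hedged userspace sketch of the interface described above (assuming
      kernel headers new enough to define clone3() and CLONE_INTO_CGROUP,
      i.e. v5.7+, and an assumed, pre-created v2 cgroup /sys/fs/cgroup/mycg):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <linux/sched.h>        /* struct clone_args, CLONE_INTO_CGROUP */
        #include <signal.h>
        #include <sys/syscall.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void)
        {
                /* Directory fd of the target cgroup; O_PATH is enough. */
                int cgfd = open("/sys/fs/cgroup/mycg", O_DIRECTORY | O_PATH);
                if (cgfd < 0)
                        return 1;

                struct clone_args args = {
                        .flags       = CLONE_INTO_CGROUP,
                        .exit_signal = SIGCHLD,
                        .cgroup      = (unsigned long long)cgfd,
                };

                pid_t pid = syscall(SYS_clone3, &args, sizeof(args));
                if (pid < 0)
                        return 1;
                if (pid == 0)
                        _exit(0);       /* child starts life inside mycg */
                waitpid(pid, NULL, 0);
                return 0;
        }

      Note that no write to cgroup.procs happens at any point: the child is
      a member of the target cgroup from its first instruction.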
    • cgroup: Clean up css_set task traversal · f43caa2a
      Committed by Michal Koutný
      css_task_iter stores a pointer to the head of each iterable list; this
      dates back to commit 0f0a2b4f ("cgroup: reorganize css_task_iter"),
      from before we stored cur_cset.  Let us use the list heads in cur_cset
      directly and streamline css_task_iter_advance_css_set() a bit.  No
      functional change is intended.
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f43caa2a
    • cgroup: Iterate tasks that did not finish do_exit() · 9c974c77
      Committed by Michal Koutný
      PF_EXITING is set earlier than the actual removal from the css_set
      when a task is exiting. This can confuse cgroup.procs readers, who see
      no PF_EXITING tasks; rmdir, however, checks against css_set
      membership, so it can transiently fail with EBUSY.
      
      Fix this by listing tasks that weren't yet unlinked from the css_set
      active lists.
      It may happen that other users of the task iterator (without
      CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
      is equal to the state before commit c03cd773 ("cgroup: Include dying
      leaders with live threads in PROCS iterations") but it may be
      revisited later.
      Reported-by: Suren Baghdasaryan <surenb@google.com>
      Fixes: c03cd773 ("cgroup: Include dying leaders with live threads in PROCS iterations")
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      9c974c77
  19. 13 Nov 2019, 2 commits
    • cgroup: use cgrp->kn->id as the cgroup ID · 74321038
      Committed by Tejun Heo
      cgroup ID is currently allocated using a dedicated per-hierarchy idr
      and used internally and exposed through tracepoints and bpf.  This is
      confusing because there are tracepoints and other interfaces which use
      the cgroupfs ino as IDs.
      
      The preceding changes expose kn->id as the inode number: a full 64-bit
      ino on supported archs, or ino+gen (low 32 bits as ino, high 32 bits
      as gen) otherwise.  There's no reason for cgroup to use different IDs.
      The kernfs IDs are unique and userland can easily discover them and
      map them back to paths using standard file operations.
      
      This patch replaces cgroup IDs with kernfs IDs.
      
      * cgroup_id() is added and all cgroup ID users are converted to use it.
      
      * kernfs_node creation is moved to earlier during cgroup init so that
        cgroup_id() is available during init.
      
      * While at it, s/cgroup/cgrp/ in psi helpers for consistency.
      
      * Fallback ID value is changed to 1 to be consistent with root cgroup
        ID.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      74321038
    • kernfs: convert kernfs_node->id from union kernfs_node_id to u64 · 67c0496e
      Committed by Tejun Heo
      kernfs_node->id is currently a union kernfs_node_id which represents
      either a 32bit (ino, gen) pair or u64 value.  I can't see much value
      in the usage of the union - all that's needed is a 64bit ID which the
      current code is already limited to.  Using a union makes the code
      unnecessarily complicated and prevents using 64bit ino without adding
      practical benefits.
      
      This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
      The ino is stored in the lower 32 bits and the gen in the upper 32.
      Accessors - kernfs[_id]_ino() and kernfs[_id]_gen() - are added to
      retrieve the ino and gen (a sketch follows this entry).  This makes ID
      handling less cumbersome and will allow using 64-bit inos on supported
      archs.
      
      This patch doesn't make any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      67c0496e
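      A hedged sketch of the accessors and the packing described above
      (close to the upstream helpers, but treat the details as my reading of
      the change): ino in the low 32 bits, gen in the high 32, unless ino_t
      is 64 bits wide, in which case the whole id is the ino:

        static inline ino_t kernfs_id_ino(u64 id)
        {
                if (sizeof(ino_t) >= sizeof(u64))
                        return id;              /* 64-bit ino archs */
                return (u32)id;                 /* low 32 bits */
        }

        static inline u32 kernfs_id_gen(u64 id)
        {
                if (sizeof(ino_t) >= sizeof(u64))
                        return 1;               /* gen unused */
                return id >> 32;                /* high 32 bits */
        }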
  20. 25 Oct 2019, 1 commit
    • cgroup: remove cgroup_enable_task_cg_lists() optimization · 5153faac
      Committed by Tejun Heo
      cgroup_enable_task_cg_lists() is used to lazily initialize task
      cgroup associations on first use in order to reduce fork / exit
      overheads on systems which don't use cgroup.  Unfortunately, the
      locking around it has never been actually correct, and its value is
      dubious given that the vast majority of systems use cgroup right away
      from boot.
      
      This patch removes the optimization.  For now, replace the cg_list
      based branches with WARN_ON_ONCE()'s to be on the safe side.  We can
      simplify the logic further in the future.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5153faac
  21. 25 Jul 2019, 1 commit
  22. 10 Jul 2019, 1 commit
  23. 01 Jun 2019, 3 commits
    • cgroup: add cgroup_parse_float() · a5e112e6
      Committed by Tejun Heo
      cgroup already uses floating point for percent[ile] numbers and there
      are several controllers which want to take them as input.  Add a
      generic parse helper to handle inputs.
      
      Update the interface convention documentation about the use of
      percentage numbers, and while at it, also clarify the default time
      unit.  (A usage sketch follows this entry.)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a5e112e6
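      A hedged usage sketch, assuming the helper's signature is
      int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
      and that it stores the parsed value scaled by 10^dec_shift (both are
      my reading of the change, not confirmed by this log):

        s64 weight;

        /* Assumed behavior: "12.34" with dec_shift == 2 yields 1234. */
        if (cgroup_parse_float(buf, 2, &weight))
                return -EINVAL;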
    • cgroup: Include dying leaders with live threads in PROCS iterations · c03cd773
      Committed by Tejun Heo
      CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
      this means that a process with a dying leader and live threads will be
      skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
      which is confusing to say the least.
      
      Fix it by making cset track dying tasks and include dying leaders with
      live threads in PROCS iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      c03cd773
    • cgroup: Implement css_task_iter_skip() · b636fd38
      Committed by Tejun Heo
      When a task is moved out of a cset, task iterators pointing to the
      task are advanced using the normal css_task_iter_advance() call.  This
      is fine but we'll be tracking dying tasks on csets and thus moving
      tasks from cset->tasks to (to be added) cset->dying_tasks.  When we
      remove a task from cset->tasks, if we advance the iterators, they may
      move over to the next cset before we had the chance to add the task
      back on the dying list, which can allow the task to escape iteration.
      
      This patch separates skipping out from advancing.  Skipping only moves
      the affected iterators to the next pointer rather than fully advancing
      them; the subsequent advance will recognize that the cursor has
      already been moved forward and do the rest of the advancing.  This
      ensures that when a task moves from one list to another in its cset,
      as long as it moves in the right direction, it's always visible to
      iteration.
      
      This doesn't cause any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      b636fd38
  24. 30 May 2019, 1 commit
    • cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css() · 18fa84a2
      Committed by Tejun Heo
      A PF_EXITING task can stay associated with an offline css.  If such a
      task calls task_get_css(), it can get stuck indefinitely.  This can be
      triggered by BSD process accounting, which writes to a file with
      PF_EXITING set while racing against memcg disable, as in the backtrace
      at the end.
      
      After this change, task_get_css() may return a css which was already
      offline when the function was called.  None of the existing users are
      affected by this change (a sketch of the fixed retry loop follows this
      entry).
      
        INFO: rcu_sched self-detected stall on CPU
        INFO: rcu_sched detected stalls on CPUs/tasks:
        ...
        NMI backtrace for cpu 0
        ...
        Call Trace:
         <IRQ>
         dump_stack+0x46/0x68
         nmi_cpu_backtrace.cold.2+0x13/0x57
         nmi_trigger_cpumask_backtrace+0xba/0xca
         rcu_dump_cpu_stacks+0x9e/0xce
         rcu_check_callbacks.cold.74+0x2af/0x433
         update_process_times+0x28/0x60
         tick_sched_timer+0x34/0x70
         __hrtimer_run_queues+0xee/0x250
         hrtimer_interrupt+0xf4/0x210
         smp_apic_timer_interrupt+0x56/0x110
         apic_timer_interrupt+0xf/0x20
         </IRQ>
        RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0
        ...
         btrfs_file_write_iter+0x31b/0x563
         __vfs_write+0xfa/0x140
         __kernel_write+0x4f/0x100
         do_acct_process+0x495/0x580
         acct_process+0xb9/0xdb
         do_exit+0x748/0xa00
         do_group_exit+0x3a/0xa0
         get_signal+0x254/0x560
         do_signal+0x23/0x5c0
         exit_to_usermode_loop+0x5d/0xa0
         prepare_exit_to_usermode+0x53/0x80
         retint_user+0x8/0x8
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v4.2+
      Fixes: ec438699 ("cgroup, block: implement task_get_css() and use it in bio_associate_current()")
      18fa84a2
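      A hedged sketch of the fixed retry loop (close to the helper in
      include/linux/cgroup.h as I read it): css_tryget() succeeds on an
      offline-but-not-released css, so a PF_EXITING task pinned to an
      offline css can no longer spin here forever:

        static inline struct cgroup_subsys_state *
        task_get_css(struct task_struct *task, int subsys_id)
        {
                struct cgroup_subsys_state *css;

                rcu_read_lock();
                while (true) {
                        css = task_css(task, subsys_id);
                        /* was css_tryget_online(), which can fail forever */
                        if (likely(css_tryget(css)))
                                break;
                        cpu_relax();
                }
                rcu_read_unlock();
                return css;
        }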
  25. 29 May 2019, 1 commit
    • bpf: decouple the lifetime of cgroup_bpf from cgroup itself · 4bfc0bb2
      Committed by Roman Gushchin
      Currently the lifetime of bpf programs attached to a cgroup is bound
      to the lifetime of the cgroup itself. This means that if a user
      forgets to detach a bpf program (or intentionally avoids doing so)
      before removing the cgroup, it will stay attached up to the release of
      the cgroup. Since the cgroup can stay in the dying state (the state
      between being rmdir()'ed and being released) for a very long time,
      this leads to a waste of memory. It also blocks the possibility of
      implementing memcg-based memory accounting for bpf objects, because a
      circular reference dependency would occur: charged memory pages pin
      the corresponding memory cgroup, and if the memory cgroup pins the
      attached bpf program, nothing will ever be released.
      
      A dying cgroup cannot contain any processes, so the only chance for
      an attached bpf program to be executed is a live socket associated
      with the cgroup. So in order to release all bpf data early, let's
      count associated sockets using a new percpu refcounter. On cgroup
      removal the counter is transitioned to the atomic mode, and as soon
      as it reaches 0, all bpf programs are detached.
      
      Because cgroup_bpf_release() can block, it can't be called from
      the percpu ref counter callback directly, so an asynchronous
      work item is scheduled instead (a sketch follows this entry).
      
      The reference counter is not socket specific and can be used for any
      other types of programs that can be executed from a cgroup-bpf hook
      outside of process context, should such a need arise in the future.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: jolsa@redhat.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4bfc0bb2
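      A hedged sketch of the scheme (field and helper names follow my
      reading of the change): the percpu ref's release callback runs in a
      context that cannot block, so it only schedules the real cleanup:

        static void cgroup_bpf_release_fn(struct percpu_ref *ref)
        {
                struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

                INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
                queue_work(system_wq, &cgrp->bpf.release_work);
        }

        static int cgroup_bpf_inherit(struct cgroup *cgrp)
        {
                /* Starts in fast percpu mode; rmdir kills the ref, which
                 * switches it to atomic mode and fires the callback at 0. */
                return percpu_ref_init(&cgrp->bpf.refcnt, cgroup_bpf_release_fn,
                                       0, GFP_KERNEL);
        }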
  26. 06 May 2019, 1 commit
    • cgroup: get rid of cgroup_freezer_frozen_exit() · 96b9c592
      Committed by Roman Gushchin
      A task should never enter the exit path with the task->frozen bit set.
      Any frozen task must enter the signal handling loop, and the only
      way to escape is through cgroup_leave_frozen(true), which
      unconditionally drops the task->frozen bit. This means that
      cgroup_freezer_frozen_exit() has zero chance of being called and
      has to be removed.
      
      Let's put a WARN_ON_ONCE() instead of the cgroup_freezer_frozen_exit()
      call to catch any potential leak of the task's frozen bit.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      96b9c592
  27. 20 Apr 2019, 1 commit
    • cgroup: cgroup v2 freezer · 76f969e8
      Committed by Roman Gushchin
      Cgroup v1 implements the freezer controller, which provides the
      ability to stop the workload in a cgroup and temporarily free up some
      resources (cpu, io, network bandwidth and, potentially, memory)
      for some other tasks. Cgroup v2 lacks this functionality.
      
      This patch implements freezer for cgroup v2.
      
      Cgroup v2 freezer tries to put tasks into a state similar to jobctl
      stop. This means that tasks can be killed, ptraced (using
      PTRACE_SEIZE*), and interrupted. It is possible to attach to
      a frozen task, get some information (e.g. read registers) and detach.
      It's also possible to migrate a frozen task to another cgroup.
      
      This distinguishes the cgroup v2 freezer from the cgroup v1 freezer,
      which mostly tried to imitate the system-wide freezer. While
      uninterruptible sleep is fine when all tasks are going to be frozen
      (the hibernation case), it is not an acceptable state for only a
      subset of the system.
      
      The cgroup v2 freezer does not support freezing kthreads.
      If a non-root cgroup contains a kthread, the cgroup can still be
      frozen, but the kthread will remain running, the cgroup will be shown
      as non-frozen, and the notification will not be delivered.
      
      * PTRACE_ATTACH does not work because non-fatal signal delivery
      is blocked in the frozen state.
      
      There are also some interface differences between the cgroup v1 and
      cgroup v2 freezers, required to conform to the cgroup v2 interface
      design principles:
      1) There is no separate controller, which has to be turned on:
      the functionality is always available and is represented by
      cgroup.freeze and cgroup.events cgroup control files.
      2) The desired state is defined by the cgroup.freeze control file.
      Any hierarchical configuration is allowed.
      3) The interface is asynchronous. The actual state is available
      using cgroup.events control file ("frozen" field). There are no
      dedicated transitional states.
      4) It's allowed to make any changes to the cgroup hierarchy
      (create new cgroups, remove old cgroups, move tasks between cgroups)
      regardless of whether some cgroups are frozen. (A userspace usage
      sketch follows this entry.)
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@fb.com
      76f969e8
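      A minimal userspace sketch of the asynchronous interface described
      above, assuming a pre-created cgroup /sys/fs/cgroup/mycg and
      sufficient privileges: write "1" to cgroup.freeze, then poll
      cgroup.events for the "frozen 1" field:

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/sys/fs/cgroup/mycg/cgroup.freeze", O_WRONLY);
                if (fd < 0 || write(fd, "1", 1) != 1)
                        return 1;       /* request the frozen state */
                close(fd);

                char buf[256];
                for (;;) {
                        fd = open("/sys/fs/cgroup/mycg/cgroup.events", O_RDONLY);
                        ssize_t n = read(fd, buf, sizeof(buf) - 1);
                        close(fd);
                        if (n <= 0)
                                return 1;
                        buf[n] = '\0';
                        if (strstr(buf, "frozen 1"))
                                break;          /* actual state reached */
                        usleep(100 * 1000);     /* interface is async */
                }
                puts("cgroup is frozen");
                return 0;
        }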
  28. 31 Jan 2019, 1 commit
    • cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · 51bee5ab
      Committed by Oleg Nesterov
      The only user of the cgroup_subsys->free() callback is
      pids_cgrp_subsys, which needs pids_free() to uncharge the pid.
      
      However, ->free() is called from __put_task_struct()->cgroup_free() and this
      is too late. Even the trivial program which does
      
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			exit(0);
      	}
      
      can hit its pids limit, because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU grace period after the task/pid goes away and before
      the final put().
      
      Test-case:
      
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      
      Without this patch the forking process fails soon after migration.
      
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move the
      callsite into the new helper, cgroup_release(), called by
      release_task(), which actually frees the pid(s). (A sketch follows
      this entry.)
      Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      51bee5ab
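      A hedged sketch of the new hook placement (mirroring the helper as I
      read the change): release_task(), which actually frees the pid, now
      calls cgroup_release(), which invokes the renamed callback, so
      pids_release() uncharges at the right time:

        void cgroup_release(struct task_struct *task)
        {
                struct cgroup_subsys *ss;
                int ssid;

                /* Invoke ->release() for every subsystem providing one;
                 * for pids this uncharges before the RCU grace period. */
                do_each_subsys_mask(ss, ssid, have_release_callback) {
                        ss->release(task);
                } while_each_subsys_mask();
        }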
  29. 08 Dec 2018, 1 commit
  30. 02 Nov 2018, 1 commit
  31. 27 Oct 2018, 1 commit
  32. 25 Sep 2018, 1 commit
  33. 22 Sep 2018, 1 commit