提交 · 3f3942aca6da351a12543aa776467791b63b3a78 · openanolis / cloud-kernel

16 5月, 2018 1 次提交

proc: introduce proc_create_single{,_data} · 3f3942ac

由 Christoph Hellwig 提交于 5月 15, 2018

Variants of proc_create{,_data} that directly take a seq_file show
callback and drastically reduces the boilerplate code in the callers.

All trivial callers converted over.
Signed-off-by: NChristoph Hellwig <hch@lst.de>

3f3942ac

20 3月, 2018 1 次提交

cgroup: Use rcu_work instead of explicit rcu and work item · 8f36aaec

由 Tejun Heo 提交于 3月 14, 2018

Workqueue now has rcu_work.  Use it instead of open-coding rcu -> work
item bouncing.
Signed-off-by: NTejun Heo <tj@kernel.org>

8f36aaec

22 2月, 2018 1 次提交

cgroup: fix rule checking for threaded mode switching · d1897c95

由 Tejun Heo 提交于 2月 21, 2018

A domain cgroup isn't allowed to be turned threaded if its subtree is
populated or domain controllers are enabled.  cgroup_enable_threaded()
depended on cgroup_can_be_thread_root() test to enforce this rule.  A
parent which has populated domain descendants or have domain
controllers enabled can't become a thread root, so the above rules are
enforced automatically.

However, for the root cgroup which can host mixed domain and threaded
children, cgroup_can_be_thread_root() doesn't check any of those
conditions and thus first level cgroups ends up escaping those rules.

This patch fixes the bug by adding explicit checks for those rules in
cgroup_enable_threaded().
Reported-by: NMichael Kerrisk (man-pages) <mtk.manpages@gmail.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Fixes: 8cfd8147 ("cgroup: implement cgroup v2 thread support")
Cc: stable@vger.kernel.org # v4.14+

d1897c95

20 1月, 2018 1 次提交

string: drop __must_check from strscpy() and restore strscpy() usages in cgroup · 08a77676

由 Tejun Heo 提交于 1月 09, 2018

e7fd37ba ("cgroup: avoid copying strings longer than the buffers")
converted possibly unsafe strncpy() usages in cgroup to strscpy().
However, although the callsites are completely fine with truncated
copied, because strscpy() is marked __must_check, it led to the
following warnings.

  kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
  kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
     strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
	       ^

To avoid the warnings, 50034ed4 ("cgroup: use strlcpy() instead of
strscpy() to avoid spurious warning") switched them to strlcpy().

strlcpy() is worse than strlcpy() because it unconditionally runs
strlen() on the source string, and the only reason we switched to
strlcpy() here was because it was lacking __must_check, which doesn't
reflect any material differences between the two function.  It's just
that someone added __must_check to strscpy() and not to strlcpy().

These basic string copy operations are used in variety of ways, and
one of not-so-uncommon use cases is safely handling truncated copies,
where the caller naturally doesn't care about the return value.  The
__must_check doesn't match the actual use cases and forces users to
opt for inferior variants which lack __must_check by happenstance or
spread ugly (void) casts.

Remove __must_check from strscpy() and restore strscpy() usages in
cgroup.
Signed-off-by: NTejun Heo <tj@kernel.org>
Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>

08a77676

11 1月, 2018 1 次提交

cgroup: make cgroup.threads delegatable · 4f58424d

由 Roman Gushchin 提交于 1月 10, 2018

Make cgroup.threads file delegatable.
The behavior of cgroup.threads should follow the behavior of cgroup.procs.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Discovered-by: NMichael Kerrisk <mtk.manpages@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NTejun Heo <tj@kernel.org>

4f58424d

20 12月, 2017 1 次提交

cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC · 74d0833c

由 Tejun Heo 提交于 12月 20, 2017

While teaching css_task_iter to handle skipping over tasks which
aren't group leaders, bc2fb7ed ("cgroup: add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS") introduced a
silly bug.

CSS_TASK_ITER_PROCS is implemented by repeating
css_task_iter_advance() while the advanced cursor is pointing to a
non-leader thread.  However, the cursor variable, @l, wasn't updated
when the iteration has to advance to the next css_set and the
following repetition would operate on the terminal @l from the
previous iteration which isn't pointing to a valid task leading to
oopses like the following or infinite looping.

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000254
  IP: __task_pid_nr_ns+0xc7/0xf0
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP
  ...
  CPU: 2 PID: 1 Comm: systemd Not tainted 4.14.4-200.fc26.x86_64 #1
  Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 3203 11/09/2017
  task: ffff88c4baee8000 task.stack: ffff96d5c3158000
  RIP: 0010:__task_pid_nr_ns+0xc7/0xf0
  RSP: 0018:ffff96d5c315bd50 EFLAGS: 00010206
  RAX: 0000000000000000 RBX: ffff88c4b68c6000 RCX: 0000000000000250
  RDX: ffffffffa5e47960 RSI: 0000000000000000 RDI: ffff88c490f6ab00
  RBP: ffff96d5c315bd50 R08: 0000000000001000 R09: 0000000000000005
  R10: ffff88c4be006b80 R11: ffff88c42f1b8004 R12: ffff96d5c315bf18
  R13: ffff88c42d7dd200 R14: ffff88c490f6a510 R15: ffff88c4b68c6000
  FS:  00007f9446f8ea00(0000) GS:ffff88c4be680000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000254 CR3: 00000007f956f000 CR4: 00000000003406e0
  Call Trace:
   cgroup_procs_show+0x19/0x30
   cgroup_seqfile_show+0x4c/0xb0
   kernfs_seq_show+0x21/0x30
   seq_read+0x2ec/0x3f0
   kernfs_fop_read+0x134/0x180
   __vfs_read+0x37/0x160
   ? security_file_permission+0x9b/0xc0
   vfs_read+0x8e/0x130
   SyS_read+0x55/0xc0
   entry_SYSCALL_64_fastpath+0x1a/0xa5
  RIP: 0033:0x7f94455f942d
  RSP: 002b:00007ffe81ba2d00 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
  RAX: ffffffffffffffda RBX: 00005574e2233f00 RCX: 00007f94455f942d
  RDX: 0000000000001000 RSI: 00005574e2321a90 RDI: 000000000000002b
  RBP: 0000000000000000 R08: 00005574e2321a90 R09: 00005574e231de60
  R10: 00007f94458c8b38 R11: 0000000000000293 R12: 00007f94458c8ae0
  R13: 00007ffe81ba3800 R14: 0000000000000000 R15: 00005574e2116560
  Code: 04 74 0e 89 f6 48 8d 04 76 48 8d 04 c5 f0 05 00 00 48 8b bf b8 05 00 00 48 01 c7 31 c0 48 8b 0f 48 85 c9 74 18 8b b2 30 08 00 00 <3b> 71 04 77 0d 48 c1 e6 05 48 01 f1 48 3b 51 38 74 09 5d c3 8b
  RIP: __task_pid_nr_ns+0xc7/0xf0 RSP: ffff96d5c315bd50

Fix it by moving the initialization of the cursor below the repeat
label.  While at it, rename it to @next for readability.
Signed-off-by: NTejun Heo <tj@kernel.org>
Fixes: bc2fb7ed ("cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS")
Cc: stable@vger.kernel.org # v4.14+
Reported-by: NLaura Abbott <labbott@redhat.com>
Reported-by: NBronek Kozicki <brok@incorrekt.com>
Reported-by: NGeorge Amanakis <gamanakis@gmail.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

74d0833c

15 12月, 2017 1 次提交

cgroup: use strlcpy() instead of strscpy() to avoid spurious warning · 50034ed4

由 Arnd Bergmann 提交于 12月 15, 2017

As long as cft->name is guaranteed to be NUL-terminated, using strlcpy() would
work just as well and avoid that warning, so the change below could be folded
into that commit.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NTejun Heo <tj@kernel.org>

50034ed4

12 12月, 2017 1 次提交

cgroup: avoid copying strings longer than the buffers · e7fd37ba

由 Ma Shimiao 提交于 12月 12, 2017

cgroup root name and file name have max length limit, we should
avoid copying longer name than that to the name.

tj: minor update to $SUBJ.
Signed-off-by: NMa Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

e7fd37ba

07 11月, 2017 2 次提交

cgroup: export list of cgroups v2 features using sysfs · 5f2e6734

由 Roman Gushchin 提交于 11月 06, 2017

The active development of cgroups v2 sometimes leads to a creation
of interfaces, which are not turned on by default (to provide
backward compatibility). It's handy to know from userspace, which
cgroup v2 features are supported without calculating it based
on the kernel version. So, let's export the list of such features
using /sys/kernel/cgroup/features pseudo-file.

The list is hardcoded and has to be extended when new functionality
is added. Each feature is printed on a new line.

Example:
  $ cat /sys/kernel/cgroup/features
  nsdelegate
Signed-off-by: NRoman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: NTejun Heo <tj@kernel.org>

5f2e6734

cgroup: export list of delegatable control files using sysfs · 01ee6cfb

由 Roman Gushchin 提交于 11月 06, 2017

Delegatable cgroup v2 control files may require special handling
(e.g. chowning), and the exact list of such files varies between
kernel versions (and likely to be extended in the future).

To guarantee correctness of this list and simplify the life
of userspace (systemd, first of all), let's export the list
via /sys/kernel/cgroup/delegate pseudo-file.

Format is siple: each control file name is printed on a new line.
Example:
  $ cat /sys/kernel/cgroup/delegate
  cgroup.procs
  cgroup.subtree_control
Signed-off-by: NRoman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: NTejun Heo <tj@kernel.org>

01ee6cfb

30 10月, 2017 1 次提交

cgroup: mark @cgrp __maybe_unused in cpu_stat_show() · c3ba1329

由 Tejun Heo 提交于 10月 30, 2017

The local variable @cgrp isn't used if !CONFIG_CGROUP_SCHED.  Mark the
variable with __maybe_unused to avoid a compile warning.
Reported-by: N"kbuild-all@01.org" <kbuild-all@01.org>
Signed-off-by: NTejun Heo <tj@kernel.org>

c3ba1329

27 10月, 2017 1 次提交

cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat · d41bf8c9

由 Tejun Heo 提交于 10月 23, 2017

The basic cpu stat is currently shown with "cpu." prefix in
cgroup.stat, and the same information is duplicated in cpu.stat when
cpu controller is enabled.  This is ugly and not very scalable as we
want to expand the coverage of stat information which is always
available.

This patch makes cgroup core always create "cpu.stat" file and show
the basic cpu stat there and calls the cpu controller to show the
extra stats when enabled.  This ensures that the same information
isn't presented in multiple places and makes future expansion of basic
stats easier.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>

d41bf8c9

11 10月, 2017 1 次提交

Revert "net: defer call to cgroup_sk_alloc()" · 75cb0709

由 Eric Dumazet 提交于 10月 10, 2017

This reverts commit fbb1fb4a.

This was not the proper fix, lets cleanly revert it, so that
following patch can be carried to stable versions.

sock_cgroup_ptr() callers do not expect a NULL return value.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

75cb0709

10 10月, 2017 1 次提交

net: defer call to cgroup_sk_alloc() · fbb1fb4a

由 Eric Dumazet 提交于 10月 08, 2017

sk_clone_lock() might run while TCP/DCCP listener already vanished.

In order to prevent use after free, it is better to defer cgroup_sk_alloc()
to the point we know both parent and child exist, and from process context.

Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fbb1fb4a

05 10月, 2017 2 次提交

bpf: introduce BPF_PROG_QUERY command · 468e2f64

由 Alexei Starovoitov 提交于 10月 02, 2017

introduce BPF_PROG_QUERY command to retrieve a set of either
attached programs to given cgroup or a set of effective programs
that will execute for events within a cgroup
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

468e2f64

bpf: multi program support for cgroup+bpf · 324bda9e

由 Alexei Starovoitov 提交于 10月 02, 2017

introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
bpf programs to a cgroup.

The difference between three possible flags for BPF_PROG_ATTACH command:
- NONE(default): No further bpf programs allowed in the subtree.
- BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
  the program in this cgroup yields to sub-cgroup program.
- BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
  that cgroup program gets run in addition to the program in this cgroup.

NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
change their behavior. It only clarifies the semantics in relation
to new flag.

Only one program is allowed to be attached to a cgroup with
NONE or BPF_F_ALLOW_OVERRIDE flag.
Multiple programs are allowed to be attached to a cgroup with
BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
(those that were attached first, run first)
The programs of sub-cgroup are executed first, then programs of
this cgroup and then programs of parent cgroup.
All eligible programs are executed regardless of return code from
earlier programs.

To allow efficient execution of multiple programs attached to a cgroup
and to avoid penalizing cgroups without any programs attached
introduce 'struct bpf_prog_array' which is RCU protected array
of pointers to bpf programs.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

324bda9e

26 9月, 2017 1 次提交

cgroup: statically initialize init_css_set->dfl_cgrp · 38683148

由 Tejun Heo 提交于 9月 25, 2017

Like other csets, init_css_set's dfl_cgrp is initialized when the cset
gets linked.  init_css_set gets linked in cgroup_init().  This has
been fine till now but the recently added basic CPU usage accounting
may end up accessing dfl_cgrp of init before cgroup_init() leading to
the following oops.

  SELinux:  Initializing.
  BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
  IP: account_system_index_time+0x60/0x90
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc2-00003-g041cd640 #10
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
  +1.9.3-20161025_171302-gandalf 04/01/2014
  task: ffffffff81e10480 task.stack: ffffffff81e00000
  RIP: 0010:account_system_index_time+0x60/0x90
  RSP: 0000:ffff880011e03cb8 EFLAGS: 00010002
  RAX: ffffffff81ef8800 RBX: ffffffff81e10480 RCX: 0000000000000003
  RDX: 0000000000000000 RSI: 00000000000f4240 RDI: 0000000000000000
  RBP: ffff880011e03cc0 R08: 0000000000010000 R09: 0000000000000000
  R10: 0000000000000020 R11: 0000003b9aca0000 R12: 000000000001c100
  R13: 0000000000000000 R14: ffffffff81e10480 R15: ffffffff81e03cd8
  FS:  0000000000000000(0000) GS:ffff880011e00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000b0 CR3: 0000000001e09000 CR4: 00000000000006b0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   account_system_time+0x45/0x60
   account_process_tick+0x5a/0x140
   update_process_times+0x22/0x60
   tick_periodic+0x2b/0x90
   tick_handle_periodic+0x25/0x70
   timer_interrupt+0x15/0x20
   __handle_irq_event_percpu+0x7e/0x1b0
   handle_irq_event_percpu+0x23/0x60
   handle_irq_event+0x42/0x70
   handle_level_irq+0x83/0x100
   handle_irq+0x6f/0x110
   do_IRQ+0x46/0xd0
   common_interrupt+0x9d/0x9d

Fix it by statically initializing init_css_set.dfl_cgrp so that init's
default cgroup is accessible from the get-go.

Fixes: 041cd640 ("cgroup: Implement cgroup2 basic CPU usage accounting")
Reported-by: N“kbuild-all@01.org” <kbuild-all@01.org>
Signed-off-by: NTejun Heo <tj@kernel.org>

38683148

25 9月, 2017 1 次提交

cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640

由 Tejun Heo 提交于 9月 25, 2017

In cgroup1, while cpuacct isn't actually controlling any resources, it
is a separate controller due to combination of two factors -
1. enabling cpu controller has significant side effects, and 2. we
have to pick one of the hierarchies to account CPU usages on.  cpuacct
controller is effectively used to designate a hierarchy to track CPU
usages on.

cgroup2's unified hierarchy removes the second reason and we can
account basic CPU usages by default.  While we can use cpuacct for
this purpose, both its interface and implementation leave a lot to be
desired - it collects and exposes two sources of truth which don't
agree with each other and some of the exposed statistics don't make
much sense.  Also, it propagates all the way up the hierarchy on each
accounting event which is unnecessary.

This patch adds basic resource accounting mechanism to cgroup2's
unified hierarchy and accounts CPU usages using it.

* All accountings are done per-cpu and don't propagate immediately.
  It just bumps the per-cgroup per-cpu counters and links to the
  parent's updated list if not already on it.

* On a read, the per-cpu counters are collected into the global ones
  and then propagated upwards.  Only the per-cpu counters which have
  changed since the last read are propagated.

* CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
  prefix.  Total usage is collected from scheduling events.  User/sys
  breakdown is sourced from tick sampling and adjusted to the usage
  using cputime_adjust().

This keeps the accounting side hot path O(1) and per-cpu and the read
side O(nr_updated_since_last_read).

v2: Minor changes and documentation updates as suggested by Waiman and
    Roman.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Roman Gushchin <guro@fb.com>

041cd640

22 9月, 2017 1 次提交

cgroup: Reinit cgroup_taskset structure before cgroup_migrate_execute() returns · c4fa6c43

由 Waiman Long 提交于 9月 21, 2017

The cgroup_taskset structure within the larger cgroup_mgctx structure
is supposed to be used once and then discarded. That is not really the
case in the hotplug code path:

cpuset_hotplug_workfn()
 - cgroup_transfer_tasks()
   - cgroup_migrate()
     - cgroup_migrate_add_task()
     - cgroup_migrate_execute()

In this case, the cgroup_migrate() function is called multiple time
with the same cgroup_mgctx structure to transfer the tasks from
one cgroup to another one-by-one. The second time cgroup_migrate()
is called, the cgroup_taskset will be in an incorrect state and so
may cause the system to panic. For example,

  [  150.888410] Faulting instruction address: 0xc0000000001db648
  [  150.888414] Oops: Kernel access of bad area, sig: 11 [#1]
  [  150.888417] SMP NR_CPUS=2048
  [  150.888417] NUMA
  [  150.888419] pSeries
    :
  [  150.888545] NIP [c0000000001db648] cpuset_can_attach+0x58/0x1b0
  [  150.888548] LR [c0000000001db638] cpuset_can_attach+0x48/0x1b0
  [  150.888551] Call Trace:
  [  150.888554] [c0000005f65cb940] [c0000000001db638] cpuset_can_attach+0x48/0x1b 0 (unreliable)
  [  150.888559] [c0000005f65cb9a0] [c0000000001cff04] cgroup_migrate_execute+0xc4/0x4b0
  [  150.888563] [c0000005f65cba20] [c0000000001d7d14] cgroup_transfer_tasks+0x1d4/0x370
  [  150.888568] [c0000005f65cbb70] [c0000000001ddcb0] cpuset_hotplug_workfn+0x710/0x8f0
  [  150.888572] [c0000005f65cbc80] [c00000000012032c] process_one_work+0x1ac/0x4d0
  [  150.888576] [c0000005f65cbd20] [c0000000001206f8] worker_thread+0xa8/0x5b0
  [  150.888580] [c0000005f65cbdc0] [c0000000001293f8] kthread+0x168/0x1b0
  [  150.888584] [c0000005f65cbe30] [c00000000000b368] ret_from_kernel_thread+0x5c/0x74

To allow reuse of the cgroup_mgctx structure, some fields in that
structure are now re-initialized at the end of cgroup_migrate_execute()
function call so that the structure can be reused again in a later
iteration without causing problem.

This bug was introduced in the commit e595cd70 ("group: track
migration context in cgroup_mgctx") in 4.11. This commit moves the
cgroup_taskset initialization out of cgroup_migrate(). The commit
10467270fb3 ("cgroup: don't call migration methods if there are no
tasks to migrate") helped, but did not completely resolve the problem.

Fixes: e595cd70 ("group: track migration context in cgroup_mgctx")
Signed-off-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.11+

c4fa6c43

07 9月, 2017 1 次提交

cgroup: revert ("cgroup: reset css on destruction") · 65f3975f

由 Roman Gushchin 提交于 9月 06, 2017

Commit fa06235b ("cgroup: reset css on destruction") caused
css_reset callback to be called from the offlining path.  Although it
solves the problem mentioned in the commit description ("For instance,
memory cgroup needs to reset memory.low, otherwise pages charged to a
dead cgroup might never get reclaimed."), generally speaking, it's not
correct.

An offline cgroup can still be a resource domain, and we shouldn't grant
it more resources than it had before deletion.

For instance, if an offline memory cgroup has dirty pages, we should
still imply i/o limits during writeback.

The css_reset callback is designed to return the cgroup state into the
original state, that means reset all limits and counters.  It's
spomething different from the offlining, and we shouldn't use it from
the offlining path.  Instead, we should adjust necessary settings from
the per-controller css_offline callbacks (e.g.  reset memory.low).

Link: http://lkml.kernel.org/r/20170727130428.28856-2-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
Acked-by: NTejun Heo <tj@kernel.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

65f3975f

12 8月, 2017 1 次提交

cgroup: remove unneeded checks · 696b98f2

由 Dan Carpenter 提交于 8月 09, 2017

"descendants" and "depth" are declared as int, so they can't be larger
than INT_MAX.  Static checkers complain and it's slightly confusing for
humans as well so let's just remove these conditions.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

696b98f2

11 8月, 2017 1 次提交

cgroup: misc changes · 3e48930c

由 Tejun Heo 提交于 8月 11, 2017

Misc trivial changes to prepare for future changes.  No functional
difference.

* Expose cgroup_get(), cgroup_tryget() and cgroup_parent().

* Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.

* Rename cgroup_stats_show() to cgroup_stat_show() for consistency
  with the file name.
Signed-off-by: NTejun Heo <tj@kernel.org>

3e48930c

03 8月, 2017 5 次提交

cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy · 13d82fb7

由 Tejun Heo 提交于 8月 02, 2017

Each css_set directly points to the default cgroup it belongs to, so
there's no reason to walk the cgrp_links list on the default
hierarchy.
Signed-off-by: NTejun Heo <tj@kernel.org>

13d82fb7

cgroup: re-use the parent pointer in cgroup_destroy_locked() · 5a621e6c

由 Roman Gushchin 提交于 8月 02, 2017

As we already have a pointer to the parent cgroup in
cgroup_destroy_locked(), we don't need to calculate it again
to pass as an argument for cgroup1_check_for_release().
Signed-off-by: NRoman Gushchin <guro@fb.com>
Suggested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: linux-kernel@vger.kernel.org

5a621e6c

cgroup: add cgroup.stat interface with basic hierarchy stats · ec39225c

由 Roman Gushchin 提交于 8月 02, 2017

A cgroup can consume resources even after being deleted by a user.
For example, writing back dirty pages should be accounted and
limited, despite the corresponding cgroup might contain no processes
and being deleted by a user.

In the current implementation a cgroup can remain in such "dying" state
for an undefined amount of time. For instance, if a memory cgroup
contains a pge, mlocked by a process belonging to an other cgroup.

Although the lifecycle of a dying cgroup is out of user's control,
it's important to have some insight of what's going on under the hood.

In particular, it's handy to have a counter which will allow
to detect css leaks.

To solve this problem, add a cgroup.stat interface to
the base cgroup control files with the following metrics:

nr_descendants		total number of visible descendant cgroups
nr_dying_descendants	total number of dying descendant cgroups
Signed-off-by: NRoman Gushchin <guro@fb.com>
Suggested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

ec39225c

cgroup: implement hierarchy limits · 1a926e0b

由 Roman Gushchin 提交于 7月 28, 2017

Creating cgroup hierearchies of unreasonable size can affect
overall system performance. A user might want to limit the
size of cgroup hierarchy. This is especially important if a user
is delegating some cgroup sub-tree.

To address this issue, introduce an ability to control
the size of cgroup hierarchy.

The cgroup.max.descendants control file allows to set the maximum
allowed number of descendant cgroups.
The cgroup.max.depth file controls the maximum depth of the cgroup
tree. Both are single value r/w files, with "max" default value.

The control files exist on each hierarchy level (including root).
When a new cgroup is created, we check the total descendants
and depth limits on each level, and if none of them are exceeded,
a new cgroup is created.

Only alive cgroups are counted, removed (dying) cgroups are
ignored.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Suggested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

1a926e0b

cgroup: keep track of number of descent cgroups · 0679dee0

由 Roman Gushchin 提交于 8月 02, 2017

Keep track of the number of online and dying descent cgroups.

This data will be used later to add an ability to control cgroup
hierarchy (limit the depth and the number of descent cgroups)
and display hierarchy stats.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Suggested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

0679dee0

29 7月, 2017 2 次提交

blktrace: add an option to allow displaying cgroup path · 69fd5c39

由 Shaohua Li 提交于 7月 12, 2017

By default we output cgroup id in blktrace. This adds an option to
display cgroup path. Since get cgroup path is a relativly heavy
operation, we don't enable it by default.

with the option enabled, blktrace will output something like this:
dd-1353  [007] d..2   293.015252:   8,0   /test/level  D   R 24 + 8 [dd]
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

69fd5c39

kernfs: add exportfs operations · aa818825

由 Shaohua Li 提交于 7月 12, 2017

Now we have the facilities to implement exportfs operations. The idea is
cgroup can export the fhandle info to userspace, then userspace uses
fhandle to find the cgroup name. Another example is userspace can get
fhandle for a cgroup and BPF uses the fhandle to filter info for the
cgroup.
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

aa818825

26 7月, 2017 2 次提交

cgroup: add comment to cgroup_enable_threaded() · c705a00d

由 Tejun Heo 提交于 7月 25, 2017

Explain cgroup_enable_threaded() and note that the function can never
be called on the root cgroup.
Signed-off-by: NTejun Heo <tj@kernel.org>
Suggested-by: NWaiman Long <longman@redhat.com>

c705a00d

cgroup: remove unnecessary empty check when enabling threaded mode · 918a8c2c

由 Tejun Heo 提交于 7月 23, 2017

cgroup_enable_threaded() checks that the cgroup doesn't have any tasks
or children and fails the operation if so.  This test is unnecessary
because the first part is already checked by
cgroup_can_be_thread_root() and the latter is unnecessary.  The latter
actually cause a behavioral oddity.  Please consider the following
hierarchy.  All cgroups are domains.

    A
   / \
  B   C
       \
        D

If B is made threaded, C and D becomes invalid domains.  Due to the no
children restriction, threaded mode can't be enabled on C.  For C and
D, the only thing the user can do is removal.

There is no reason for this restriction.  Remove it.
Acked-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

918a8c2c

23 7月, 2017 1 次提交

cgroup: fix error return value from cgroup_subtree_control() · 3c745417

由 Tejun Heo 提交于 7月 23, 2017

While refactoring, f7b2814b ("cgroup: factor out
cgroup_{apply|finalize}_control() from
cgroup_subtree_control_write()") broke error return value from the
function.  The return value from the last operation is always
overridden to zero.  Fix it.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.6+
Signed-off-by: NTejun Heo <tj@kernel.org>

3c745417

21 7月, 2017 6 次提交

cgroup: update debug controller to print out thread mode information · 7a0cf0e7

由 Waiman Long 提交于 7月 21, 2017

Update debug controller so that it prints out debug info about thread
mode.

 1) The relationship between proc_cset and threaded_csets are displayed.
 2) The status of being a thread root or threaded cgroup is displayed.

This patch is extracted from Waiman's larger patch.

v2: - Removed [thread root] / [threaded] from debug.cgroup_css_links
      file as the same information is available from cgroup.type.
      Suggested by Waiman.
    - Threaded marking is moved to the previous patch.
Patch-originally-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

7a0cf0e7

cgroup: implement cgroup v2 thread support · 8cfd8147

由 Tejun Heo 提交于 7月 21, 2017

This patch implements cgroup v2 thread support.  The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.

A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.

The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint.  Note that the threads aren't allowed
to escape to a different threaded subtree.  To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.

The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree.  This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted.  This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.

As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.

Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.

This patch enables threaded mode for the pids and perf_events
controllers.  Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.

For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.

v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
      Spotted by Waiman.
    - Documentation updated as suggested by Waiman.
    - cgroup.type content slightly reformatted.
    - Mark the debug controller threaded.

v4: - Updated to the general idea of marking specific cgroups
      domain/threaded as suggested by PeterZ.

v3: - Dropped "join" and always make mixed children join the parent's
      threaded subtree.

v2: - After discussions with Waiman, support for mixed thread mode is
      added.  This should address the issue that Peter pointed out
      where any nesting should be avoided for thread subtrees while
      coexisting with other domain cgroups.
    - Enabling / disabling thread mode now piggy backs on the existing
      control mask update mechanism.
    - Bug fixes and cleanup.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

8cfd8147

cgroup: implement CSS_TASK_ITER_THREADED · 450ee0c1

由 Tejun Heo 提交于 5月 15, 2017

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the dom_cgrp to which the processes of the subtree conceptually belong
and domain-level resource consumptions not tied to any specific task
are charged.  In the subtree, threads won't be subject to process
granularity or no-internal-task constraint and can be distributed
arbitrarily across the subtree.

This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a dom_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets.  "cgroup.procs" read path is
updated to use it so that reading the file on a proc_cgrp lists all
processes.  This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.

Task iteration is implemented nested in css_set iteration.  If
CSS_TASK_ITER_THREADED is specified, after walking tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving onto the next !threaded css_set.

v2: ->cur_pcset renamed to ->cur_dcset.  Updated for the new
    enable-threaded-per-cgroup behavior.
Signed-off-by: NTejun Heo <tj@kernel.org>

450ee0c1

cgroup: introduce cgroup->dom_cgrp and threaded css_set handling · 454000ad

由 Tejun Heo 提交于 5月 15, 2017

cgroup v2 is in the process of growing thread granularity support.  A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.

The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged.  Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.

This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.

* cgroup->dom_cgrp points to self for normal and thread roots.  For
  proper thread subtree members, points to the dom_cgrp (the thread
  root).

* css_set->dom_cset points to self if for normal and thread roots.  If
  threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
  The dom_cgrp serves as the resource domain and keeps the matching
  csses available.  The dom_cset holds those csses and makes them
  easily accessible.

* All threaded csets are linked on their dom_csets to enable iteration
  of all threaded tasks.

* cgroup->nr_threaded_children keeps track of the number of threaded
  children.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.

v4: ->nr_threaded_children added.

v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset.  Updated for the new
    enable-threaded-per-cgroup behavior.

v2: Added cgroup_is_threaded() helper.
Signed-off-by: NTejun Heo <tj@kernel.org>

454000ad

cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed

由 Tejun Heo 提交于 5月 15, 2017

css_task_iter currently always walks all tasks.  With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration.  As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
is not specified, it walks all tasks as before.  When asserted, the
iterator only walks the group leaders.

For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".

While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr().  As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
Signed-off-by: NTejun Heo <tj@kernel.org>

bc2fb7ed

cgroup: reorganize cgroup.procs / task write path · 715c809d

由 Tejun Heo 提交于 5月 15, 2017

Currently, writes "cgroup.procs" and "cgroup.tasks" files are all
handled by __cgroup_procs_write() on both v1 and v2.  This patch
reoragnizes the write path so that there are common helper functions
that different write paths use.

While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.

v3: - Restructured so that cgroup_procs_write_permission() takes
      @src_cgrp and @dst_cgrp.

v2: - Rolled in Waiman's task reference count fix.
    - Updated on top of nsdelegate changes.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>

715c809d

19 7月, 2017 1 次提交

cgroup: create dfl_root files on subsys registration · 7af608e4

由 Tejun Heo 提交于 7月 18, 2017

On subsystem registration, css_populate_dir() is not called on the new
root css, so the interface files for the subsystem on cgrp_dfl_root
aren't created on registration.  This is a residue from the days when
cgrp_dfl_root was used only as the parking spot for unused subsystems,
which no longer is true as it's used as the root for cgroup2.

This is often fine as later operations tend to create them as a part
of mount (cgroup1) or subtree_control operations (cgroup2); however,
it's not difficult to mount cgroup2 with the controller interface
files missing as Waiman found out.

Fix it by invoking css_populate_dir() on the root css on subsys
registration.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-and-tested-by: NWaiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: NTejun Heo <tj@kernel.org>

7af608e4

17 7月, 2017 1 次提交

cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets · 27f26753

由 Tejun Heo 提交于 7月 16, 2017

Implement trivial cgroup_has_tasks() which tests whether
cgrp->nr_populated_csets is zero and replace the explicit local
populated test in cgroup_subtree_control().  This simplifies the code
and cgroup_has_tasks() will be used in more places later.
Signed-off-by: NTejun Heo <tj@kernel.org>

27f26753

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功