提交 · 1599a185f0e6113be185b9fb809c621c73865829 · openanolis / cloud-kernel

28 11月, 2017 2 次提交

cpuset: Make cpuset hotplug synchronous · 1599a185

由 Prateek Sood 提交于 11月 15, 2017

Convert cpuset_hotplug_workfn() into synchronous call for cpu hotplug
path. For memory hotplug path it still gets queued as a work item.

Since cpuset_hotplug_workfn() can be made synchronous for cpu hotplug
path, it is not required to wait for cpuset hotplug while thawing
processes.
Signed-off-by: NPrateek Sood <prsood@codeaurora.org>
Signed-off-by: NTejun Heo <tj@kernel.org>

1599a185

cgroup/cpuset: remove circular dependency deadlock · aa24163b

由 Prateek Sood 提交于 11月 15, 2017

Remove circular dependency deadlock in a scenario where hotplug of CPU is
being done while there is updation in cgroup and cpuset triggered from
userspace.

Process A => kthreadd => Process B => Process C => Process A

Process A
cpu_subsys_offline();
  cpu_down();
    _cpu_down();
      percpu_down_write(&cpu_hotplug_lock); //held
      cpuhp_invoke_callback();
	     workqueue_offline_cpu();
            queue_work_on(); // unbind_work on system_highpri_wq
               __queue_work();
                 insert_work();
                    wake_up_worker();
            flush_work();
               wait_for_completion();

worker_thread();
   manage_workers();
      create_worker();
	     kthread_create_on_node();
		    wake_up_process(kthreadd_task);

kthreadd
kthreadd();
  kernel_thread();
    do_fork();
      copy_process();
        percpu_down_read(&cgroup_threadgroup_rwsem);
          __rwsem_down_read_failed_common(); //waiting

Process B
kernfs_fop_write();
  cgroup_file_write();
    cgroup_procs_write();
      percpu_down_write(&cgroup_threadgroup_rwsem); //held
      cgroup_attach_task();
        cgroup_migrate();
          cgroup_migrate_execute();
            cpuset_can_attach();
              mutex_lock(&cpuset_mutex); //waiting

Process C
kernfs_fop_write();
  cgroup_file_write();
    cpuset_write_resmask();
      mutex_lock(&cpuset_mutex); //held
      update_cpumask();
        update_cpumasks_hier();
          rebuild_sched_domains_locked();
            get_online_cpus();
              percpu_down_read(&cpu_hotplug_lock); //waiting

Eliminating deadlock by reversing the locking order for cpuset_mutex and
cpu_hotplug_lock.
Signed-off-by: NPrateek Sood <prsood@codeaurora.org>
Signed-off-by: NTejun Heo <tj@kernel.org>

aa24163b

07 11月, 2017 2 次提交

cgroup: export list of cgroups v2 features using sysfs · 5f2e6734

由 Roman Gushchin 提交于 11月 06, 2017

The active development of cgroups v2 sometimes leads to a creation
of interfaces, which are not turned on by default (to provide
backward compatibility). It's handy to know from userspace, which
cgroup v2 features are supported without calculating it based
on the kernel version. So, let's export the list of such features
using /sys/kernel/cgroup/features pseudo-file.

The list is hardcoded and has to be extended when new functionality
is added. Each feature is printed on a new line.

Example:
  $ cat /sys/kernel/cgroup/features
  nsdelegate
Signed-off-by: NRoman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: NTejun Heo <tj@kernel.org>

5f2e6734

cgroup: export list of delegatable control files using sysfs · 01ee6cfb

由 Roman Gushchin 提交于 11月 06, 2017

Delegatable cgroup v2 control files may require special handling
(e.g. chowning), and the exact list of such files varies between
kernel versions (and likely to be extended in the future).

To guarantee correctness of this list and simplify the life
of userspace (systemd, first of all), let's export the list
via /sys/kernel/cgroup/delegate pseudo-file.

Format is siple: each control file name is printed on a new line.
Example:
  $ cat /sys/kernel/cgroup/delegate
  cgroup.procs
  cgroup.subtree_control
Signed-off-by: NRoman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: NTejun Heo <tj@kernel.org>

01ee6cfb

02 11月, 2017 1 次提交

License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318

由 Greg Kroah-Hartman 提交于 11月 01, 2017

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

b2441318

30 10月, 2017 1 次提交

cgroup: mark @cgrp __maybe_unused in cpu_stat_show() · c3ba1329

由 Tejun Heo 提交于 10月 30, 2017

The local variable @cgrp isn't used if !CONFIG_CGROUP_SCHED.  Mark the
variable with __maybe_unused to avoid a compile warning.
Reported-by: N"kbuild-all@01.org" <kbuild-all@01.org>
Signed-off-by: NTejun Heo <tj@kernel.org>

c3ba1329

27 10月, 2017 2 次提交

sched/isolation: Move isolcpus= handling to the housekeeping code · edb93821

由 Frederic Weisbecker 提交于 10月 27, 2017

We want to centralize the isolation features, to be done by the housekeeping
subsystem and scheduler domain isolation is a significant part of it.

No intended behaviour change, we just reuse the housekeeping cpumask
and core code.
Signed-off-by: NFrederic Weisbecker <frederic@kernel.org>
Acked-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1509072159-31808-11-git-send-email-frederic@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

edb93821

cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat · d41bf8c9

由 Tejun Heo 提交于 10月 23, 2017

The basic cpu stat is currently shown with "cpu." prefix in
cgroup.stat, and the same information is duplicated in cpu.stat when
cpu controller is enabled.  This is ugly and not very scalable as we
want to expand the coverage of stat information which is always
available.

This patch makes cgroup core always create "cpu.stat" file and show
the basic cpu stat there and calls the cpu controller to show the
extra stats when enabled.  This ensures that the same information
isn't presented in multiple places and makes future expansion of basic
stats easier.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>

d41bf8c9

11 10月, 2017 1 次提交

Revert "net: defer call to cgroup_sk_alloc()" · 75cb0709

由 Eric Dumazet 提交于 10月 10, 2017

This reverts commit fbb1fb4a.

This was not the proper fix, lets cleanly revert it, so that
following patch can be carried to stable versions.

sock_cgroup_ptr() callers do not expect a NULL return value.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

75cb0709

10 10月, 2017 1 次提交

net: defer call to cgroup_sk_alloc() · fbb1fb4a

由 Eric Dumazet 提交于 10月 08, 2017

sk_clone_lock() might run while TCP/DCCP listener already vanished.

In order to prevent use after free, it is better to defer cgroup_sk_alloc()
to the point we know both parent and child exist, and from process context.

Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fbb1fb4a

05 10月, 2017 2 次提交

bpf: introduce BPF_PROG_QUERY command · 468e2f64

由 Alexei Starovoitov 提交于 10月 02, 2017

introduce BPF_PROG_QUERY command to retrieve a set of either
attached programs to given cgroup or a set of effective programs
that will execute for events within a cgroup
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

468e2f64

bpf: multi program support for cgroup+bpf · 324bda9e

由 Alexei Starovoitov 提交于 10月 02, 2017

introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
bpf programs to a cgroup.

The difference between three possible flags for BPF_PROG_ATTACH command:
- NONE(default): No further bpf programs allowed in the subtree.
- BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
  the program in this cgroup yields to sub-cgroup program.
- BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
  that cgroup program gets run in addition to the program in this cgroup.

NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
change their behavior. It only clarifies the semantics in relation
to new flag.

Only one program is allowed to be attached to a cgroup with
NONE or BPF_F_ALLOW_OVERRIDE flag.
Multiple programs are allowed to be attached to a cgroup with
BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
(those that were attached first, run first)
The programs of sub-cgroup are executed first, then programs of
this cgroup and then programs of parent cgroup.
All eligible programs are executed regardless of return code from
earlier programs.

To allow efficient execution of multiple programs attached to a cgroup
and to avoid penalizing cgroups without any programs attached
introduce 'struct bpf_prog_array' which is RCU protected array
of pointers to bpf programs.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

324bda9e

26 9月, 2017 1 次提交

cgroup: statically initialize init_css_set->dfl_cgrp · 38683148

由 Tejun Heo 提交于 9月 25, 2017

Like other csets, init_css_set's dfl_cgrp is initialized when the cset
gets linked.  init_css_set gets linked in cgroup_init().  This has
been fine till now but the recently added basic CPU usage accounting
may end up accessing dfl_cgrp of init before cgroup_init() leading to
the following oops.

  SELinux:  Initializing.
  BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
  IP: account_system_index_time+0x60/0x90
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc2-00003-g041cd640 #10
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
  +1.9.3-20161025_171302-gandalf 04/01/2014
  task: ffffffff81e10480 task.stack: ffffffff81e00000
  RIP: 0010:account_system_index_time+0x60/0x90
  RSP: 0000:ffff880011e03cb8 EFLAGS: 00010002
  RAX: ffffffff81ef8800 RBX: ffffffff81e10480 RCX: 0000000000000003
  RDX: 0000000000000000 RSI: 00000000000f4240 RDI: 0000000000000000
  RBP: ffff880011e03cc0 R08: 0000000000010000 R09: 0000000000000000
  R10: 0000000000000020 R11: 0000003b9aca0000 R12: 000000000001c100
  R13: 0000000000000000 R14: ffffffff81e10480 R15: ffffffff81e03cd8
  FS:  0000000000000000(0000) GS:ffff880011e00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000b0 CR3: 0000000001e09000 CR4: 00000000000006b0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   account_system_time+0x45/0x60
   account_process_tick+0x5a/0x140
   update_process_times+0x22/0x60
   tick_periodic+0x2b/0x90
   tick_handle_periodic+0x25/0x70
   timer_interrupt+0x15/0x20
   __handle_irq_event_percpu+0x7e/0x1b0
   handle_irq_event_percpu+0x23/0x60
   handle_irq_event+0x42/0x70
   handle_level_irq+0x83/0x100
   handle_irq+0x6f/0x110
   do_IRQ+0x46/0xd0
   common_interrupt+0x9d/0x9d

Fix it by statically initializing init_css_set.dfl_cgrp so that init's
default cgroup is accessible from the get-go.

Fixes: 041cd640 ("cgroup: Implement cgroup2 basic CPU usage accounting")
Reported-by: N“kbuild-all@01.org” <kbuild-all@01.org>
Signed-off-by: NTejun Heo <tj@kernel.org>

38683148

25 9月, 2017 1 次提交

cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640

由 Tejun Heo 提交于 9月 25, 2017

In cgroup1, while cpuacct isn't actually controlling any resources, it
is a separate controller due to combination of two factors -
1. enabling cpu controller has significant side effects, and 2. we
have to pick one of the hierarchies to account CPU usages on.  cpuacct
controller is effectively used to designate a hierarchy to track CPU
usages on.

cgroup2's unified hierarchy removes the second reason and we can
account basic CPU usages by default.  While we can use cpuacct for
this purpose, both its interface and implementation leave a lot to be
desired - it collects and exposes two sources of truth which don't
agree with each other and some of the exposed statistics don't make
much sense.  Also, it propagates all the way up the hierarchy on each
accounting event which is unnecessary.

This patch adds basic resource accounting mechanism to cgroup2's
unified hierarchy and accounts CPU usages using it.

* All accountings are done per-cpu and don't propagate immediately.
  It just bumps the per-cgroup per-cpu counters and links to the
  parent's updated list if not already on it.

* On a read, the per-cpu counters are collected into the global ones
  and then propagated upwards.  Only the per-cpu counters which have
  changed since the last read are propagated.

* CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
  prefix.  Total usage is collected from scheduling events.  User/sys
  breakdown is sourced from tick sampling and adjusted to the usage
  using cputime_adjust().

This keeps the accounting side hot path O(1) and per-cpu and the read
side O(nr_updated_since_last_read).

v2: Minor changes and documentation updates as suggested by Waiman and
    Roman.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Roman Gushchin <guro@fb.com>

041cd640

22 9月, 2017 1 次提交

cgroup: Reinit cgroup_taskset structure before cgroup_migrate_execute() returns · c4fa6c43

由 Waiman Long 提交于 9月 21, 2017

The cgroup_taskset structure within the larger cgroup_mgctx structure
is supposed to be used once and then discarded. That is not really the
case in the hotplug code path:

cpuset_hotplug_workfn()
 - cgroup_transfer_tasks()
   - cgroup_migrate()
     - cgroup_migrate_add_task()
     - cgroup_migrate_execute()

In this case, the cgroup_migrate() function is called multiple time
with the same cgroup_mgctx structure to transfer the tasks from
one cgroup to another one-by-one. The second time cgroup_migrate()
is called, the cgroup_taskset will be in an incorrect state and so
may cause the system to panic. For example,

  [  150.888410] Faulting instruction address: 0xc0000000001db648
  [  150.888414] Oops: Kernel access of bad area, sig: 11 [#1]
  [  150.888417] SMP NR_CPUS=2048
  [  150.888417] NUMA
  [  150.888419] pSeries
    :
  [  150.888545] NIP [c0000000001db648] cpuset_can_attach+0x58/0x1b0
  [  150.888548] LR [c0000000001db638] cpuset_can_attach+0x48/0x1b0
  [  150.888551] Call Trace:
  [  150.888554] [c0000005f65cb940] [c0000000001db638] cpuset_can_attach+0x48/0x1b 0 (unreliable)
  [  150.888559] [c0000005f65cb9a0] [c0000000001cff04] cgroup_migrate_execute+0xc4/0x4b0
  [  150.888563] [c0000005f65cba20] [c0000000001d7d14] cgroup_transfer_tasks+0x1d4/0x370
  [  150.888568] [c0000005f65cbb70] [c0000000001ddcb0] cpuset_hotplug_workfn+0x710/0x8f0
  [  150.888572] [c0000005f65cbc80] [c00000000012032c] process_one_work+0x1ac/0x4d0
  [  150.888576] [c0000005f65cbd20] [c0000000001206f8] worker_thread+0xa8/0x5b0
  [  150.888580] [c0000005f65cbdc0] [c0000000001293f8] kthread+0x168/0x1b0
  [  150.888584] [c0000005f65cbe30] [c00000000000b368] ret_from_kernel_thread+0x5c/0x74

To allow reuse of the cgroup_mgctx structure, some fields in that
structure are now re-initialized at the end of cgroup_migrate_execute()
function call so that the structure can be reused again in a later
iteration without causing problem.

This bug was introduced in the commit e595cd70 ("group: track
migration context in cgroup_mgctx") in 4.11. This commit moves the
cgroup_taskset initialization out of cgroup_migrate(). The commit
10467270fb3 ("cgroup: don't call migration methods if there are no
tasks to migrate") helped, but did not completely resolve the problem.

Fixes: e595cd70 ("group: track migration context in cgroup_mgctx")
Signed-off-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.11+

c4fa6c43

07 9月, 2017 3 次提交

sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs · 50e76632

由 Peter Zijlstra 提交于 9月 07, 2017

Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
because it now resulted in non-cpuset usage breaking too.

On suspend cpuset_cpu_inactive() doesn't call into
cpuset_update_active_cpus() because it doesn't want to move tasks about,
there is no need, all tasks are frozen and won't run again until after
we've resumed everything.

But this means that when we finally do call into
cpuset_update_active_cpus() after resuming the last frozen cpu in
cpuset_cpu_active(), the top_cpuset will not have any difference with
the cpu_active_mask and this it will not in fact do _anything_.

So the cpuset configuration will not be restored. This was largely
hidden because we would unconditionally create identity domains and
mobile users would not in fact use cpusets much. And servers what do use
cpusets tend to not suspend-resume much.

An addition problem is that we'd not in fact wait for the cpuset work to
finish before resuming the tasks, allowing spurious migrations outside
of the specified domains.

Fix the rebuild by introducing cpuset_force_rebuild() and fix the
ordering with cpuset_wait_for_hotplug().
Reported-by: NAndy Lutomirski <luto@kernel.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: deb7aa30 ("cpuset: reorganize CPU / memory hotplug handling")
Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

50e76632

mm: replace TIF_MEMDIE checks by tsk_is_oom_victim · da99ecf1

由 Michal Hocko 提交于 9月 06, 2017

TIF_MEMDIE is set only to the tasks whick were either directly selected
by the OOM killer or passed through mark_oom_victim from the allocator
path. tsk_is_oom_victim is more generic and allows to identify all
tasks (threads) which share the mm with the oom victim.

Please note that the freezer still needs to check TIF_MEMDIE because we
cannot thaw tasks which do not participage in oom_victims counting
otherwise a !TIF_MEMDIE task could interfere after oom_disbale returns.

Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

da99ecf1

cgroup: revert ("cgroup: reset css on destruction") · 65f3975f

由 Roman Gushchin 提交于 9月 06, 2017

Commit fa06235b ("cgroup: reset css on destruction") caused
css_reset callback to be called from the offlining path.  Although it
solves the problem mentioned in the commit description ("For instance,
memory cgroup needs to reset memory.low, otherwise pages charged to a
dead cgroup might never get reclaimed."), generally speaking, it's not
correct.

An offline cgroup can still be a resource domain, and we shouldn't grant
it more resources than it had before deletion.

For instance, if an offline memory cgroup has dirty pages, we should
still imply i/o limits during writeback.

The css_reset callback is designed to return the cgroup state into the
original state, that means reset all limits and counters.  It's
spomething different from the offlining, and we shouldn't use it from
the offlining path.  Instead, we should adjust necessary settings from
the per-controller css_offline callbacks (e.g.  reset memory.low).

Link: http://lkml.kernel.org/r/20170727130428.28856-2-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
Acked-by: NTejun Heo <tj@kernel.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

65f3975f

25 8月, 2017 2 次提交

sched/topology, cpuset: Avoid spurious/wrong domain rebuilds · 77d1dfda

由 Peter Zijlstra 提交于 8月 08, 2017

When disabling cpuset.sched_load_balance we expect to be able to online
CPUs without generating sched_domains. However this is currently
completely broken.

What happens is that we generate the sched_domains and then destroy
them. This is because of the spurious 'default' domain build in
cpuset_update_active_cpus(). That builds a single machine wide domain
and then schedules a work to build the 'real' domains. The work then
finds there are _no_ domains and destroys the lot again.

Furthermore, if there actually were cpusets, building the machine wide
domain is actively wrong, because it would allow tasks to 'escape' their
cpuset. Also I don't think its needed, the scheduler really should
respect the active mask.
Reported-by: NOfer Levi(SW) <oferle@mellanox.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

77d1dfda

cpuset: Fix incorrect memory_pressure control file mapping · 1c08c22c

由 Waiman Long 提交于 8月 24, 2017

The memory_pressure control file was incorrectly set up without
a private value (0, by default). As a result, this control
file was treated like memory_migrate on read. By adding back the
FILE_MEMORY_PRESSURE private value, the correct memory pressure value
will be returned.
Signed-off-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Fixes: 7dbdb199 ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
Cc: stable@vger.kernel.org # v4.4+

1c08c22c

18 8月, 2017 2 次提交

cpuset: Allow v2 behavior in v1 cgroup · b8d1b8ee

由 Waiman Long 提交于 8月 17, 2017

Cpuset v2 has some useful behaviors that are not present in v1 because
of backward compatibility concern. One of that is the restoration of
the original cpu and memory node mask after a hot removal and addition
event sequence.

This patch makes the cpuset controller to check the
CGRP_ROOT_CPUSET_V2_MODE flag and use the v2 behavior if it is set.
Signed-off-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

b8d1b8ee

cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup · e1cba4b8

由 Waiman Long 提交于 8月 17, 2017

A new mount option "cpuset_v2_mode" is added to the v1 cgroupfs
filesystem to enable cpuset controller to use v2 behavior in a v1
cgroup. This mount option applies only to cpuset controller and have
no effect on other controllers.
Signed-off-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

e1cba4b8

12 8月, 2017 1 次提交

cgroup: remove unneeded checks · 696b98f2

由 Dan Carpenter 提交于 8月 09, 2017

"descendants" and "depth" are declared as int, so they can't be larger
than INT_MAX.  Static checkers complain and it's slightly confusing for
humans as well so let's just remove these conditions.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

696b98f2

11 8月, 2017 1 次提交

cgroup: misc changes · 3e48930c

由 Tejun Heo 提交于 8月 11, 2017

Misc trivial changes to prepare for future changes.  No functional
difference.

* Expose cgroup_get(), cgroup_tryget() and cgroup_parent().

* Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.

* Rename cgroup_stats_show() to cgroup_stat_show() for consistency
  with the file name.
Signed-off-by: NTejun Heo <tj@kernel.org>

3e48930c

10 8月, 2017 1 次提交

cpuset: Make nr_cpusets private · be040bea

由 Paolo Bonzini 提交于 8月 01, 2017

Any use of key->enabled (that is static_key_enabled and static_key_count)
outside jump_label_lock should handle its own serialization.  In the case
of cpusets_enabled_key, the key is always incremented/decremented under
cpuset_mutex, and hence the same rule applies to nr_cpusets.  The rule
*is* respected currently, but the mutex is static so nr_cpusets should
be static too.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NZefan Li <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1501601046-35683-4-git-send-email-pbonzini@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

be040bea

03 8月, 2017 6 次提交

cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5

由 Dima Zavin 提交于 8月 02, 2017

In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.

This problem manifested itself as a deadlock in the slub allocator,
inside get_any_partial.  The loop reads mems_allowed_seq value (via
read_mems_allowed_begin), performs the defrag operation, and then
verifies the consistency of mem_allowed via the read_mems_allowed_retry
and the cookie returned by xxx_begin.

The issue here is that both begin and retry first check if cpusets are
enabled via cpusets_enabled() static branch.  This branch can be
rewritted dynamically (via cpuset_inc) if a new cpuset is created.  The
x86 jump label code fully synchronizes across all CPUs for every entry
it rewrites.  If it rewrites only one of the callsites (specifically the
one in read_mems_allowed_retry) and then waits for the
smp_call_function(do_sync_core) to complete while a CPU is inside the
begin/retry section with IRQs off and the mems_allowed value is changed,
we can hang.

This is because begin() will always return 0 (since it wasn't patched
yet) while retry() will test the 0 against the actual value of the seq
counter.

The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always return a valid seqcount if are enabling cpusets.  Similarly, when
disabling cpusets via cpuset_dec(), we first ensure that callers of
cpuset_mems_allowed_retry() will start ignoring the seqcount value
before we let cpuset_mems_allowed_begin() return 0.

The relevant stack traces of the two stuck threads:

  CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
  Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
  task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
  RIP: smp_call_function_many+0x1f9/0x260
  Call Trace:
    smp_call_function+0x3b/0x70
    on_each_cpu+0x2f/0x90
    text_poke_bp+0x87/0xd0
    arch_jump_label_transform+0x93/0x100
    __jump_label_update+0x77/0x90
    jump_label_update+0xaa/0xc0
    static_key_slow_inc+0x9e/0xb0
    cpuset_css_online+0x70/0x2e0
    online_css+0x2c/0xa0
    cgroup_apply_control_enable+0x27f/0x3d0
    cgroup_mkdir+0x2b7/0x420
    kernfs_iop_mkdir+0x5a/0x80
    vfs_mkdir+0xf6/0x1a0
    SyS_mkdir+0xb7/0xe0
    entry_SYSCALL_64_fastpath+0x18/0xad

  ...

  CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
  Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
  task: ffff8818087c0000 task.stack: ffffc90000030000
  RIP: int3+0x39/0x70
  Call Trace:
    <#DB> ? ___slab_alloc+0x28b/0x5a0
    <EOE> ? copy_process.part.40+0xf7/0x1de0
    __slab_alloc.isra.80+0x54/0x90
    copy_process.part.40+0xf7/0x1de0
    copy_process.part.40+0xf7/0x1de0
    kmem_cache_alloc_node+0x8a/0x280
    copy_process.part.40+0xf7/0x1de0
    _do_fork+0xe7/0x6c0
    _raw_spin_unlock_irq+0x2d/0x60
    trace_hardirqs_on_caller+0x136/0x1d0
    entry_SYSCALL_64_fastpath+0x5/0xad
    do_syscall_64+0x27/0x350
    SyS_clone+0x19/0x20
    do_syscall_64+0x60/0x350
    entry_SYSCALL64_slow_path+0x25/0x25

Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
Signed-off-by: NDima Zavin <dmitriyz@waymo.com>
Reported-by: NCliff Spradlin <cspradlin@waymo.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

89affbf5

cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy · 13d82fb7

由 Tejun Heo 提交于 8月 02, 2017

Each css_set directly points to the default cgroup it belongs to, so
there's no reason to walk the cgrp_links list on the default
hierarchy.
Signed-off-by: NTejun Heo <tj@kernel.org>

13d82fb7

cgroup: re-use the parent pointer in cgroup_destroy_locked() · 5a621e6c

由 Roman Gushchin 提交于 8月 02, 2017

As we already have a pointer to the parent cgroup in
cgroup_destroy_locked(), we don't need to calculate it again
to pass as an argument for cgroup1_check_for_release().
Signed-off-by: NRoman Gushchin <guro@fb.com>
Suggested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: linux-kernel@vger.kernel.org

5a621e6c

cgroup: add cgroup.stat interface with basic hierarchy stats · ec39225c

由 Roman Gushchin 提交于 8月 02, 2017

A cgroup can consume resources even after being deleted by a user.
For example, writing back dirty pages should be accounted and
limited, despite the corresponding cgroup might contain no processes
and being deleted by a user.

In the current implementation a cgroup can remain in such "dying" state
for an undefined amount of time. For instance, if a memory cgroup
contains a pge, mlocked by a process belonging to an other cgroup.

Although the lifecycle of a dying cgroup is out of user's control,
it's important to have some insight of what's going on under the hood.

In particular, it's handy to have a counter which will allow
to detect css leaks.

To solve this problem, add a cgroup.stat interface to
the base cgroup control files with the following metrics:

nr_descendants		total number of visible descendant cgroups
nr_dying_descendants	total number of dying descendant cgroups
Signed-off-by: NRoman Gushchin <guro@fb.com>
Suggested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

ec39225c

cgroup: implement hierarchy limits · 1a926e0b

由 Roman Gushchin 提交于 7月 28, 2017

Creating cgroup hierearchies of unreasonable size can affect
overall system performance. A user might want to limit the
size of cgroup hierarchy. This is especially important if a user
is delegating some cgroup sub-tree.

To address this issue, introduce an ability to control
the size of cgroup hierarchy.

The cgroup.max.descendants control file allows to set the maximum
allowed number of descendant cgroups.
The cgroup.max.depth file controls the maximum depth of the cgroup
tree. Both are single value r/w files, with "max" default value.

The control files exist on each hierarchy level (including root).
When a new cgroup is created, we check the total descendants
and depth limits on each level, and if none of them are exceeded,
a new cgroup is created.

Only alive cgroups are counted, removed (dying) cgroups are
ignored.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Suggested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

1a926e0b

cgroup: keep track of number of descent cgroups · 0679dee0

由 Roman Gushchin 提交于 8月 02, 2017

Keep track of the number of online and dying descent cgroups.

This data will be used later to add an ability to control cgroup
hierarchy (limit the depth and the number of descent cgroups)
and display hierarchy stats.
Signed-off-by: NRoman Gushchin <guro@fb.com>
Suggested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

0679dee0

29 7月, 2017 2 次提交

blktrace: add an option to allow displaying cgroup path · 69fd5c39

由 Shaohua Li 提交于 7月 12, 2017

By default we output cgroup id in blktrace. This adds an option to
display cgroup path. Since get cgroup path is a relativly heavy
operation, we don't enable it by default.

with the option enabled, blktrace will output something like this:
dd-1353  [007] d..2   293.015252:   8,0   /test/level  D   R 24 + 8 [dd]
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

69fd5c39

kernfs: add exportfs operations · aa818825

由 Shaohua Li 提交于 7月 12, 2017

Now we have the facilities to implement exportfs operations. The idea is
cgroup can export the fhandle info to userspace, then userspace uses
fhandle to find the cgroup name. Another example is userspace can get
fhandle for a cgroup and BPF uses the fhandle to filter info for the
cgroup.
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

aa818825

26 7月, 2017 2 次提交

cgroup: add comment to cgroup_enable_threaded() · c705a00d

由 Tejun Heo 提交于 7月 25, 2017

Explain cgroup_enable_threaded() and note that the function can never
be called on the root cgroup.
Signed-off-by: NTejun Heo <tj@kernel.org>
Suggested-by: NWaiman Long <longman@redhat.com>

c705a00d

cgroup: remove unnecessary empty check when enabling threaded mode · 918a8c2c

由 Tejun Heo 提交于 7月 23, 2017

cgroup_enable_threaded() checks that the cgroup doesn't have any tasks
or children and fails the operation if so.  This test is unnecessary
because the first part is already checked by
cgroup_can_be_thread_root() and the latter is unnecessary.  The latter
actually cause a behavioral oddity.  Please consider the following
hierarchy.  All cgroups are domains.

    A
   / \
  B   C
       \
        D

If B is made threaded, C and D becomes invalid domains.  Due to the no
children restriction, threaded mode can't be enabled on C.  For C and
D, the only thing the user can do is removal.

There is no reason for this restriction.  Remove it.
Acked-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

918a8c2c

23 7月, 2017 1 次提交

cgroup: fix error return value from cgroup_subtree_control() · 3c745417

由 Tejun Heo 提交于 7月 23, 2017

While refactoring, f7b2814b ("cgroup: factor out
cgroup_{apply|finalize}_control() from
cgroup_subtree_control_write()") broke error return value from the
function.  The return value from the last operation is always
overridden to zero.  Fix it.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.6+
Signed-off-by: NTejun Heo <tj@kernel.org>

3c745417

21 7月, 2017 4 次提交

cgroup: update debug controller to print out thread mode information · 7a0cf0e7

由 Waiman Long 提交于 7月 21, 2017

Update debug controller so that it prints out debug info about thread
mode.

 1) The relationship between proc_cset and threaded_csets are displayed.
 2) The status of being a thread root or threaded cgroup is displayed.

This patch is extracted from Waiman's larger patch.

v2: - Removed [thread root] / [threaded] from debug.cgroup_css_links
      file as the same information is available from cgroup.type.
      Suggested by Waiman.
    - Threaded marking is moved to the previous patch.
Patch-originally-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

7a0cf0e7

cgroup: implement cgroup v2 thread support · 8cfd8147

由 Tejun Heo 提交于 7月 21, 2017

This patch implements cgroup v2 thread support.  The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.

A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.

The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint.  Note that the threads aren't allowed
to escape to a different threaded subtree.  To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.

The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree.  This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted.  This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.

As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.

Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.

This patch enables threaded mode for the pids and perf_events
controllers.  Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.

For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.

v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
      Spotted by Waiman.
    - Documentation updated as suggested by Waiman.
    - cgroup.type content slightly reformatted.
    - Mark the debug controller threaded.

v4: - Updated to the general idea of marking specific cgroups
      domain/threaded as suggested by PeterZ.

v3: - Dropped "join" and always make mixed children join the parent's
      threaded subtree.

v2: - After discussions with Waiman, support for mixed thread mode is
      added.  This should address the issue that Peter pointed out
      where any nesting should be avoided for thread subtrees while
      coexisting with other domain cgroups.
    - Enabling / disabling thread mode now piggy backs on the existing
      control mask update mechanism.
    - Bug fixes and cleanup.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

8cfd8147

cgroup: implement CSS_TASK_ITER_THREADED · 450ee0c1

由 Tejun Heo 提交于 5月 15, 2017

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the dom_cgrp to which the processes of the subtree conceptually belong
and domain-level resource consumptions not tied to any specific task
are charged.  In the subtree, threads won't be subject to process
granularity or no-internal-task constraint and can be distributed
arbitrarily across the subtree.

This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a dom_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets.  "cgroup.procs" read path is
updated to use it so that reading the file on a proc_cgrp lists all
processes.  This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.

Task iteration is implemented nested in css_set iteration.  If
CSS_TASK_ITER_THREADED is specified, after walking tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving onto the next !threaded css_set.

v2: ->cur_pcset renamed to ->cur_dcset.  Updated for the new
    enable-threaded-per-cgroup behavior.
Signed-off-by: NTejun Heo <tj@kernel.org>

450ee0c1

cgroup: introduce cgroup->dom_cgrp and threaded css_set handling · 454000ad

由 Tejun Heo 提交于 5月 15, 2017

cgroup v2 is in the process of growing thread granularity support.  A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.

The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged.  Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.

This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.

* cgroup->dom_cgrp points to self for normal and thread roots.  For
  proper thread subtree members, points to the dom_cgrp (the thread
  root).

* css_set->dom_cset points to self if for normal and thread roots.  If
  threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
  The dom_cgrp serves as the resource domain and keeps the matching
  csses available.  The dom_cset holds those csses and makes them
  easily accessible.

* All threaded csets are linked on their dom_csets to enable iteration
  of all threaded tasks.

* cgroup->nr_threaded_children keeps track of the number of threaded
  children.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.

v4: ->nr_threaded_children added.

v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset.  Updated for the new
    enable-threaded-per-cgroup behavior.

v2: Added cgroup_is_threaded() helper.
Signed-off-by: NTejun Heo <tj@kernel.org>

454000ad

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功