1. 16 5月, 2018 1 次提交
  2. 20 3月, 2018 1 次提交
  3. 22 2月, 2018 1 次提交
    • T
      cgroup: fix rule checking for threaded mode switching · d1897c95
      Tejun Heo 提交于
      A domain cgroup isn't allowed to be turned threaded if its subtree is
      populated or domain controllers are enabled.  cgroup_enable_threaded()
      depended on cgroup_can_be_thread_root() test to enforce this rule.  A
      parent which has populated domain descendants or have domain
      controllers enabled can't become a thread root, so the above rules are
      enforced automatically.
      
      However, for the root cgroup which can host mixed domain and threaded
      children, cgroup_can_be_thread_root() doesn't check any of those
      conditions and thus first level cgroups ends up escaping those rules.
      
      This patch fixes the bug by adding explicit checks for those rules in
      cgroup_enable_threaded().
      Reported-by: NMichael Kerrisk (man-pages) <mtk.manpages@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: 8cfd8147 ("cgroup: implement cgroup v2 thread support")
      Cc: stable@vger.kernel.org # v4.14+
      d1897c95
  4. 20 1月, 2018 1 次提交
    • T
      string: drop __must_check from strscpy() and restore strscpy() usages in cgroup · 08a77676
      Tejun Heo 提交于
      e7fd37ba ("cgroup: avoid copying strings longer than the buffers")
      converted possibly unsafe strncpy() usages in cgroup to strscpy().
      However, although the callsites are completely fine with truncated
      copied, because strscpy() is marked __must_check, it led to the
      following warnings.
      
        kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
        kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
           strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
      	       ^
      
      To avoid the warnings, 50034ed4 ("cgroup: use strlcpy() instead of
      strscpy() to avoid spurious warning") switched them to strlcpy().
      
      strlcpy() is worse than strlcpy() because it unconditionally runs
      strlen() on the source string, and the only reason we switched to
      strlcpy() here was because it was lacking __must_check, which doesn't
      reflect any material differences between the two function.  It's just
      that someone added __must_check to strscpy() and not to strlcpy().
      
      These basic string copy operations are used in variety of ways, and
      one of not-so-uncommon use cases is safely handling truncated copies,
      where the caller naturally doesn't care about the return value.  The
      __must_check doesn't match the actual use cases and forces users to
      opt for inferior variants which lack __must_check by happenstance or
      spread ugly (void) casts.
      
      Remove __must_check from strscpy() and restore strscpy() usages in
      cgroup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      08a77676
  5. 11 1月, 2018 1 次提交
  6. 20 12月, 2017 1 次提交
    • T
      cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC · 74d0833c
      Tejun Heo 提交于
      While teaching css_task_iter to handle skipping over tasks which
      aren't group leaders, bc2fb7ed ("cgroup: add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS") introduced a
      silly bug.
      
      CSS_TASK_ITER_PROCS is implemented by repeating
      css_task_iter_advance() while the advanced cursor is pointing to a
      non-leader thread.  However, the cursor variable, @l, wasn't updated
      when the iteration has to advance to the next css_set and the
      following repetition would operate on the terminal @l from the
      previous iteration which isn't pointing to a valid task leading to
      oopses like the following or infinite looping.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000254
        IP: __task_pid_nr_ns+0xc7/0xf0
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP
        ...
        CPU: 2 PID: 1 Comm: systemd Not tainted 4.14.4-200.fc26.x86_64 #1
        Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 3203 11/09/2017
        task: ffff88c4baee8000 task.stack: ffff96d5c3158000
        RIP: 0010:__task_pid_nr_ns+0xc7/0xf0
        RSP: 0018:ffff96d5c315bd50 EFLAGS: 00010206
        RAX: 0000000000000000 RBX: ffff88c4b68c6000 RCX: 0000000000000250
        RDX: ffffffffa5e47960 RSI: 0000000000000000 RDI: ffff88c490f6ab00
        RBP: ffff96d5c315bd50 R08: 0000000000001000 R09: 0000000000000005
        R10: ffff88c4be006b80 R11: ffff88c42f1b8004 R12: ffff96d5c315bf18
        R13: ffff88c42d7dd200 R14: ffff88c490f6a510 R15: ffff88c4b68c6000
        FS:  00007f9446f8ea00(0000) GS:ffff88c4be680000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000254 CR3: 00000007f956f000 CR4: 00000000003406e0
        Call Trace:
         cgroup_procs_show+0x19/0x30
         cgroup_seqfile_show+0x4c/0xb0
         kernfs_seq_show+0x21/0x30
         seq_read+0x2ec/0x3f0
         kernfs_fop_read+0x134/0x180
         __vfs_read+0x37/0x160
         ? security_file_permission+0x9b/0xc0
         vfs_read+0x8e/0x130
         SyS_read+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        RIP: 0033:0x7f94455f942d
        RSP: 002b:00007ffe81ba2d00 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
        RAX: ffffffffffffffda RBX: 00005574e2233f00 RCX: 00007f94455f942d
        RDX: 0000000000001000 RSI: 00005574e2321a90 RDI: 000000000000002b
        RBP: 0000000000000000 R08: 00005574e2321a90 R09: 00005574e231de60
        R10: 00007f94458c8b38 R11: 0000000000000293 R12: 00007f94458c8ae0
        R13: 00007ffe81ba3800 R14: 0000000000000000 R15: 00005574e2116560
        Code: 04 74 0e 89 f6 48 8d 04 76 48 8d 04 c5 f0 05 00 00 48 8b bf b8 05 00 00 48 01 c7 31 c0 48 8b 0f 48 85 c9 74 18 8b b2 30 08 00 00 <3b> 71 04 77 0d 48 c1 e6 05 48 01 f1 48 3b 51 38 74 09 5d c3 8b
        RIP: __task_pid_nr_ns+0xc7/0xf0 RSP: ffff96d5c315bd50
      
      Fix it by moving the initialization of the cursor below the repeat
      label.  While at it, rename it to @next for readability.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: bc2fb7ed ("cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS")
      Cc: stable@vger.kernel.org # v4.14+
      Reported-by: NLaura Abbott <labbott@redhat.com>
      Reported-by: NBronek Kozicki <brok@incorrekt.com>
      Reported-by: NGeorge Amanakis <gamanakis@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      74d0833c
  7. 15 12月, 2017 1 次提交
  8. 12 12月, 2017 1 次提交
  9. 07 11月, 2017 2 次提交
    • R
      cgroup: export list of cgroups v2 features using sysfs · 5f2e6734
      Roman Gushchin 提交于
      The active development of cgroups v2 sometimes leads to a creation
      of interfaces, which are not turned on by default (to provide
      backward compatibility). It's handy to know from userspace, which
      cgroup v2 features are supported without calculating it based
      on the kernel version. So, let's export the list of such features
      using /sys/kernel/cgroup/features pseudo-file.
      
      The list is hardcoded and has to be extended when new functionality
      is added. Each feature is printed on a new line.
      
      Example:
        $ cat /sys/kernel/cgroup/features
        nsdelegate
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: NTejun Heo <tj@kernel.org>
      5f2e6734
    • R
      cgroup: export list of delegatable control files using sysfs · 01ee6cfb
      Roman Gushchin 提交于
      Delegatable cgroup v2 control files may require special handling
      (e.g. chowning), and the exact list of such files varies between
      kernel versions (and likely to be extended in the future).
      
      To guarantee correctness of this list and simplify the life
      of userspace (systemd, first of all), let's export the list
      via /sys/kernel/cgroup/delegate pseudo-file.
      
      Format is siple: each control file name is printed on a new line.
      Example:
        $ cat /sys/kernel/cgroup/delegate
        cgroup.procs
        cgroup.subtree_control
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: NTejun Heo <tj@kernel.org>
      01ee6cfb
  10. 30 10月, 2017 1 次提交
  11. 27 10月, 2017 1 次提交
    • T
      cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat · d41bf8c9
      Tejun Heo 提交于
      The basic cpu stat is currently shown with "cpu." prefix in
      cgroup.stat, and the same information is duplicated in cpu.stat when
      cpu controller is enabled.  This is ugly and not very scalable as we
      want to expand the coverage of stat information which is always
      available.
      
      This patch makes cgroup core always create "cpu.stat" file and show
      the basic cpu stat there and calls the cpu controller to show the
      extra stats when enabled.  This ensures that the same information
      isn't presented in multiple places and makes future expansion of basic
      stats easier.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      d41bf8c9
  12. 11 10月, 2017 1 次提交
  13. 10 10月, 2017 1 次提交
  14. 05 10月, 2017 2 次提交
    • A
      bpf: introduce BPF_PROG_QUERY command · 468e2f64
      Alexei Starovoitov 提交于
      introduce BPF_PROG_QUERY command to retrieve a set of either
      attached programs to given cgroup or a set of effective programs
      that will execute for events within a cgroup
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      468e2f64
    • A
      bpf: multi program support for cgroup+bpf · 324bda9e
      Alexei Starovoitov 提交于
      introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
      bpf programs to a cgroup.
      
      The difference between three possible flags for BPF_PROG_ATTACH command:
      - NONE(default): No further bpf programs allowed in the subtree.
      - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
        the program in this cgroup yields to sub-cgroup program.
      - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
        that cgroup program gets run in addition to the program in this cgroup.
      
      NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
      change their behavior. It only clarifies the semantics in relation
      to new flag.
      
      Only one program is allowed to be attached to a cgroup with
      NONE or BPF_F_ALLOW_OVERRIDE flag.
      Multiple programs are allowed to be attached to a cgroup with
      BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
      (those that were attached first, run first)
      The programs of sub-cgroup are executed first, then programs of
      this cgroup and then programs of parent cgroup.
      All eligible programs are executed regardless of return code from
      earlier programs.
      
      To allow efficient execution of multiple programs attached to a cgroup
      and to avoid penalizing cgroups without any programs attached
      introduce 'struct bpf_prog_array' which is RCU protected array
      of pointers to bpf programs.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      324bda9e
  15. 26 9月, 2017 1 次提交
    • T
      cgroup: statically initialize init_css_set->dfl_cgrp · 38683148
      Tejun Heo 提交于
      Like other csets, init_css_set's dfl_cgrp is initialized when the cset
      gets linked.  init_css_set gets linked in cgroup_init().  This has
      been fine till now but the recently added basic CPU usage accounting
      may end up accessing dfl_cgrp of init before cgroup_init() leading to
      the following oops.
      
        SELinux:  Initializing.
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
        IP: account_system_index_time+0x60/0x90
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc2-00003-g041cd640 #10
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
        +1.9.3-20161025_171302-gandalf 04/01/2014
        task: ffffffff81e10480 task.stack: ffffffff81e00000
        RIP: 0010:account_system_index_time+0x60/0x90
        RSP: 0000:ffff880011e03cb8 EFLAGS: 00010002
        RAX: ffffffff81ef8800 RBX: ffffffff81e10480 RCX: 0000000000000003
        RDX: 0000000000000000 RSI: 00000000000f4240 RDI: 0000000000000000
        RBP: ffff880011e03cc0 R08: 0000000000010000 R09: 0000000000000000
        R10: 0000000000000020 R11: 0000003b9aca0000 R12: 000000000001c100
        R13: 0000000000000000 R14: ffffffff81e10480 R15: ffffffff81e03cd8
        FS:  0000000000000000(0000) GS:ffff880011e00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000b0 CR3: 0000000001e09000 CR4: 00000000000006b0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         account_system_time+0x45/0x60
         account_process_tick+0x5a/0x140
         update_process_times+0x22/0x60
         tick_periodic+0x2b/0x90
         tick_handle_periodic+0x25/0x70
         timer_interrupt+0x15/0x20
         __handle_irq_event_percpu+0x7e/0x1b0
         handle_irq_event_percpu+0x23/0x60
         handle_irq_event+0x42/0x70
         handle_level_irq+0x83/0x100
         handle_irq+0x6f/0x110
         do_IRQ+0x46/0xd0
         common_interrupt+0x9d/0x9d
      
      Fix it by statically initializing init_css_set.dfl_cgrp so that init's
      default cgroup is accessible from the get-go.
      
      Fixes: 041cd640 ("cgroup: Implement cgroup2 basic CPU usage accounting")
      Reported-by: N“kbuild-all@01.org” <kbuild-all@01.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      38683148
  16. 25 9月, 2017 1 次提交
    • T
      cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640
      Tejun Heo 提交于
      In cgroup1, while cpuacct isn't actually controlling any resources, it
      is a separate controller due to combination of two factors -
      1. enabling cpu controller has significant side effects, and 2. we
      have to pick one of the hierarchies to account CPU usages on.  cpuacct
      controller is effectively used to designate a hierarchy to track CPU
      usages on.
      
      cgroup2's unified hierarchy removes the second reason and we can
      account basic CPU usages by default.  While we can use cpuacct for
      this purpose, both its interface and implementation leave a lot to be
      desired - it collects and exposes two sources of truth which don't
      agree with each other and some of the exposed statistics don't make
      much sense.  Also, it propagates all the way up the hierarchy on each
      accounting event which is unnecessary.
      
      This patch adds basic resource accounting mechanism to cgroup2's
      unified hierarchy and accounts CPU usages using it.
      
      * All accountings are done per-cpu and don't propagate immediately.
        It just bumps the per-cgroup per-cpu counters and links to the
        parent's updated list if not already on it.
      
      * On a read, the per-cpu counters are collected into the global ones
        and then propagated upwards.  Only the per-cpu counters which have
        changed since the last read are propagated.
      
      * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
        prefix.  Total usage is collected from scheduling events.  User/sys
        breakdown is sourced from tick sampling and adjusted to the usage
        using cputime_adjust().
      
      This keeps the accounting side hot path O(1) and per-cpu and the read
      side O(nr_updated_since_last_read).
      
      v2: Minor changes and documentation updates as suggested by Waiman and
          Roman.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      041cd640
  17. 22 9月, 2017 1 次提交
    • W
      cgroup: Reinit cgroup_taskset structure before cgroup_migrate_execute() returns · c4fa6c43
      Waiman Long 提交于
      The cgroup_taskset structure within the larger cgroup_mgctx structure
      is supposed to be used once and then discarded. That is not really the
      case in the hotplug code path:
      
      cpuset_hotplug_workfn()
       - cgroup_transfer_tasks()
         - cgroup_migrate()
           - cgroup_migrate_add_task()
           - cgroup_migrate_execute()
      
      In this case, the cgroup_migrate() function is called multiple time
      with the same cgroup_mgctx structure to transfer the tasks from
      one cgroup to another one-by-one. The second time cgroup_migrate()
      is called, the cgroup_taskset will be in an incorrect state and so
      may cause the system to panic. For example,
      
        [  150.888410] Faulting instruction address: 0xc0000000001db648
        [  150.888414] Oops: Kernel access of bad area, sig: 11 [#1]
        [  150.888417] SMP NR_CPUS=2048
        [  150.888417] NUMA
        [  150.888419] pSeries
          :
        [  150.888545] NIP [c0000000001db648] cpuset_can_attach+0x58/0x1b0
        [  150.888548] LR [c0000000001db638] cpuset_can_attach+0x48/0x1b0
        [  150.888551] Call Trace:
        [  150.888554] [c0000005f65cb940] [c0000000001db638] cpuset_can_attach+0x48/0x1b 0 (unreliable)
        [  150.888559] [c0000005f65cb9a0] [c0000000001cff04] cgroup_migrate_execute+0xc4/0x4b0
        [  150.888563] [c0000005f65cba20] [c0000000001d7d14] cgroup_transfer_tasks+0x1d4/0x370
        [  150.888568] [c0000005f65cbb70] [c0000000001ddcb0] cpuset_hotplug_workfn+0x710/0x8f0
        [  150.888572] [c0000005f65cbc80] [c00000000012032c] process_one_work+0x1ac/0x4d0
        [  150.888576] [c0000005f65cbd20] [c0000000001206f8] worker_thread+0xa8/0x5b0
        [  150.888580] [c0000005f65cbdc0] [c0000000001293f8] kthread+0x168/0x1b0
        [  150.888584] [c0000005f65cbe30] [c00000000000b368] ret_from_kernel_thread+0x5c/0x74
      
      To allow reuse of the cgroup_mgctx structure, some fields in that
      structure are now re-initialized at the end of cgroup_migrate_execute()
      function call so that the structure can be reused again in a later
      iteration without causing problem.
      
      This bug was introduced in the commit e595cd70 ("group: track
      migration context in cgroup_mgctx") in 4.11. This commit moves the
      cgroup_taskset initialization out of cgroup_migrate(). The commit
      10467270fb3 ("cgroup: don't call migration methods if there are no
      tasks to migrate") helped, but did not completely resolve the problem.
      
      Fixes: e595cd70 ("group: track migration context in cgroup_mgctx")
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v4.11+
      c4fa6c43
  18. 07 9月, 2017 1 次提交
  19. 12 8月, 2017 1 次提交
  20. 11 8月, 2017 1 次提交
    • T
      cgroup: misc changes · 3e48930c
      Tejun Heo 提交于
      Misc trivial changes to prepare for future changes.  No functional
      difference.
      
      * Expose cgroup_get(), cgroup_tryget() and cgroup_parent().
      
      * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.
      
      * Rename cgroup_stats_show() to cgroup_stat_show() for consistency
        with the file name.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      3e48930c
  21. 03 8月, 2017 5 次提交
    • T
      cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy · 13d82fb7
      Tejun Heo 提交于
      Each css_set directly points to the default cgroup it belongs to, so
      there's no reason to walk the cgrp_links list on the default
      hierarchy.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      13d82fb7
    • R
      cgroup: re-use the parent pointer in cgroup_destroy_locked() · 5a621e6c
      Roman Gushchin 提交于
      As we already have a pointer to the parent cgroup in
      cgroup_destroy_locked(), we don't need to calculate it again
      to pass as an argument for cgroup1_check_for_release().
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: linux-kernel@vger.kernel.org
      5a621e6c
    • R
      cgroup: add cgroup.stat interface with basic hierarchy stats · ec39225c
      Roman Gushchin 提交于
      A cgroup can consume resources even after being deleted by a user.
      For example, writing back dirty pages should be accounted and
      limited, despite the corresponding cgroup might contain no processes
      and being deleted by a user.
      
      In the current implementation a cgroup can remain in such "dying" state
      for an undefined amount of time. For instance, if a memory cgroup
      contains a pge, mlocked by a process belonging to an other cgroup.
      
      Although the lifecycle of a dying cgroup is out of user's control,
      it's important to have some insight of what's going on under the hood.
      
      In particular, it's handy to have a counter which will allow
      to detect css leaks.
      
      To solve this problem, add a cgroup.stat interface to
      the base cgroup control files with the following metrics:
      
      nr_descendants		total number of visible descendant cgroups
      nr_dying_descendants	total number of dying descendant cgroups
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      ec39225c
    • R
      cgroup: implement hierarchy limits · 1a926e0b
      Roman Gushchin 提交于
      Creating cgroup hierearchies of unreasonable size can affect
      overall system performance. A user might want to limit the
      size of cgroup hierarchy. This is especially important if a user
      is delegating some cgroup sub-tree.
      
      To address this issue, introduce an ability to control
      the size of cgroup hierarchy.
      
      The cgroup.max.descendants control file allows to set the maximum
      allowed number of descendant cgroups.
      The cgroup.max.depth file controls the maximum depth of the cgroup
      tree. Both are single value r/w files, with "max" default value.
      
      The control files exist on each hierarchy level (including root).
      When a new cgroup is created, we check the total descendants
      and depth limits on each level, and if none of them are exceeded,
      a new cgroup is created.
      
      Only alive cgroups are counted, removed (dying) cgroups are
      ignored.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      1a926e0b
    • R
      cgroup: keep track of number of descent cgroups · 0679dee0
      Roman Gushchin 提交于
      Keep track of the number of online and dying descent cgroups.
      
      This data will be used later to add an ability to control cgroup
      hierarchy (limit the depth and the number of descent cgroups)
      and display hierarchy stats.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      0679dee0
  22. 29 7月, 2017 2 次提交
  23. 26 7月, 2017 2 次提交
  24. 23 7月, 2017 1 次提交
  25. 21 7月, 2017 6 次提交
    • W
      cgroup: update debug controller to print out thread mode information · 7a0cf0e7
      Waiman Long 提交于
      Update debug controller so that it prints out debug info about thread
      mode.
      
       1) The relationship between proc_cset and threaded_csets are displayed.
       2) The status of being a thread root or threaded cgroup is displayed.
      
      This patch is extracted from Waiman's larger patch.
      
      v2: - Removed [thread root] / [threaded] from debug.cgroup_css_links
            file as the same information is available from cgroup.type.
            Suggested by Waiman.
          - Threaded marking is moved to the previous patch.
      Patch-originally-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      7a0cf0e7
    • T
      cgroup: implement cgroup v2 thread support · 8cfd8147
      Tejun Heo 提交于
      This patch implements cgroup v2 thread support.  The goal of the
      thread mode is supporting hierarchical accounting and control at
      thread granularity while staying inside the resource domain model
      which allows coordination across different resource controllers and
      handling of anonymous resource consumptions.
      
      A cgroup is always created as a domain and can be made threaded by
      writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
      becomes a member of a threaded subtree which is anchored at the
      closest ancestor which isn't threaded.
      
      The threads of the processes which are in a threaded subtree can be
      placed anywhere without being restricted by process granularity or
      no-internal-process constraint.  Note that the threads aren't allowed
      to escape to a different threaded subtree.  To be used inside a
      threaded subtree, a controller should explicitly support threaded mode
      and be able to handle internal competition in the way which is
      appropriate for the resource.
      
      The root of a threaded subtree, the nearest ancestor which isn't
      threaded, is called the threaded domain and serves as the resource
      domain for the whole subtree.  This is the last cgroup where domain
      controllers are operational and where all the domain-level resource
      consumptions in the subtree are accounted.  This allows threaded
      controllers to operate at thread granularity when requested while
      staying inside the scope of system-level resource distribution.
      
      As the root cgroup is exempt from the no-internal-process constraint,
      it can serve as both a threaded domain and a parent to normal cgroups,
      so, unlike non-root cgroups, the root cgroup can have both domain and
      threaded children.
      
      Internally, in a threaded subtree, each css_set has its ->dom_cset
      pointing to a matching css_set which belongs to the threaded domain.
      This ensures that thread root level cgroup_subsys_state for all
      threaded controllers are readily accessible for domain-level
      operations.
      
      This patch enables threaded mode for the pids and perf_events
      controllers.  Neither has to worry about domain-level resource
      consumptions and it's enough to simply set the flag.
      
      For more details on the interface and behavior of the thread mode,
      please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
      by this patch.
      
      v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
            Spotted by Waiman.
          - Documentation updated as suggested by Waiman.
          - cgroup.type content slightly reformatted.
          - Mark the debug controller threaded.
      
      v4: - Updated to the general idea of marking specific cgroups
            domain/threaded as suggested by PeterZ.
      
      v3: - Dropped "join" and always make mixed children join the parent's
            threaded subtree.
      
      v2: - After discussions with Waiman, support for mixed thread mode is
            added.  This should address the issue that Peter pointed out
            where any nesting should be avoided for thread subtrees while
            coexisting with other domain cgroups.
          - Enabling / disabling thread mode now piggy backs on the existing
            control mask update mechanism.
          - Bug fixes and cleanup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8cfd8147
    • T
      cgroup: implement CSS_TASK_ITER_THREADED · 450ee0c1
      Tejun Heo 提交于
      cgroup v2 is in the process of growing thread granularity support.
      Once thread mode is enabled, the root cgroup of the subtree serves as
      the dom_cgrp to which the processes of the subtree conceptually belong
      and domain-level resource consumptions not tied to any specific task
      are charged.  In the subtree, threads won't be subject to process
      granularity or no-internal-task constraint and can be distributed
      arbitrarily across the subtree.
      
      This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
      which, when used on a dom_cgrp, makes the iteration include the tasks
      on all the associated threaded css_sets.  "cgroup.procs" read path is
      updated to use it so that reading the file on a proc_cgrp lists all
      processes.  This will also be used by controller implementations which
      need to walk processes or tasks at the resource domain level.
      
      Task iteration is implemented nested in css_set iteration.  If
      CSS_TASK_ITER_THREADED is specified, after walking tasks of each
      !threaded css_set, all the associated threaded css_sets are visited
      before moving onto the next !threaded css_set.
      
      v2: ->cur_pcset renamed to ->cur_dcset.  Updated for the new
          enable-threaded-per-cgroup behavior.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      450ee0c1
    • T
      cgroup: introduce cgroup->dom_cgrp and threaded css_set handling · 454000ad
      Tejun Heo 提交于
      cgroup v2 is in the process of growing thread granularity support.  A
      threaded subtree is composed of a thread root and threaded cgroups
      which are proper members of the subtree.
      
      The root cgroup of the subtree serves as the domain cgroup to which
      the processes (as opposed to threads / tasks) of the subtree
      conceptually belong and domain-level resource consumptions not tied to
      any specific task are charged.  Inside the subtree, threads won't be
      subject to process granularity or no-internal-task constraint and can
      be distributed arbitrarily across the subtree.
      
      This patch introduces cgroup->dom_cgrp along with threaded css_set
      handling.
      
      * cgroup->dom_cgrp points to self for normal and thread roots.  For
        proper thread subtree members, points to the dom_cgrp (the thread
        root).
      
      * css_set->dom_cset points to self if for normal and thread roots.  If
        threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
        The dom_cgrp serves as the resource domain and keeps the matching
        csses available.  The dom_cset holds those csses and makes them
        easily accessible.
      
      * All threaded csets are linked on their dom_csets to enable iteration
        of all threaded tasks.
      
      * cgroup->nr_threaded_children keeps track of the number of threaded
        children.
      
      This patch adds the above but doesn't actually use them yet.  The
      following patches will build on top.
      
      v4: ->nr_threaded_children added.
      
      v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset.  Updated for the new
          enable-threaded-per-cgroup behavior.
      
      v2: Added cgroup_is_threaded() helper.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      454000ad
    • T
      cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed
      Tejun Heo 提交于
      css_task_iter currently always walks all tasks.  With the scheduled
      cgroup v2 thread support, the iterator would need to handle multiple
      types of iteration.  As a preparation, add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
      is not specified, it walks all tasks as before.  When asserted, the
      iterator only walks the group leaders.
      
      For now, the only user of the flag is cgroup v2 "cgroup.procs" file
      which no longer needs to skip non-leader tasks in cgroup_procs_next().
      Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
      v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
      cgroup" but "list all thread group id's with any threads in the
      cgroup".
      
      While at it, update cgroup_procs_show() to use task_pid_vnr() instead
      of task_tgid_vnr().  As the iteration guarantees that the function
      only sees group leaders, this doesn't change the output and will allow
      sharing the function for thread iteration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      bc2fb7ed
    • T
      cgroup: reorganize cgroup.procs / task write path · 715c809d
      Tejun Heo 提交于
      Currently, writes "cgroup.procs" and "cgroup.tasks" files are all
      handled by __cgroup_procs_write() on both v1 and v2.  This patch
      reoragnizes the write path so that there are common helper functions
      that different write paths use.
      
      While this somewhat increases LOC, the different paths are no longer
      intertwined and each path has more flexibility to implement different
      behaviors which will be necessary for the planned v2 thread support.
      
      v3: - Restructured so that cgroup_procs_write_permission() takes
            @src_cgrp and @dst_cgrp.
      
      v2: - Rolled in Waiman's task reference count fix.
          - Updated on top of nsdelegate changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      715c809d
  26. 19 7月, 2017 1 次提交
    • T
      cgroup: create dfl_root files on subsys registration · 7af608e4
      Tejun Heo 提交于
      On subsystem registration, css_populate_dir() is not called on the new
      root css, so the interface files for the subsystem on cgrp_dfl_root
      aren't created on registration.  This is a residue from the days when
      cgrp_dfl_root was used only as the parking spot for unused subsystems,
      which no longer is true as it's used as the root for cgroup2.
      
      This is often fine as later operations tend to create them as a part
      of mount (cgroup1) or subtree_control operations (cgroup2); however,
      it's not difficult to mount cgroup2 with the controller interface
      files missing as Waiman found out.
      
      Fix it by invoking css_populate_dir() on the root css on subsys
      registration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NWaiman Long <longman@redhat.com>
      Cc: stable@vger.kernel.org # v4.5+
      Signed-off-by: NTejun Heo <tj@kernel.org>
      7af608e4
  27. 17 7月, 2017 1 次提交