1. 10 6月, 2019 1 次提交
  2. 07 6月, 2019 1 次提交
  3. 02 6月, 2019 1 次提交
    • C
      mm, memcg: consider subtrees in memory.events · 9852ae3f
      Chris Down 提交于
      memory.stat and other files already consider subtrees in their output, and
      we should too in order to not present an inconsistent interface.
      
      The current situation is fairly confusing, because people interacting with
      cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion when
      debugging reclaim events under low, as currently these always read "0" at
      non-leaf memcg nodes, which frequently causes people to misdiagnose breach
      behaviour.  The same confusion applies to other counters in this file when
      debugging issues.
      
      Aggregation is done at write time instead of at read-time since these
      counters aren't hot (unlike memory.stat which is per-page, so it does it
      at read time), and it makes sense to bundle this with the file
      notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
      
      akpm: this is a behaviour change, so Cc:stable.  THis is so that
      forthcoming distros which use cgroup v2 are more likely to pick up the
      revised behaviour.
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.nameSigned-off-by: NChris Down <chris@chrisdown.name>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9852ae3f
  4. 01 6月, 2019 1 次提交
  5. 20 4月, 2019 2 次提交
    • R
      cgroup: cgroup v2 freezer · 76f969e8
      Roman Gushchin 提交于
      Cgroup v1 implements the freezer controller, which provides an ability
      to stop the workload in a cgroup and temporarily free up some
      resources (cpu, io, network bandwidth and, potentially, memory)
      for some other tasks. Cgroup v2 lacks this functionality.
      
      This patch implements freezer for cgroup v2.
      
      Cgroup v2 freezer tries to put tasks into a state similar to jobctl
      stop. This means that tasks can be killed, ptraced (using
      PTRACE_SEIZE*), and interrupted. It is possible to attach to
      a frozen task, get some information (e.g. read registers) and detach.
      It's also possible to migrate a frozen tasks to another cgroup.
      
      This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
      tried to imitate the system-wide freezer. However uninterruptible
      sleep is fine when all tasks are going to be frozen (hibernation case),
      it's not the acceptable state for some subset of the system.
      
      Cgroup v2 freezer is not supporting freezing kthreads.
      If a non-root cgroup contains kthread, the cgroup still can be frozen,
      but the kthread will remain running, the cgroup will be shown
      as non-frozen, and the notification will not be delivered.
      
      * PTRACE_ATTACH is not working because non-fatal signal delivery
      is blocked in frozen state.
      
      There are some interface differences between cgroup v1 and cgroup v2
      freezer too, which are required to conform the cgroup v2 interface
      design principles:
      1) There is no separate controller, which has to be turned on:
      the functionality is always available and is represented by
      cgroup.freeze and cgroup.events cgroup control files.
      2) The desired state is defined by the cgroup.freeze control file.
      Any hierarchical configuration is allowed.
      3) The interface is asynchronous. The actual state is available
      using cgroup.events control file ("frozen" field). There are no
      dedicated transitional states.
      4) It's allowed to make any changes with the cgroup hierarchy
      (create new cgroups, remove old cgroups, move tasks between cgroups)
      no matter if some cgroups are frozen.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      No-objection-from-me-by: NOleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@fb.com
      76f969e8
    • R
      cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock · 4dcabece
      Roman Gushchin 提交于
      The number of descendant cgroups and the number of dying
      descendant cgroups are currently synchronized using the cgroup_mutex.
      
      The number of descendant cgroups will be required by the cgroup v2
      freezer, which will use it to determine if a cgroup is frozen
      (depending on total number of descendants and number of frozen
      descendants). It's not always acceptable to grab the cgroup_mutex,
      especially from quite hot paths (e.g. exit()).
      
      To avoid this, let's additionally synchronize these counters using
      the css_set_lock.
      
      So, it's safe to read these counters with either cgroup_mutex or
      css_set_lock locked, and for changing both locks should be acquired.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      4dcabece
  6. 06 3月, 2019 1 次提交
  7. 31 1月, 2019 1 次提交
    • O
      cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · 51bee5ab
      Oleg Nesterov 提交于
      The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
      needs pids_free() to uncharge the pid.
      
      However, ->free() is called from __put_task_struct()->cgroup_free() and this
      is too late. Even the trivial program which does
      
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			exit(0);
      	}
      
      can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU gp after the task/pid goes away and before the final put().
      
      Test-case:
      
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      
      Without this patch the forking process fails soon after migration.
      
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
      into the new helper, cgroup_release(), called by release_task() which actually
      frees the pid(s).
      Reported-by: NHerton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      51bee5ab
  8. 09 11月, 2018 1 次提交
    • W
      cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug · 5cf8114d
      Waiman Long 提交于
      For debugging purpose, it will be useful to expose the content of the
      subparts_cpus as a read-only file to see if the code work correctly.
      However, subparts_cpus will not be used at all in most use cases. So
      adding a new cpuset file that clutters the cgroup directory may not be
      desirable.  This is now being done by using the hidden "cgroup_debug"
      kernel command line option to expose a new "cpuset.cpus.subpartitions"
      file.
      
      That option was originally used by the debug controller to expose
      itself when configured into the kernel. This is now extended to set an
      internal flag used by cgroup_addrm_files(). A new CFTYPE_DEBUG flag
      can now be used to specify that a cgroup file should only be created
      when the "cgroup_debug" option is specified.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      5cf8114d
  9. 27 10月, 2018 1 次提交
  10. 05 10月, 2018 1 次提交
    • T
      cgroup: Fix dom_cgrp propagation when enabling threaded mode · 479adb89
      Tejun Heo 提交于
      A cgroup which is already a threaded domain may be converted into a
      threaded cgroup if the prerequisite conditions are met.  When this
      happens, all threaded descendant should also have their ->dom_cgrp
      updated to the new threaded domain cgroup.  Unfortunately, this
      propagation was missing leading to the following failure.
      
        # cd /sys/fs/cgroup/unified
        # cat cgroup.subtree_control    # show that no controllers are enabled
      
        # mkdir -p mycgrp/a/b/c
        # echo threaded > mycgrp/a/b/cgroup.type
      
        At this point, the hierarchy looks as follows:
      
            mycgrp [d]
      	  a [dt]
      	      b [t]
      		  c [inv]
      
        Now let's make node "a" threaded (and thus "mycgrp" s made "domain threaded"):
      
        # echo threaded > mycgrp/a/cgroup.type
      
        By this point, we now have a hierarchy that looks as follows:
      
            mycgrp [dt]
      	  a [t]
      	      b [t]
      		  c [inv]
      
        But, when we try to convert the node "c" from "domain invalid" to
        "threaded", we get ENOTSUP on the write():
      
        # echo threaded > mycgrp/a/b/c/cgroup.type
        sh: echo: write error: Operation not supported
      
      This patch fixes the problem by
      
      * Moving the opencoded ->dom_cgrp save and restoration in
        cgroup_enable_threaded() into cgroup_{save|restore}_control() so
        that mulitple cgroups can be handled.
      
      * Updating all threaded descendants' ->dom_cgrp to point to the new
        dom_cgrp when enabling threaded mode.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: N"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Reported-by: NAmin Jamali <ajamali@pivotal.io>
      Reported-by: NJoao De Almeida Pereira <jpereira@pivotal.io>
      Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
      Fixes: 454000ad ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
      Cc: stable@vger.kernel.org # v4.14+
      479adb89
  11. 09 7月, 2018 1 次提交
    • J
      blkcg: add generic throttling mechanism · d09d8df3
      Josef Bacik 提交于
      Since IO can be issued from literally anywhere it's almost impossible to
      do throttling without having some sort of adverse effect somewhere else
      in the system because of locking or other dependencies.  The best way to
      solve this is to do the throttling when we know we aren't holding any
      other kernel resources.  Do this by tracking throttling in a per-blkg
      basis, and if we require throttling flag the task that it needs to check
      before it returns to user space and possibly sleep there.
      
      This is to address the case where a process is doing work that is
      generating IO that can't be throttled, whether that is directly with a
      lot of REQ_META IO, or indirectly by allocating so much memory that it
      is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
      don't want to induce a memory allocation in the IO path, so simply
      saving the request queue in the task and flagging it to do the
      notify_resume thing achieves the same result without the overhead of a
      memory allocation.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d09d8df3
  12. 27 4月, 2018 4 次提交
    • T
      cgroup: Add cgroup_subsys->css_rstat_flush() · 8f53470b
      Tejun Heo 提交于
      This patch adds cgroup_subsys->css_rstat_flush().  If a subsystem has
      this callback, its csses are linked on cgrp->css_rstat_list and rstat
      will call the function whenever the associated cgroup is flushed.
      Flush is also performed when such csses are released so that residual
      counts aren't lost.
      
      Combined with the rstat API previous patches factored out, this allows
      controllers to plug into rstat to manage their statistics in a
      scalable way.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      8f53470b
    • T
      cgroup: Distinguish base resource stat implementation from rstat · d4ff749b
      Tejun Heo 提交于
      Base resource stat accounts universial (not specific to any
      controller) resource consumptions on top of rstat.  Currently, its
      implementation is intermixed with rstat implementation making the code
      confusing to follow.
      
      This patch clarifies the distintion by doing the followings.
      
      * Encapsulate base resource stat counters, currently only cputime, in
        struct cgroup_base_stat.
      
      * Move prev_cputime into struct cgroup and initialize it with cgroup.
      
      * Rename the related functions so that they start with cgroup_base_stat.
      
      * Prefix the related variables and field names with b.
      
      This patch doesn't make any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      d4ff749b
    • T
      cgroup: Rename stat to rstat · c58632b3
      Tejun Heo 提交于
      stat is too generic a name and ends up causing subtle confusions.
      It'll be made generic so that controllers can plug into it, which will
      make the problem worse.  Let's rename it to something more specific -
      cgroup_rstat for cgroup recursive stat.
      
      This patch does the following renames.  No other changes.
      
      * cpu_stat	-> rstat_cpu
      * stat		-> rstat
      * ?cstat	-> ?rstatc
      
      Note that the renames are selective.  The unrenamed are the ones which
      implement basic resource statistics on top of rstat.  This will be
      further cleaned up in the following patches.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      c58632b3
    • T
      cgroup: Limit event generation frequency · b12e3583
      Tejun Heo 提交于
      ".events" files generate file modified event to notify userland of
      possible new events.  Some of the events can be quite bursty
      (e.g. memory high event) and generating notification each time is
      costly and pointless.
      
      This patch implements a event rate limit mechanism.  If a new
      notification is requested before 10ms has passed since the previous
      notification, the new notification is delayed till then.
      
      As this only delays from the second notification on in a given close
      cluster of notifications, userland reactions to notifications
      shouldn't be delayed at all in most cases while avoiding notification
      storms.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b12e3583
  13. 20 3月, 2018 1 次提交
  14. 15 3月, 2018 1 次提交
  15. 02 1月, 2018 1 次提交
  16. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  17. 27 10月, 2017 1 次提交
    • T
      cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat · d41bf8c9
      Tejun Heo 提交于
      The basic cpu stat is currently shown with "cpu." prefix in
      cgroup.stat, and the same information is duplicated in cpu.stat when
      cpu controller is enabled.  This is ugly and not very scalable as we
      want to expand the coverage of stat information which is always
      available.
      
      This patch makes cgroup core always create "cpu.stat" file and show
      the basic cpu stat there and calls the cpu controller to show the
      extra stats when enabled.  This ensures that the same information
      isn't presented in multiple places and makes future expansion of basic
      stats easier.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      d41bf8c9
  18. 25 9月, 2017 1 次提交
    • T
      cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640
      Tejun Heo 提交于
      In cgroup1, while cpuacct isn't actually controlling any resources, it
      is a separate controller due to combination of two factors -
      1. enabling cpu controller has significant side effects, and 2. we
      have to pick one of the hierarchies to account CPU usages on.  cpuacct
      controller is effectively used to designate a hierarchy to track CPU
      usages on.
      
      cgroup2's unified hierarchy removes the second reason and we can
      account basic CPU usages by default.  While we can use cpuacct for
      this purpose, both its interface and implementation leave a lot to be
      desired - it collects and exposes two sources of truth which don't
      agree with each other and some of the exposed statistics don't make
      much sense.  Also, it propagates all the way up the hierarchy on each
      accounting event which is unnecessary.
      
      This patch adds basic resource accounting mechanism to cgroup2's
      unified hierarchy and accounts CPU usages using it.
      
      * All accountings are done per-cpu and don't propagate immediately.
        It just bumps the per-cgroup per-cpu counters and links to the
        parent's updated list if not already on it.
      
      * On a read, the per-cpu counters are collected into the global ones
        and then propagated upwards.  Only the per-cpu counters which have
        changed since the last read are propagated.
      
      * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
        prefix.  Total usage is collected from scheduling events.  User/sys
        breakdown is sourced from tick sampling and adjusted to the usage
        using cputime_adjust().
      
      This keeps the accounting side hot path O(1) and per-cpu and the read
      side O(nr_updated_since_last_read).
      
      v2: Minor changes and documentation updates as suggested by Waiman and
          Roman.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      041cd640
  19. 18 8月, 2017 1 次提交
  20. 03 8月, 2017 2 次提交
    • R
      cgroup: implement hierarchy limits · 1a926e0b
      Roman Gushchin 提交于
      Creating cgroup hierearchies of unreasonable size can affect
      overall system performance. A user might want to limit the
      size of cgroup hierarchy. This is especially important if a user
      is delegating some cgroup sub-tree.
      
      To address this issue, introduce an ability to control
      the size of cgroup hierarchy.
      
      The cgroup.max.descendants control file allows to set the maximum
      allowed number of descendant cgroups.
      The cgroup.max.depth file controls the maximum depth of the cgroup
      tree. Both are single value r/w files, with "max" default value.
      
      The control files exist on each hierarchy level (including root).
      When a new cgroup is created, we check the total descendants
      and depth limits on each level, and if none of them are exceeded,
      a new cgroup is created.
      
      Only alive cgroups are counted, removed (dying) cgroups are
      ignored.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      1a926e0b
    • R
      cgroup: keep track of number of descent cgroups · 0679dee0
      Roman Gushchin 提交于
      Keep track of the number of online and dying descent cgroups.
      
      This data will be used later to add an ability to control cgroup
      hierarchy (limit the depth and the number of descent cgroups)
      and display hierarchy stats.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      0679dee0
  21. 21 7月, 2017 2 次提交
    • T
      cgroup: implement cgroup v2 thread support · 8cfd8147
      Tejun Heo 提交于
      This patch implements cgroup v2 thread support.  The goal of the
      thread mode is supporting hierarchical accounting and control at
      thread granularity while staying inside the resource domain model
      which allows coordination across different resource controllers and
      handling of anonymous resource consumptions.
      
      A cgroup is always created as a domain and can be made threaded by
      writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
      becomes a member of a threaded subtree which is anchored at the
      closest ancestor which isn't threaded.
      
      The threads of the processes which are in a threaded subtree can be
      placed anywhere without being restricted by process granularity or
      no-internal-process constraint.  Note that the threads aren't allowed
      to escape to a different threaded subtree.  To be used inside a
      threaded subtree, a controller should explicitly support threaded mode
      and be able to handle internal competition in the way which is
      appropriate for the resource.
      
      The root of a threaded subtree, the nearest ancestor which isn't
      threaded, is called the threaded domain and serves as the resource
      domain for the whole subtree.  This is the last cgroup where domain
      controllers are operational and where all the domain-level resource
      consumptions in the subtree are accounted.  This allows threaded
      controllers to operate at thread granularity when requested while
      staying inside the scope of system-level resource distribution.
      
      As the root cgroup is exempt from the no-internal-process constraint,
      it can serve as both a threaded domain and a parent to normal cgroups,
      so, unlike non-root cgroups, the root cgroup can have both domain and
      threaded children.
      
      Internally, in a threaded subtree, each css_set has its ->dom_cset
      pointing to a matching css_set which belongs to the threaded domain.
      This ensures that thread root level cgroup_subsys_state for all
      threaded controllers are readily accessible for domain-level
      operations.
      
      This patch enables threaded mode for the pids and perf_events
      controllers.  Neither has to worry about domain-level resource
      consumptions and it's enough to simply set the flag.
      
      For more details on the interface and behavior of the thread mode,
      please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
      by this patch.
      
      v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
            Spotted by Waiman.
          - Documentation updated as suggested by Waiman.
          - cgroup.type content slightly reformatted.
          - Mark the debug controller threaded.
      
      v4: - Updated to the general idea of marking specific cgroups
            domain/threaded as suggested by PeterZ.
      
      v3: - Dropped "join" and always make mixed children join the parent's
            threaded subtree.
      
      v2: - After discussions with Waiman, support for mixed thread mode is
            added.  This should address the issue that Peter pointed out
            where any nesting should be avoided for thread subtrees while
            coexisting with other domain cgroups.
          - Enabling / disabling thread mode now piggy backs on the existing
            control mask update mechanism.
          - Bug fixes and cleanup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8cfd8147
    • T
      cgroup: introduce cgroup->dom_cgrp and threaded css_set handling · 454000ad
      Tejun Heo 提交于
      cgroup v2 is in the process of growing thread granularity support.  A
      threaded subtree is composed of a thread root and threaded cgroups
      which are proper members of the subtree.
      
      The root cgroup of the subtree serves as the domain cgroup to which
      the processes (as opposed to threads / tasks) of the subtree
      conceptually belong and domain-level resource consumptions not tied to
      any specific task are charged.  Inside the subtree, threads won't be
      subject to process granularity or no-internal-task constraint and can
      be distributed arbitrarily across the subtree.
      
      This patch introduces cgroup->dom_cgrp along with threaded css_set
      handling.
      
      * cgroup->dom_cgrp points to self for normal and thread roots.  For
        proper thread subtree members, points to the dom_cgrp (the thread
        root).
      
      * css_set->dom_cset points to self if for normal and thread roots.  If
        threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
        The dom_cgrp serves as the resource domain and keeps the matching
        csses available.  The dom_cset holds those csses and makes them
        easily accessible.
      
      * All threaded csets are linked on their dom_csets to enable iteration
        of all threaded tasks.
      
      * cgroup->nr_threaded_children keeps track of the number of threaded
        children.
      
      This patch adds the above but doesn't actually use them yet.  The
      following patches will build on top.
      
      v4: ->nr_threaded_children added.
      
      v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset.  Updated for the new
          enable-threaded-per-cgroup behavior.
      
      v2: Added cgroup_is_threaded() helper.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      454000ad
  22. 17 7月, 2017 1 次提交
    • T
      cgroup: distinguish local and children populated states · 788b950c
      Tejun Heo 提交于
      cgrp->populated_cnt counts both local (the cgroup's populated
      css_sets) and subtree proper (populated children) so that it's only
      zero when the whole subtree, including self, is empty.
      
      This patch splits the counter into two so that local and children
      populated states are tracked separately.  It allows finer-grained
      tests on the state of the hierarchy which will be used to replace
      css_set walking local populated test.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      788b950c
  23. 29 6月, 2017 1 次提交
    • T
      cgroup: implement "nsdelegate" mount option · 5136f636
      Tejun Heo 提交于
      Currently, cgroup only supports delegation to !root users and cgroup
      namespaces don't get any special treatments.  This limits the
      usefulness of cgroup namespaces as they by themselves can't be safe
      delegation boundaries.  A process inside a cgroup can change the
      resource control knobs of the parent in the namespace root and may
      move processes in and out of the namespace if cgroups outside its
      namespace are visible somehow.
      
      This patch adds a new mount option "nsdelegate" which makes cgroup
      namespaces delegation boundaries.  If set, cgroup behaves as if write
      permission based delegation took place at namespace boundaries -
      writes to the resource control knobs from the namespace root are
      denied and migration crossing the namespace boundary aren't allowed
      from inside the namespace.
      
      This allows cgroup namespace to function as a delegation boundary by
      itself.
      
      v2: Silently ignore nsdelegate specified on !init mounts.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Aravind Anbudurai <aru7@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      5136f636
  24. 15 6月, 2017 1 次提交
    • W
      cgroup: Keep accurate count of tasks in each css_set · 73a7242a
      Waiman Long 提交于
      The reference count in the css_set data structure was used as a
      proxy of the number of tasks attached to that css_set. However, that
      count is actually not an accurate measure especially with thread mode
      support. So a new variable nr_tasks is added to the css_set to keep
      track of the actual task count. This new variable is protected by
      the css_set_lock. Functions that require the actual task count are
      updated to use the new variable.
      
      tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
          Refreshed on top of cgroup/for-v4.13 which dropped on
          css_set_populated() -> nr_tasks conversion.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      73a7242a
  25. 18 5月, 2017 1 次提交
  26. 11 4月, 2017 1 次提交
    • T
      cgroup: move cgroup_subsys_state parent field for cache locality · b8b1a2e5
      Todd Poynor 提交于
      Various structures embed a struct cgroup_subsys_state, typically at
      the top of the containing structure.  It is common for code that
      accesses the structures to perform operations that iterate over the
      chain of parent css pointers, also accessing data in each containing
      structure.  In particular, struct cpuacct is used by fairly hot code
      paths in the scheduler such as cpuacct_charge().
      
      Move the parent css pointer field to the end of the structure to
      increase the chances of residing in the same cache line as the data
      from the containing structure.
      Signed-off-by: NTodd Poynor <toddpoynor@google.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b8b1a2e5
  27. 09 3月, 2017 1 次提交
  28. 02 3月, 2017 1 次提交
    • I
      sched/headers, cgroups: Remove the threadgroup_change_*() wrappery · 780de9dd
      Ingo Molnar 提交于
      threadgroup_change_begin()/end() is a pointless wrapper around
      cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
      in the !CONFIG_CGROUPS=y case.
      
      Remove the wrappery, move the might_sleep() (the down_read()
      already has a might_sleep() check).
      
      This debloats <linux/sched.h> a bit and simplifies this API.
      
      Update all call sites.
      
      No change in functionality.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      780de9dd
  29. 28 12月, 2016 2 次提交
  30. 26 11月, 2016 1 次提交
    • D
      cgroup: add support for eBPF programs · 30070984
      Daniel Mack 提交于
      This patch adds two sets of eBPF program pointers to struct cgroup.
      One for such that are directly pinned to a cgroup, and one for such
      that are effective for it.
      
      To illustrate the logic behind that, assume the following example
      cgroup hierarchy.
      
        A - B - C
              \ D - E
      
      If only B has a program attached, it will be effective for B, C, D
      and E. If D then attaches a program itself, that will be effective for
      both D and E, and the program in B will only affect B and C. Only one
      program of a given type is effective for a cgroup.
      
      Attaching and detaching programs will be done through the bpf(2)
      syscall. For now, ingress and egress inet socket filtering are the
      only supported use-cases.
      Signed-off-by: NDaniel Mack <daniel@zonque.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30070984
  31. 26 4月, 2016 1 次提交
    • T
      cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb
      Tejun Heo 提交于
      Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
      kicks off asynchronous NUMA node migration if necessary during task
      migration and flushes it from cpuset_post_attach_flush() which is
      called at the end of __cgroup_procs_write().  This is to avoid
      performing migration with cgroup_threadgroup_rwsem write-locked which
      can lead to deadlock through dependency on kworker creation.
      
      memcg has a similar issue with charge moving, so let's convert it to
      an official callback rather than the current one-off cpuset specific
      function.  This patch adds cgroup_subsys->post_attach callback and
      makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
      
      The conversion is mostly one-to-one except that the new callback is
      called under cgroup_mutex.  This is to guarantee that no other
      migration operations are started before ->post_attach callbacks are
      finished.  cgroup_mutex is one of the outermost mutex in the system
      and has never been and shouldn't be a problem.  We can add specialized
      synchronization around __cgroup_procs_write() but I don't think
      there's any noticeable benefit.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
      5cf1cacb
  32. 17 3月, 2016 1 次提交
    • T
      cgroup: ignore css_sets associated with dead cgroups during migration · 2b021cbf
      Tejun Heo 提交于
      Before 2e91fa7f ("cgroup: keep zombies associated with their
      original cgroups"), all dead tasks were associated with init_css_set.
      If a zombie task is requested for migration, while migration prep
      operations would still be performed on init_css_set, the actual
      migration would ignore zombie tasks.  As init_css_set is always valid,
      this worked fine.
      
      However, after 2e91fa7f, zombie tasks stay with the css_set it was
      associated with at the time of death.  Let's say a task T associated
      with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2.  After T
      becomes a zombie, it would still remain associated with A and B.  If A
      only contains zombie tasks, it can be removed.  On removal, A gets
      marked offline but stays pinned until all zombies are drained.  At
      this point, if migration is initiated on T to a cgroup C on hierarchy
      H-2, migration path would try to prepare T's css_set for migration and
      trigger the following.
      
       WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
       CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
       ...
       Call Trace:
        [<ffffffff8127e63c>] dump_stack+0x4e/0x82
        [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
        [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
        [<ffffffff810c33e1>] cgroup_get+0x121/0x160
        [<ffffffff810c349b>] link_css_set+0x7b/0x90
        [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
        [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
        [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
        [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
        [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
        [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
        [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
        [<ffffffff81151673>] __vfs_write+0x23/0xe0
        [<ffffffff81152494>] vfs_write+0xa4/0x1a0
        [<ffffffff811532d4>] SyS_write+0x44/0xa0
        [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      It doesn't make sense to prepare migration for css_sets pointing to
      dead cgroups as they are guaranteed to contain only zombies which are
      ignored later during migration.  This patch makes cgroup destruction
      path mark all affected css_sets as dead and updates the migration path
      to ignore them during preparation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: 2e91fa7f ("cgroup: keep zombies associated with their original cgroups")
      Cc: stable@vger.kernel.org # v4.4+
      2b021cbf
  33. 09 3月, 2016 1 次提交
    • T
      cgroup: implement cgroup_subsys->implicit_on_dfl · f6d635ad
      Tejun Heo 提交于
      Some controllers, perf_event for now and possibly freezer in the
      future, don't really make sense to control explicitly through
      "cgroup.subtree_control".  For example, the primary role of perf_event
      is identifying the cgroups of tasks; however, because the controller
      also keeps a small amount of state per cgroup, it can't be replaced
      with simple cgroup membership tests.
      
      This patch implements cgroup_subsys->implicit_on_dfl flag.  When set,
      the controller is implicitly enabled on all cgroups on the v2
      hierarchy so that utility type controllers such as perf_event can be
      enabled and function transparently.
      
      An implicit controller doesn't show up in "cgroup.controllers" or
      "cgroup.subtree_control", is exempt from no internal process rule and
      can be stolen from the default hierarchy even if there are non-root
      csses.
      
      v2: Reimplemented on top of the recent updates to css handling and
          subsystem rebinding.  Rebinding implicit subsystems is now a
          simple matter of exempting it from the busy subsystem check.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      f6d635ad