1. Aug 29, 2019 (2 commits)
    • blkcg: add tools/cgroup/iocost_coef_gen.py · 8504dea7
      Committed by Tejun Heo
      Add a script which can be used to generate device-specific iocost
      linear model coefficients.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8504dea7
    • blkcg: implement blk-iocost · 7caa4715
      Committed by Tejun Heo
      This patchset implements an IO-cost-model-based work-conserving
      proportional controller.
      
      While io.latency provides the capability to comprehensively prioritize
      and protect IOs depending on the cgroups, its protection is binary -
      the lowest-latency-target cgroup which is suffering is protected at
      the cost of all others.  In many use cases, including stacking
      multiple workload containers on a single system, it's necessary to
      distribute IO capacity at a finer granularity.
      
      One challenge of controlling IO resources is the lack of a trivially
      observable cost metric.  The most common metrics - bandwidth and iops
      - can be off by orders of magnitude depending on the device type and
      IO pattern.  However, the cost isn't a complete mystery.  Given
      several key attributes, we can make fairly reliable predictions of
      how expensive a given stream of IOs will be, at least relative to
      other IO patterns.
      
      The function which determines the cost of a given IO is the IO cost
      model for the device.  This controller distributes IO capacity based
      on the costs estimated by such a model.  The more accurate the cost
      model, the better, but the controller also adapts based on IO
      completion latency: as long as the relative costs across different IO
      patterns are consistent and sensible, it'll adapt to the actual
      performance of the device.
      
      Currently, the only implemented cost model is a simple linear one with
      a few sets of default parameters for different classes of device.
      This covers most common devices reasonably well.  All the
      infrastructure to tune and add different cost models is already in
      place and a later patch will also allow using bpf progs for cost
      models.
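
      As an illustration, here is a minimal sketch of such a linear cost
      model in Python (the coefficient names and values are hypothetical;
      the actual parameters live in blk-iocost.c and can be generated with
      tools/cgroup/iocost_coef_gen.py):

          # Hypothetical linear IO cost model: the absolute cost of one IO
          # is a fixed per-IO overhead plus a per-byte cost, with separate
          # coefficients for sequential and random IOs.
          def io_cost(nr_bytes, is_random,
                      fixed_per_io=1.0,
                      seq_per_byte=0.000002,
                      rand_per_byte=0.00001):
              per_byte = rand_per_byte if is_random else seq_per_byte
              return fixed_per_io + nr_bytes * per_byte

          # A 4KiB random IO is costed higher than a 4KiB sequential one:
          assert io_cost(4096, True) > io_cost(4096, False)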
      
      Please see the top comment in blk-iocost.c and documentation for
      more details.
      
      v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
          for a divide-by-zero bug in current_hweight() triggered by zero
          inuse_sum.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7caa4715
  2. Jul 15, 2019 (2 commits)
  3. Jul 13, 2019 (1 commit)
    • mm, memcg: introduce memory.events.local · 1e577f97
      Committed by Shakeel Butt
      The memory controller in cgroup v2 exposes a memory.events file for
      each memcg, which shows the number of times events like low, high,
      max, oom and oom_kill have happened for the whole tree rooted at that
      memcg.  Users can also poll or register notifications to monitor
      changes in that file.  Any event at any level of the tree rooted at a
      memcg will notify all the listeners along the path up to
      root_mem_cgroup.  There are existing users which depend on this
      behavior.
      
      However there are users which are only interested in the events
      happening at a specific level of the memcg tree and not in the events in
      the underlying tree rooted at that memcg.  One such use-case is a
      centralized resource monitor which can dynamically adjust the limits of
      the jobs running on a system.  The jobs can create their sub-hierarchy
      for their own sub-tasks.  The centralized monitor is only interested in
      the events at the top level memcgs of the jobs as it can then act and
      adjust the limits of the jobs.  Using the current memory.events for
      such a centralized monitor is very inconvenient.  The monitor keeps
      receiving events in which it is not interested, and to find out
      whether a received event is interesting, it has to read the
      memory.events files of the next level and compare them with the
      top-level one.  So, let's introduce memory.events.local, which shows
      and notifies only for the events at the memcg's own level.
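
      For example, a minimal sketch of such a monitor in Python (the cgroup
      path is hypothetical; cgroup control files signal changes via
      POLLPRI):

          import select

          # Hypothetical top-level job cgroup; memory.events.local only
          # fires for events at this exact level, not the subtree below it.
          path = "/sys/fs/cgroup/job/memory.events.local"

          with open(path) as f:
              p = select.poll()
              p.register(f.fileno(), select.POLLPRI)
              while True:
                  p.poll()      # block until the kernel signals a change
                  f.seek(0)     # rewind and re-read the new counters
                  print(f.read())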
      
      Now, do memory.stat and memory.pressure need their own local
      versions?  IMHO no, due to the no-internal-process constraint of
      cgroup v2.  The memory.stat file of the top-level memcg of a job
      shows the stats and vmevents of the whole tree.  The local stats or
      vmevents of the top-level memcg would only change if there were a
      process running in that memcg, but v2 does not allow that.  Similarly,
      for memory.pressure there will not be any process in the internal
      nodes and thus no chance of local pressure.
      
      Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e577f97
  4. Jul 10, 2019 (1 commit)
  5. Jun 02, 2019 (1 commit)
    • mm, memcg: consider subtrees in memory.events · 9852ae3f
      Committed by Chris Down
      memory.stat and other files already consider subtrees in their
      output, and we should too, so as not to present an inconsistent
      interface.

      The current situation is fairly confusing, because people interacting
      with cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion
      when debugging reclaim events under low, as currently these always
      read "0" at non-leaf memcg nodes, which frequently causes people to
      misdiagnose breach behaviour.  The same confusion applies to the
      other counters in this file when debugging issues.
      
      Aggregation is done at write time instead of at read time, since
      these counters aren't hot (unlike memory.stat, which is per-page and
      so aggregates at read time), and it makes sense to bundle this with
      the file notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
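
      A minimal sketch for reading these counters programmatically, using
      the same path as the example above:

          def read_memory_events(
                  path="/sys/fs/cgroup/system.slice/memory.events"):
              # Parse a flat-keyed cgroup file into {event_name: count}.
              with open(path) as f:
                  return {key: int(value)
                          for key, value in (line.split() for line in f)}

          # After this patch, events raised anywhere in the subtree show
          # up at this level as well:
          print(read_memory_events()["oom_kill"])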
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
      
      akpm: this is a behaviour change, so Cc:stable.  This is so that
      forthcoming distros which use cgroup v2 are more likely to pick up
      the revised behaviour.
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9852ae3f
  6. Jun 01, 2019 (1 commit)
    • cgroup: add cgroup_parse_float() · a5e112e6
      Committed by Tejun Heo
      cgroup already uses floating point for percent[ile] numbers and there
      are several controllers which want to take them as input.  Add a
      generic parse helper to handle inputs.
      
      Update the interface convention documentation about the use of
      percentage numbers.  While at it, also clarify the default time unit.
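
      A rough Python analog of what such a helper does (the kernel's
      cgroup_parse_float() parses into an integer scaled by a requested
      number of decimal places; this sketch ignores sign handling):

          def parse_float(s, dec_shift):
              # Parse "12.34"-style input into an integer scaled by
              # 10**dec_shift, e.g. parse_float("12.3", 2) == 1230.
              whole, _, frac = s.strip().partition(".")
              if len(frac) > dec_shift:
                  raise ValueError("too many fractional digits")
              frac = frac.ljust(dec_shift, "0")
              return int(whole or "0") * 10**dec_shift + int(frac or "0")

          assert parse_float("12.3", 2) == 1230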
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a5e112e6
  7. Apr 20, 2019 (1 commit)
  8. Mar 06, 2019 (1 commit)
  9. Feb 09, 2019 (1 commit)
    • Documentation: cgroup-v2: eliminate markup warnings · 34b43446
      Committed by Randy Dunlap
      Fix markup warnings in cgroup-v2.rst:
      
      Documentation/admin-guide/cgroup-v2.rst:1509: WARNING: Block quote ends without a blank line; unexpected unindent.
      Documentation/admin-guide/cgroup-v2.rst:1511: WARNING: Block quote ends without a blank line; unexpected unindent.
      Documentation/admin-guide/cgroup-v2.rst:1512: WARNING: Block quote ends without a blank line; unexpected unindent.
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: linux-doc@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      34b43446
  10. Dec 08, 2018 (1 commit)
  11. Nov 14, 2018 (1 commit)
    • cpuset: Minor cgroup2 interface updates · b1e3aeb1
      Committed by Tejun Heo
      * Rename the partition file from "cpuset.sched.partition" to
        "cpuset.cpus.partition".
      
      * When writing to the partition file, drop "0" and "1" and only accept
        "member" and "root".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Waiman Long <longman@redhat.com>
      b1e3aeb1
  12. Nov 09, 2018 (3 commits)
  13. Nov 02, 2018 (1 commit)
  14. Oct 27, 2018 (2 commits)
    • mm: don't raise MEMCG_OOM event due to failed high-order allocation · 7a1adfdd
      Committed by Roman Gushchin
      It was reported that on some of our machines containers were
      restarted with OOM symptoms without an obvious reason.  Although
      there was almost no memory pressure and plenty of page cache, the
      MEMCG_OOM event was raised occasionally, causing the container
      management software to think that an OOM had happened.  However, no
      tasks had been killed.
      
      The following investigation showed that the problem is caused by a
      failing attempt to charge a high-order page.  In such a case, the OOM
      killer is never invoked.  As shown below, it can happen under
      conditions which are very far from a real OOM: e.g.  there is plenty
      of clean page cache and no memory pressure.
      
      There is no sense in raising an OOM event in this case, as it might
      confuse a user and lead to wrong and excessive actions (e.g.
      restarting the workload, as in my case).
      
      Let's look at the charging path in try_charge().  If the memory usage
      is close to memory.max, which is absolutely natural for most memory
      cgroups, we try to reclaim some pages.  Even if we were able to
      reclaim enough memory for the allocation, the following check can
      fail due to a race with another concurrent allocation:
      
          if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
              goto retry;
      
      For regular pages the following condition will save us from triggering
      the OOM:
      
         if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
             goto retry;
      
      But for a high-order allocation this condition will intentionally
      fail.  The reason behind this is that we'll likely fall back to
      regular pages anyway, so it's ok and even preferred to return ENOMEM.
      
      In this case the idea of raising MEMCG_OOM looks dubious.
      
      Fix this by moving the MEMCG_OOM raising to mem_cgroup_oom() after
      the allocation order check, so that the event won't be raised for
      high-order allocations.  This change doesn't affect regular page
      allocation and charging.
      
      Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a1adfdd
    • psi: cgroup support · 2ce7135a
      Committed by Johannes Weiner
      On a system that executes multiple cgrouped jobs and independent
      workloads, we don't just care about the health of the overall system, but
      also that of individual jobs, so that we can ensure individual job health,
      fairness between jobs, or prioritize some jobs over others.
      
      This patch implements pressure stall tracking for cgroups.  In kernels
      with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
      and io.pressure files that track aggregate pressure stall times for only
      the tasks inside the cgroup.
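
      A quick sketch of consuming one of these files (each line has the
      form "some avg10=... avg60=... avg300=... total=..."; memory and io
      additionally expose a "full" line; the cgroup path is hypothetical):

          def read_pressure(path):
              # Parse a cgroup2 pressure file into {"some": {...}, ...}.
              out = {}
              with open(path) as f:
                  for line in f:
                      kind, *fields = line.split()
                      out[kind] = {k: float(v) for k, v in
                                   (field.split("=") for field in fields)}
              return out

          # e.g. the share of time tasks in this cgroup stalled on memory
          # over the last 10 seconds:
          print(read_pressure(
              "/sys/fs/cgroup/job/memory.pressure")["some"]["avg10"])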
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ce7135a
  15. Sep 22, 2018 (1 commit)
  16. Aug 23, 2018 (1 commit)
    • mm, oom: introduce memory.oom.group · 3d8b38eb
      Committed by Roman Gushchin
      For some workloads an intervention from the OOM killer can be painful.
      Killing a random task can bring the workload into an inconsistent state.
      
      Historically, there are two common solutions for this
      problem:
      1) enabling panic_on_oom,
      2) using a userspace daemon to monitor OOMs and kill
         all outstanding processes.
      
      Both approaches have their downsides: rebooting on each OOM is an
      obvious waste of capacity, and handling it all in userspace is tricky
      and requires an agent which monitors all cgroups for OOMs.
      
      In most cases an in-kernel after-OOM cleanup mechanism can eliminate
      the necessity of enabling panic_on_oom.  Also, it can simplify cgroup
      management for userspace applications.
      
      This commit introduces a new knob for cgroup v2 memory controller:
      memory.oom.group.  The knob determines whether the cgroup should be
      treated as an indivisible workload by the OOM killer.  If set, all tasks
      belonging to the cgroup or to its descendants (if the memory cgroup is not
      a leaf cgroup) are killed together or not at all.
      
      To determine which cgroup has to be killed, we traverse the cgroup
      hierarchy from the victim task's cgroup up to the OOMing cgroup (or
      root), looking for the highest-level cgroup with memory.oom.group set.
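
      A sketch of that selection walk in Python, over a hypothetical
      parent-linked cgroup structure (the attribute names are
      illustrative):

          def oom_group_root(victim_cgroup, oom_domain):
              # Walk from the victim's cgroup up to the OOMing cgroup (or
              # root), remembering the highest-level cgroup with oom.group
              # set.  Returns None if nothing on the path has it set.
              group = None
              cg = victim_cgroup
              while cg is not None:
                  if cg.oom_group:
                      group = cg        # highest-level match so far
                  if cg is oom_domain:
                      break
                  cg = cg.parent
              return group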
      
      Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
      an exception and are never killed.
      
      This patch doesn't change the OOM victim selection algorithm.
      
      Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d8b38eb
  17. Aug 02, 2018 (1 commit)
  18. Jul 18, 2018 (1 commit)
  19. Jul 09, 2018 (1 commit)
  20. Jun 08, 2018 (4 commits)
  21. May 11, 2018 (1 commit)
  22. Jan 17, 2018 (1 commit)
    • cgroup, docs: document the root cgroup behavior of cpu and io controllers · c4e0842b
      Committed by Maciej S. Szmigiero
      Currently, the cgroups v2 documentation contains only a generic
      remark that "How resource consumption in the root cgroup is governed
      is up to each controller", which doesn't really tell users much; they
      need to dig into the code and/or commit messages to learn the exact
      behavior.
      
      In cgroups v1, at least the blkio controller had its operation with
      respect to competition between child threads and child cgroups
      documented in blkio-controller.txt, with references to
      cfq-iosched.txt.  Also, the cgroups v2 documentation describes the v1
      behavior of both the cpu and blkio controllers in an "Issues with v1"
      section.
      
      Let's document this behavior also for cgroups v2 to make life easier for
      users.
      Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      c4e0842b
  23. Jan 03, 2018 (1 commit)
  24. Dec 14, 2017 (1 commit)
  25. Dec 06, 2017 (1 commit)
  26. Oct 27, 2017 (1 commit)
    • cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat · d41bf8c9
      Committed by Tejun Heo
      The basic cpu stats are currently shown with a "cpu." prefix in
      cgroup.stat, and the same information is duplicated in cpu.stat when
      the cpu controller is enabled.  This is ugly and not very scalable,
      as we want to expand the coverage of the stat information which is
      always available.
      
      This patch makes the cgroup core always create the "cpu.stat" file,
      show the basic cpu stats there, and call the cpu controller to show
      the extra stats when enabled.  This ensures that the same information
      isn't presented in multiple places and makes future expansion of the
      basic stats easier.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      d41bf8c9
  27. Sep 30, 2017 (1 commit)
    • sched: Implement interface for cgroup unified hierarchy · 0d593634
      Committed by Tejun Heo
      There are a couple of interface issues which can be addressed in the
      cgroup2 interface.
      
      * Stats from cpuacct being reported separately from the cpu stats.
      
      * Use of different time units.  Writable control knobs use
        microseconds, some stat fields use nanoseconds while other cpuacct
        stat fields use centiseconds.
      
      * Control knobs which can't be used in the root cgroup still show up
        in the root.
      
      * Control knob names and semantics aren't consistent with other
        controllers.
      
      This patchset implements the cpu controller's interface on cgroup2,
      adhering to the controller file conventions described in
      Documentation/cgroups/cgroup-v2.txt.  Overall, the following changes
      are made.
      
      * cpuacct is implicitly enabled and disabled by cpu, and its
        information is reported through "cpu.stat", which now uses
        microseconds for all time durations.  All time duration fields now
        have "_usec" appended to them for clarity.
      
        Note that cpuacct.usage_percpu is currently not included in
        "cpu.stat".  If this information is actually called for, it will be
        added later.
      
      * "cpu.shares" is replaced with "cpu.weight" and operates on the
        standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000).
        The weight is scaled to scheduler weight so that 100 maps to 1024
        and the ratio relationship is preserved - if weight is W and its
        scaled value is S, W / 100 == S / 1024.  While the mapped range is a
        bit smaller than the orignal scheduler weight range, the dead zones
        on both sides are relatively small and covers wider range than the
        nice value mappings.  This file doesn't make sense in the root
        cgroup and isn't created on root.
      
      * "cpu.weight.nice" is added. When read, it reads back the nice value
        which is closest to the current "cpu.weight".  When written, it sets
        "cpu.weight" to the weight value which matches the nice value.  This
        makes it easy to configure cgroups when they're competing against
        threads in threaded subtrees.
      
      * "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max"
        which contains both quota and period.
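
      As an illustration, a minimal sketch of the weight mapping described
      above:

          CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_DFL, CGROUP_WEIGHT_MAX = 1, 100, 10000

          def weight_to_sched(weight):
              # Map a cgroup2 cpu.weight value onto scheduler weight so
              # that the default 100 maps to 1024 and ratios are preserved.
              assert CGROUP_WEIGHT_MIN <= weight <= CGROUP_WEIGHT_MAX
              return weight * 1024 // 100

          assert weight_to_sched(100) == 1024                      # default
          assert weight_to_sched(200) == 2 * weight_to_sched(100)  # ratio kept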
      
      v4: - Use cgroup2 basic usage stat as the information source instead
            of cpuacct.
      
      v3: - Added "cpu.weight.nice" to allow using nice values when
            configuring the weight.  The feature is requested by PeterZ.
          - Merge the patch to enable threaded support on cpu and cpuacct.
          - Dropped the bits about getting rid of cpuacct from patch
            description as there is a pretty strong case for making cpuacct
            an implicit controller so that basic cpu usage stats are always
            available.
          - Documentation updated accordingly.  "cpu.rt.max" section is
            dropped for now.
      
      v2: - cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED
            for CFS bandwidth stats and also using raw division for u64.
            Use CONFIG_CFS_BANDWIDTH and do_div() instead.  "cpu.rt.max" is
            not included yet.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      0d593634
  28. Sep 25, 2017 (1 commit)
    • cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640
      Committed by Tejun Heo
      In cgroup1, while cpuacct isn't actually controlling any resources,
      it is a separate controller due to a combination of two factors -
      1. enabling the cpu controller has significant side effects, and
      2. we have to pick one of the hierarchies on which to account CPU
      usage.  The cpuacct controller is effectively used to designate a
      hierarchy on which to track CPU usage.
      
      cgroup2's unified hierarchy removes the second reason, and we can
      account basic CPU usage by default.  While we could use cpuacct for
      this purpose, both its interface and implementation leave a lot to be
      desired - it collects and exposes two sources of truth which don't
      agree with each other, and some of the exposed statistics don't make
      much sense.  Also, it propagates all the way up the hierarchy on each
      accounting event, which is unnecessary.
      
      This patch adds basic resource accounting mechanism to cgroup2's
      unified hierarchy and accounts CPU usages using it.
      
      * All accountings are done per-cpu and don't propagate immediately.
        It just bumps the per-cgroup per-cpu counters and links to the
        parent's updated list if not already on it.
      
      * On a read, the per-cpu counters are collected into the global ones
        and then propagated upwards.  Only the per-cpu counters which have
        changed since the last read are propagated.
      
      * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
        prefix.  Total usage is collected from scheduling events.  User/sys
        breakdown is sourced from tick sampling and adjusted to the usage
        using cputime_adjust().
      
      This keeps the accounting side hot path O(1) and per-cpu and the read
      side O(nr_updated_since_last_read).
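
      A toy model of that scheme in Python (all names are illustrative;
      the real implementation works on per-cpu data structures and
      locking, which this sketch ignores):

          class Cgroup:
              def __init__(self, parent=None, nr_cpus=4):
                  self.parent = parent
                  self.percpu = [0] * nr_cpus  # bumped on the hot path
                  self.usage = 0               # flushed total, incl. subtree
                  self.updated = []            # children with pending deltas

              def account(self, cpu, usec):
                  # O(1): bump one per-cpu counter and link this cgroup
                  # (and, transitively, its ancestors) onto the parent's
                  # updated list if not already on it.
                  self.percpu[cpu] += usec
                  cg = self
                  while cg.parent is not None and cg not in cg.parent.updated:
                      cg.parent.updated.append(cg)
                      cg = cg.parent

              def read(self):
                  # O(nr_updated): flush only what changed since last read.
                  delta = sum(self.percpu)
                  self.percpu = [0] * len(self.percpu)
                  for child in self.updated:
                      delta += child.read()
                  self.updated = []
                  self.usage += delta
                  return delta

          root = Cgroup(); child = Cgroup(parent=root)
          child.account(cpu=1, usec=500)
          root.read()
          assert root.usage == 500 and child.usage == 500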
      
      v2: Minor changes and documentation updates as suggested by Waiman and
          Roman.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      041cd640
  29. Aug 03, 2017 (2 commits)
    • cgroup: add cgroup.stat interface with basic hierarchy stats · ec39225c
      Committed by Roman Gushchin
      A cgroup can consume resources even after being deleted by a user.
      For example, writing back dirty pages should be accounted and
      limited, even though the corresponding cgroup might contain no
      processes and have been deleted by the user.
      
      In the current implementation a cgroup can remain in such a "dying"
      state for an undefined amount of time.  For instance, a memory cgroup
      can outlive its deletion if it contains a page mlocked by a process
      belonging to another cgroup.
      
      Although the lifecycle of a dying cgroup is out of the user's
      control, it's important to have some insight into what's going on
      under the hood.  In particular, it's handy to have a counter which
      makes it possible to detect css leaks.
      
      To solve this problem, add a cgroup.stat interface to
      the base cgroup control files with the following metrics:
      
      nr_descendants		total number of visible descendant cgroups
      nr_dying_descendants	total number of dying descendant cgroups
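
      A minimal sketch of watching for such leaks (a steadily growing
      number of dying descendants while the hierarchy is otherwise stable
      hints at a css leak):

          def read_cgroup_stat(path="/sys/fs/cgroup/cgroup.stat"):
              # Parse the flat-keyed cgroup.stat file into {name: count}.
              with open(path) as f:
                  return {key: int(value)
                          for key, value in (line.split() for line in f)}

          stat = read_cgroup_stat()
          print(stat["nr_descendants"], stat["nr_dying_descendants"])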
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      ec39225c
    • cgroup: implement hierarchy limits · 1a926e0b
      Committed by Roman Gushchin
      Creating cgroup hierarchies of unreasonable size can affect overall
      system performance.  A user might want to limit the size of the
      cgroup hierarchy.  This is especially important if a user is
      delegating some cgroup sub-tree.
      
      To address this issue, introduce the ability to control the size of
      the cgroup hierarchy.

      The cgroup.max.descendants control file allows setting the maximum
      allowed number of descendant cgroups.  The cgroup.max.depth file
      controls the maximum depth of the cgroup tree.  Both are single-value
      r/w files, with a "max" default value.
      
      The control files exist on each hierarchy level (including the root).
      When a new cgroup is created, we check the total descendants and
      depth limits on each level, and the new cgroup is created only if
      none of them is exceeded.

      Only live cgroups are counted; removed (dying) cgroups are ignored.
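
      For instance, a delegated sub-tree could be capped like this (the
      path is hypothetical; writing "max" restores the default, i.e. no
      limit):

          base = "/sys/fs/cgroup/delegated"
          # Allow at most 100 descendant cgroups, nested at most 5 deep.
          with open(base + "/cgroup.max.descendants", "w") as f:
              f.write("100")
          with open(base + "/cgroup.max.depth", "w") as f:
              f.write("5")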
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      1a926e0b
  30. Jul 26, 2017 (1 commit)
    • cgroup: remove unnecessary empty check when enabling threaded mode · 918a8c2c
      Committed by Tejun Heo
      cgroup_enable_threaded() checks that the cgroup doesn't have any
      tasks or children and fails the operation if so.  This test is
      unnecessary: the first part is already checked by
      cgroup_can_be_thread_root(), and the second isn't needed at all.  The
      children check actually causes a behavioral oddity.  Please consider
      the following hierarchy.  All cgroups are domains.
      
          A
         / \
        B   C
             \
              D
      
      If B is made threaded, C and D become invalid domains.  Due to the
      no-children restriction, threaded mode can't be enabled on C.  For C
      and D, the only thing the user can do is removal.
      
      There is no reason for this restriction.  Remove it.
      Acked-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      918a8c2c
  31. Jul 21, 2017 (1 commit)
    • cgroup: implement cgroup v2 thread support · 8cfd8147
      Committed by Tejun Heo
      This patch implements cgroup v2 thread support.  The goal of the
      thread mode is to support hierarchical accounting and control at
      thread granularity while staying inside the resource domain model,
      which allows coordination across different resource controllers and
      handling of anonymous resource consumption.
      
      A cgroup is always created as a domain and can be made threaded by
      writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
      becomes a member of a threaded subtree which is anchored at the
      closest ancestor which isn't threaded.
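
      A minimal sketch of setting this up (the cgroup paths are
      hypothetical):

          import os

          base = "/sys/fs/cgroup/app"        # a regular domain cgroup
          os.makedirs(base + "/workers", exist_ok=True)
          with open(base + "/workers/cgroup.type", "w") as f:
              f.write("threaded")            # joins/anchors a threaded subtree
          # Individual thread IDs can now be written to the cgroup.threads
          # files inside the threaded subtree.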
      
      The threads of the processes which are in a threaded subtree can be
      placed anywhere without being restricted by process granularity or
      no-internal-process constraint.  Note that the threads aren't allowed
      to escape to a different threaded subtree.  To be used inside a
      threaded subtree, a controller should explicitly support threaded mode
      and be able to handle internal competition in the way which is
      appropriate for the resource.
      
      The root of a threaded subtree, the nearest ancestor which isn't
      threaded, is called the threaded domain and serves as the resource
      domain for the whole subtree.  This is the last cgroup where domain
      controllers are operational and where all the domain-level resource
      consumptions in the subtree are accounted.  This allows threaded
      controllers to operate at thread granularity when requested while
      staying inside the scope of system-level resource distribution.
      
      As the root cgroup is exempt from the no-internal-process constraint,
      it can serve as both a threaded domain and a parent to normal cgroups,
      so, unlike non-root cgroups, the root cgroup can have both domain and
      threaded children.
      
      Internally, in a threaded subtree, each css_set has its ->dom_cset
      pointing to a matching css_set which belongs to the threaded domain.
      This ensures that the thread-root-level cgroup_subsys_state for all
      threaded controllers is readily accessible for domain-level
      operations.
      
      This patch enables threaded mode for the pids and perf_events
      controllers.  Neither has to worry about domain-level resource
      consumption, and it's enough to simply set the flag.
      
      For more details on the interface and behavior of the thread mode,
      please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
      by this patch.
      
      v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
            Spotted by Waiman.
          - Documentation updated as suggested by Waiman.
          - cgroup.type content slightly reformatted.
          - Mark the debug controller threaded.
      
      v4: - Updated to the general idea of marking specific cgroups
            domain/threaded as suggested by PeterZ.
      
      v3: - Dropped "join" and always make mixed children join the parent's
            threaded subtree.
      
      v2: - After discussions with Waiman, support for mixed thread mode is
            added.  This should address the issue that Peter pointed out
            where any nesting should be avoided for thread subtrees while
            coexisting with other domain cgroups.
          - Enabling / disabling thread mode now piggybacks on the existing
            control mask update mechanism.
          - Bug fixes and cleanup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8cfd8147