1. 09 June 2020, 1 commit
  2. 03 June 2020, 2 commits
    • mm/memcg: automatically penalize tasks with high swap use · 4b82ab4f
      Committed by Jakub Kicinski
      Add a memory.swap.high knob, which can be used to protect the system
      from SWAP exhaustion.  The mechanism used for penalizing is similar to
      memory.high penalty (sleep on return to user space).
      
      That is not to say that the knob itself is equivalent to memory.high.
      The objective is more to protect the system from potentially buggy tasks
      consuming a lot of swap and impacting other tasks, or even bringing the
      whole system to a standstill with complete SWAP exhaustion, hopefully
      without the need to find per-task hard limits.
      
      Slowing misbehaving tasks down gradually allows user space oom killers
      or other protection mechanisms to react.  oomd and earlyoom already do
      killing based on swap exhaustion, and memory.swap.high protection will
      help implement such userspace oom policies more reliably.
      
      We can use one counter for number of pages allocated under pressure to
      save struct task space and avoid two separate hierarchy walks on the hot
      path.  The exact overage is calculated on return to user space, anyway.
      
      Take the new high limit into account when determining if swap is "full".
      Borrowing the explanation from Johannes:
      
        The idea behind "swap full" is that as long as the workload has plenty
        of swap space available and it's not changing its memory contents, it
        makes sense to generously hold on to copies of data in the swap device,
        even after the swapin.  A later reclaim cycle can drop the page without
        any IO.  Trading disk space for IO.
      
        But the only two ways to reclaim a swap slot are when it is faulted
        in and the references go away, or by scanning the virtual address space
        like swapoff does - which is very expensive (one could argue it's too
        expensive even for swapoff; it's often more practical to just reboot).
      
        So at some point in the fill level, we have to start freeing up swap
        slots on fault/swapin.  Otherwise we could eventually run out of swap
        slots while they're filled with copies of data that is also in RAM.
      
        We don't want to OOM a workload because its available swap space is
        filled with redundant cache.
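
      A minimal usage sketch (the cgroup path and the 1G value are
      examples, not taken from the commit):

          # Cap a job's swap consumption; tasks pushing the subtree past
          # the limit are throttled on return to user space, not killed.
          mkdir -p /sys/fs/cgroup/job
          echo 1G > /sys/fs/cgroup/job/memory.swap.high
          echo $$ > /sys/fs/cgroup/job/cgroup.procs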
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200527195846.102707-5-kuba@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: add workingset_restore in memory.stat · a6f5576b
      Committed by Yafang Shao
      There's a new workingset counter introduced in commit 1899ad18 ("mm:
      workingset: tell cache transitions from workingset thrashing").  With
      the help of this counter we can tell whether the workingset is
      transitioning or thrashing.  To leverage the benefit of this counter in
      memcg, introduce it into memory.stat, so we can better understand the
      workingset of the workload inside a memcg.
      
      Below is a verification of this new counter in memory.stat.  Read a 1G
      file into memory, then read it again so these pages become active
      (memory.max is greater than the file size).  The counters in
      memory.stat will be:
      
      	inactive_file 0
      	active_file 1073639424
      
      	workingset_refault 0
      	workingset_activate 0
      	workingset_restore 0
      	workingset_nodereclaim 0
      
      Trigger memcg reclaim by setting a lower value on memory.high; some
      pages will be demoted to the inactive list, and some pages on the
      inactive list will be evicted to storage.
      
      	inactive_file 498094080
      	active_file 310063104
      
      	workingset_refault 0
      	workingset_activate 0
      	workingset_restore 0
      	workingset_nodereclaim 0
      
      Then restore memory.high and read the file into memory again.  As a
      result, the workingset transition occurs.  Below is the result of this
      transition:
      
      	inactive_file 498094080
      	active_file 575397888
      
      	workingset_refault 64746
      	workingset_activate 64746
      	workingset_restore 64746
      	workingset_nodereclaim 0
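
      The verification above can be reproduced with a rough script along
      these lines (the cgroup path and file name are examples):

          CG=/sys/fs/cgroup/test
          mkdir -p $CG && echo $$ > $CG/cgroup.procs
          cat /data/file1g > /dev/null    # first read: populate page cache
          cat /data/file1g > /dev/null    # second read: activate the pages
          echo 512M > $CG/memory.high     # trigger memcg reclaim
          echo max > $CG/memory.high      # recover the limit
          cat /data/file1g > /dev/null    # refault; restores are counted
          grep workingset $CG/memory.stat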
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Link: http://lkml.kernel.org/r/20200504153522.11553-1-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 28 May 2020, 1 commit
    • cgroup: add cpu.stat file to root cgroup · 936f2a70
      Committed by Boris Burkov
      Currently, the root cgroup does not have a cpu.stat file. Add one which
      is consistent with /proc/stat to capture global cpu statistics that
      might not fall under cgroup accounting.
      
      We haven't done this in the past because the data are already presented
      in /proc/stat and we didn't want to add overhead from collecting root
      cgroup stats when cgroups are configured, but no cgroups have been
      created.
      
      By keeping the data consistent with /proc/stat, I think we avoid the
      first problem, while improving the usability of cgroup stats.
      We avoid the second problem by computing the contents of cpu.stat from
      existing data collected for /proc/stat anyway.
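
      Reading the new root file returns the usual flat-keyed cpu.stat
      entries (sample values, for illustration only):

          $ cat /sys/fs/cgroup/cpu.stat
          usage_usec 8647256131
          user_usec 4842687549
          system_usec 3804568582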
      Signed-off-by: Boris Burkov <boris@bur.io>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  4. 03 April 2020, 1 commit
    • mm: memcontrol: recursive memory.low protection · 8a931f80
      Committed by Johannes Weiner
      Right now, the effective protection of any given cgroup is capped by its
      own explicit memory.low setting, regardless of what the parent says.  The
      reasons for this are mostly historical and ease of implementation: to make
      delegation of memory.low safe, effective protection is the min() of all
      memory.low up the tree.
      
      Unfortunately, this limitation makes it impossible to protect an entire
      subtree from another without forcing the user to make explicit protection
      allocations all the way to the leaf cgroups - something that is highly
      undesirable in real life scenarios.
      
      Consider memory in a data center host.  At the cgroup top level, we have a
      distinction between system management software and the actual workload the
      system is executing.  Both branches are further subdivided into individual
      services, job components etc.
      
      We want to protect the workload as a whole from the system management
      software, but that doesn't mean we want to protect and prioritize
      individual workloads with respect to each other.  Their memory demand
      can vary over time, and we'd want the VM to simply cache the hottest
      data within the workload subtree.  Yet, the current memory.low
      limitations force us to
      allocate a fixed amount of protection to each workload component in order
      to get protection from system management software in general.  This
      results in very inefficient resource distribution.
      
      Another concern with mandating downward allocation is that, as the
      complexity of the cgroup tree grows, it gets harder for the lower levels
      to be informed about decisions made at the host-level.  Consider a
      container inside a namespace that in turn creates its own nested tree of
      cgroups to run multiple workloads.  It'd be extremely difficult to
      configure memory.low parameters in those leaf cgroups that on one hand
      balance pressure among siblings as the container desires, while also
      reflecting the host-level protection from e.g.  rpm upgrades, that lie
      beyond one or more delegation and namespacing points in the tree.
      
      It's highly unusual from a cgroup interface POV that nested levels have to
      be aware of and reflect decisions made at higher levels for them to be
      effective.
      
      To enable such use cases and scale configurability for complex trees, this
      patch implements a resource inheritance model for memory that is similar
      to how the CPU and the IO controller implement work-conserving resource
      allocations: a share of a resource allocated to a subtree always applies to
      the entire subtree recursively, while allowing, but not mandating,
      children to further specify distribution rules.
      
      That means that if protection is explicitly allocated among siblings,
      those configured shares are being followed during page reclaim just like
      they are now.  However, if the memory.low set at a higher level is not
      fully claimed by the children in that subtree, the "floating" remainder is
      applied to each cgroup in the tree in proportion to its size.  Since
      reclaim pressure is applied in proportion to size as well, each child in
      that tree gets the same boost, and the effect is neutral among siblings -
      with respect to each other, they behave as if no memory control was
      enabled at all, and the VM simply balances the memory demands optimally
      within the subtree.  But collectively those cgroups enjoy a boost over the
      cgroups in neighboring trees.
      
      E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
      it's not getting a share of the hierarchically assigned resource, just
      that it doesn't claim a fixed amount of it to protect from its siblings.
      
      This allows us to recursively protect one subtree (workload) from another
      (system management), while letting subgroups compete freely among each
      other - without having to assign fixed shares to each leaf, and without
      nested groups having to echo higher-level settings.
      
      The floating protection composes naturally with fixed protection.
      Consider the following example tree:
      
                    A             A: low = 2G
                   / \            A1: low = 1G
                  A1  A2          A2: low = 0G
      
      As outside pressure is applied to this tree, A1 will enjoy a fixed
      protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
      evenly among A1 and A2, coming out to 1.5G and 0.5G.
      
      There is a slight risk of regressing theoretical setups where the
      top-level cgroups don't know about the true budgeting and set bogusly high
      "bypass" values that are meaningfully allocated down the tree.  Such
      setups would rely on unclaimed protection to be discarded, and
      distributing it would change the intended behavior.  Be safe and hide the
      new behavior behind a mount option, 'memory_recursiveprot'.
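
      A sketch of configuring the example tree above with the new mount
      option (the mount invocation and cgroup names are illustrative):

          mount -o remount,memory_recursiveprot none /sys/fs/cgroup
          echo 2G > /sys/fs/cgroup/A/memory.low     # subtree-wide protection
          echo 1G > /sys/fs/cgroup/A/A1/memory.low  # fixed share for A1
          echo 0 > /sys/fs/cgroup/A/A2/memory.low   # A2 still floats ~0.5G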
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 03 March 2020, 5 commits
  6. 17 December 2019, 1 commit
    • mm: hugetlb controller for cgroups v2 · faced7e0
      Committed by Giuseppe Scrivano
      In the effort of supporting cgroups v2 in Kubernetes, I stumbled on
      the lack of a hugetlb controller.
      
      When the controller is enabled, it exposes four new files for each
      hugetlb size on non-root cgroups:
      
      - hugetlb.<hugepagesize>.current
      - hugetlb.<hugepagesize>.max
      - hugetlb.<hugepagesize>.events
      - hugetlb.<hugepagesize>.events.local
      
      The differences from the legacy hierarchy are the file names and the
      use of the value "max" instead of "-1" to disable a limit.
      
      The file .limit_in_bytes is renamed to .max.
      
      The file .usage_in_bytes is renamed to .current.
      
      .failcnt is not provided as a single file anymore, but its value can
      be read through the new flat-keyed files .events and .events.local,
      through the "max" key.
      Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  7. 01 December 2019, 1 commit
  8. 18 November 2019, 1 commit
  9. 08 October 2019, 1 commit
    • mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Committed by Chris Down
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups exceed their protection thresholds further
      and further.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too much
      worry.  Any undesirable behaviour, such as too much or too little reclaim
      pressure on the workload or system will be proportional to how far our
      estimation is off.  This means we can set memory.low much more
      conservatively and thus waste less resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
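
      A hedged sketch of observing the ramp on a live cgroup (the path is
      an example, and per-cgroup pgscan counters in memory.stat are
      assumed to be available):

          CG=/sys/fs/cgroup/workload
          echo 17G > $CG/memory.low     # below the ~19G workingset
          sleep 60
          grep pgscan $CG/memory.stat   # scan rate now scales with overage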
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented with stepping memory.low protection
         further and further down on a workload that floats around a 19G
         workingset when under memory.low protection, watching page scan
         rates for the workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (with a kernel containing this
         patch) ramps up page scanning significantly more gradually than the
         control kernel (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 01 October 2019, 1 commit
  11. 03 September 2019, 1 commit
    • sched/uclamp: Extend CPU's cgroup controller · 2480c093
      Committed by Patrick Bellasi
      The cgroup CPU bandwidth controller allows assigning a specified
      (maximum) bandwidth to the tasks of a group. However, this bandwidth is
      defined and enforced only on a temporal basis, without considering the
      actual frequency a CPU is running at. Thus, the amount of computation
      completed by a task within an allocated bandwidth can be very different
      depending on the actual frequency the CPU is running that task at.
      The amount of computation can also be affected by the specific CPU a
      task is running on, especially on asymmetric capacity systems like
      Arm's big.LITTLE.
      
      With the availability of schedutil, the scheduler is now able
      to drive frequency selections based on actual task utilization.
      Moreover, the utilization clamping support provides a mechanism to
      bias the frequency selection operated by schedutil depending on
      constraints assigned to the tasks currently RUNNABLE on a CPU.
      
      Given the mechanisms described above, it is now possible to extend the
      cpu controller to specify the minimum (or maximum) utilization which
      should be considered for tasks RUNNABLE on a cpu.
      This makes it possible to better define the actual computational
      power assigned to task groups, thus improving the cgroup CPU bandwidth
      controller, which is currently based just on time constraints.
      
      Extend the CPU controller with a couple of new attributes, uclamp.{min,max},
      which allow enforcing utilization boosting and capping for all the
      tasks in a group.
      
      Specifically:
      
      - uclamp.min: defines the minimum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run at least at a
      	      minimum frequency which corresponds to the uclamp.min
      	      utilization
      
      - uclamp.max: defines the maximum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run up to a
      	      maximum frequency which corresponds to the uclamp.max
      	      utilization
      
      These attributes:
      
      a) are available only for non-root nodes, both on default and legacy
         hierarchies, while system wide clamps are defined by a generic
         interface which does not depend on cgroups. This system wide
         interface enforces constraints on tasks in the root node.
      
      b) enforce effective constraints at each level of the hierarchy which
         are a restriction of the group requests considering its parent's
         effective constraints. Root group effective constraints are defined
         by the system wide interface.
         This mechanism allows each (non-root) level of the hierarchy to:
         - request whatever clamp values it would like to get
         - effectively get only up to the maximum amount allowed by its parent
      
      c) have higher priority than task-specific clamps, defined via
         sched_setattr(), thus making it possible to control and restrict
         task requests.
      
      Add two new attributes to the cpu controller to collect "requested"
      clamp values. Allow that at each non-root level of the hierarchy.
      Keep it simple by not caring now about "effective" values computation
      and propagation along the hierarchy.
      
      Update sysctl_sched_uclamp_handler() to use the newly introduced
      uclamp_mutex, so that we serialize system default updates with
      cgroup-related updates.
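
      In the mainline interface these attributes surface as cpu.uclamp.min
      and cpu.uclamp.max, taking percentages of maximum capacity (a sketch;
      the cgroup name is an example):

          echo +cpu > /sys/fs/cgroup/cgroup.subtree_control
          mkdir /sys/fs/cgroup/ui
          echo 20 > /sys/fs/cgroup/ui/cpu.uclamp.min   # boost to at least ~20%
          echo 80 > /sys/fs/cgroup/ui/cpu.uclamp.max   # cap at most ~80%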
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 29 August 2019, 2 commits
    • blkcg: add tools/cgroup/iocost_coef_gen.py · 8504dea7
      Committed by Tejun Heo
      Add a script which can be used to generate device-specific iocost
      linear model coefficients.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: implement blk-iocost · 7caa4715
      Committed by Tejun Heo
      This patchset implements an IO-cost-model-based, work-conserving
      proportional controller.
      
      While io.latency provides the capability to comprehensively prioritize
      and protect IOs depending on the cgroups, its protection is binary -
      the lowest latency target cgroup which is suffering is protected at
      the cost of all others.  In many use cases including stacking multiple
      workload containers in a single system, it's necessary to distribute
      IO capacity with better granularity.
      
      One challenge of controlling IO resources is the lack of a trivially
      observable cost metric.  The most common metrics - bandwidth and iops -
      can be off by orders of magnitude depending on the device type and
      IO pattern.  However, the cost isn't a complete mystery.  Given
      several key attributes, we can make fairly reliable predictions on how
      expensive a given stream of IOs would be, at least compared to other
      IO patterns.
      
      The function which determines the cost of a given IO is the IO cost
      model for the device.  This controller distributes IO capacity based
      on the costs estimated by such a model.  The more accurate the cost
      model, the better, but the controller adapts based on IO completion
      latency, and as long as the relative costs across different IO
      patterns are consistent and sensible, it'll adapt to the actual
      performance of the device.
      
      Currently, the only implemented cost model is a simple linear one with
      a few sets of default parameters for different classes of device.
      This covers most common devices reasonably well.  All the
      infrastructure to tune and add different cost models is already in
      place and a later patch will also allow using bpf progs for cost
      models.
      
      Please see the top comment in blk-iocost.c and documentation for
      more details.
      
      v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
          for a divide-by-zero bug in current_hweight() triggered by zero
          inuse_sum.
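
      A minimal configuration sketch (the device number, cgroup names,
      and weights are examples):

          # enable iocost QoS control for one device, then distribute
          # its capacity between two cgroups via io.weight
          echo '8:16 enable=1' > /sys/fs/cgroup/io.cost.qos
          echo 'default 200' > /sys/fs/cgroup/fast/io.weight
          echo 'default 50' > /sys/fs/cgroup/slow/io.weight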
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 15 July 2019, 2 commits
  14. 13 July 2019, 1 commit
    • mm, memcg: introduce memory.events.local · 1e577f97
      Committed by Shakeel Butt
      The memory controller in cgroup v2 exposes memory.events file for each
      memcg which shows the number of times events like low, high, max, oom
      and oom_kill have happened for the whole tree rooted at that memcg.
      Users can also poll or register a notification to monitor changes in
      that file.  Any event at any level of the tree rooted at memcg will
      notify all the listeners along the path up to root_mem_cgroup.  There
      are existing users which depend on this behavior.
      
      However there are users which are only interested in the events
      happening at a specific level of the memcg tree and not in the events in
      the underlying tree rooted at that memcg.  One such use-case is a
      centralized resource monitor which can dynamically adjust the limits of
      the jobs running on a system.  The jobs can create their sub-hierarchy
      for their own sub-tasks.  The centralized monitor is only interested in
      the events at the top level memcgs of the jobs as it can then act and
      adjust the limits of the jobs.  Using the current memory.events for such
      a centralized monitor is very inconvenient.  The monitor will keep
      receiving events it is not interested in, and to find out whether a
      received event is interesting, it has to read the memory.events files
      of the next level and compare them with the top-level one.  So, let's
      introduce memory.events.local, which shows and notifies for the events
      at the memcg level only.
      
      Now, do memory.stat and memory.pressure need local versions?  IMHO
      no, due to the no-internal-process constraint of cgroup v2.  The
      memory.stat file of the top level memcg of a job shows the stats and
      vmevents of the whole tree.  The local stats or vmevents of the top level
      memcg will only change if there is a process running in that memcg but v2
      does not allow that.  Similarly for memory.pressure there will not be any
      process in the internal nodes and thus no chance of local pressure.
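
      A centralized monitor interested only in a job's top-level events
      can then watch the new file instead (the path is an example):

          $ cat /sys/fs/cgroup/job/memory.events.local
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0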
      
      Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 10 July 2019, 1 commit
  16. 02 June 2019, 1 commit
    • mm, memcg: consider subtrees in memory.events · 9852ae3f
      Committed by Chris Down
      memory.stat and other files already consider subtrees in their output, and
      we should too in order to not present an inconsistent interface.
      
      The current situation is fairly confusing, because people interacting with
      cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion when
      debugging reclaim events under low, as currently these always read "0" at
      non-leaf memcg nodes, which frequently causes people to misdiagnose breach
      behaviour.  The same confusion applies to other counters in this file when
      debugging issues.
      
      Aggregation is done at write time instead of at read-time since these
      counters aren't hot (unlike memory.stat which is per-page, so it does it
      at read time), and it makes sense to bundle this with the file
      notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
      
      akpm: this is a behaviour change, so Cc: stable.  This is so that
      forthcoming distros which use cgroup v2 are more likely to pick up the
      revised behaviour.
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 01 June 2019, 1 commit
    • cgroup: add cgroup_parse_float() · a5e112e6
      Committed by Tejun Heo
      cgroup already uses floating point for percent[ile] numbers and there
      are several controllers which want to take them as input.  Add a
      generic parse helper to handle inputs.
      
      Update the interface convention documentation about the use of
      percentage numbers.  While at it, also clarify the default time unit.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  18. 20 April 2019, 1 commit
  19. 06 March 2019, 1 commit
  20. 09 February 2019, 1 commit
    • Documentation: cgroup-v2: eliminate markup warnings · 34b43446
      Committed by Randy Dunlap
      Fix markup warnings in cgroup-v2.rst:
      
      Documentation/admin-guide/cgroup-v2.rst:1509: WARNING: Block quote ends without a blank line; unexpected unindent.
      Documentation/admin-guide/cgroup-v2.rst:1511: WARNING: Block quote ends without a blank line; unexpected unindent.
      Documentation/admin-guide/cgroup-v2.rst:1512: WARNING: Block quote ends without a blank line; unexpected unindent.
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: linux-doc@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
  21. 08 December 2018, 1 commit
  22. 14 November 2018, 1 commit
    • cpuset: Minor cgroup2 interface updates · b1e3aeb1
      Committed by Tejun Heo
      * Rename the partition file from "cpuset.sched.partition" to
        "cpuset.cpus.partition".
      
      * When writing to the partition file, drop "0" and "1" and only accept
        "member" and "root".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Waiman Long <longman@redhat.com>
  23. 09 November 2018, 3 commits
  24. 02 November 2018, 1 commit
  25. 27 October 2018, 2 commits
    • mm: don't raise MEMCG_OOM event due to failed high-order allocation · 7a1adfdd
      Committed by Roman Gushchin
      It was reported that on some of our machines containers were restarted
      with OOM symptoms without an obvious reason.  Although there was almost
      no memory pressure and plenty of page cache, the MEMCG_OOM event was
      raised occasionally, causing the container management software to think
      that an OOM had happened.  However, no tasks had been killed.
      
      The following investigation showed that the problem is caused by a failing
      attempt to charge a high-order page.  In such case, the OOM killer is
      never invoked.  As shown below, it can happen under conditions, which are
      very far from a real OOM: e.g.  there is plenty of clean page cache and no
      memory pressure.
      
      There is no sense in raising an OOM event in this case, as it might
      confuse a user and lead to wrong and excessive actions (e.g.  restart the
      workload, as in my case).
      
      Let's look at the charging path in try_charge().  If the memory usage is
      near memory.max, which is absolutely natural for most memory cgroups, we
      try to reclaim some pages.  Even if we were able to reclaim enough memory
      for the allocation, the following check can fail due to a race with
      another concurrent allocation:
      
          if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
              goto retry;
      
      For regular pages the following condition will save us from triggering
      the OOM:
      
         if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
             goto retry;
      
      But for a high-order allocation this condition will intentionally fail.
      The reason behind this is that we'll likely fall back to regular pages
      anyway, so it's ok and even preferred to return ENOMEM.
      
      In this case the idea of raising MEMCG_OOM looks dubious.
      
      Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation
      order check, so that the event won't be raised for high order allocations.
      This change doesn't affect regular pages allocation and charging.
      
      Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • psi: cgroup support · 2ce7135a
      Committed by Johannes Weiner
      On a system that executes multiple cgrouped jobs and independent
      workloads, we don't just care about the health of the overall system, but
      also that of individual jobs, so that we can ensure individual job health,
      fairness between jobs, or prioritize some jobs over others.
      
      This patch implements pressure stall tracking for cgroups.  In kernels
      with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
      and io.pressure files that track aggregate pressure stall times for only
      the tasks inside the cgroup.
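
      For example (the path is illustrative; the line format mirrors the
      system-wide /proc/pressure files):

          $ cat /sys/fs/cgroup/job/memory.pressure
          some avg10=0.00 avg60=0.00 avg300=0.00 total=0
          full avg10=0.00 avg60=0.00 avg300=0.00 total=0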
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 22 September 2018, 1 commit
  27. 23 August 2018, 1 commit
    • mm, oom: introduce memory.oom.group · 3d8b38eb
      Committed by Roman Gushchin
      For some workloads an intervention from the OOM killer can be painful.
      Killing a random task can bring the workload into an inconsistent state.
      
      Historically, there have been two common solutions for this
      problem:
      1) enabling panic_on_oom,
      2) using a userspace daemon to monitor OOMs and kill
         all outstanding processes.
      
      Both approaches have their downsides: rebooting on each OOM is an obvious
      waste of capacity, and handling it all in userspace is tricky and requires
      a userspace agent, which will monitor all cgroups for OOMs.
      
      In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
      the necessity of enabling panic_on_oom.  Also, it can simplify the cgroup
      management for userspace applications.
      
      This commit introduces a new knob for cgroup v2 memory controller:
      memory.oom.group.  The knob determines whether the cgroup should be
      treated as an indivisible workload by the OOM killer.  If set, all tasks
      belonging to the cgroup or to its descendants (if the memory cgroup is not
      a leaf cgroup) are killed together or not at all.
      
      To determine which cgroup has to be killed, we traverse the cgroup
      hierarchy from the victim task's cgroup up to the OOMing cgroup (or root),
      looking for the highest-level cgroup with memory.oom.group set.
      
      Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
      an exception and are never killed.
      
      This patch doesn't change the OOM victim selection algorithm.
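
      A usage sketch (the cgroup path is an example):

          # on OOM, kill every task in the job as one unit, instead of
          # picking a single victim (oom_score_adj == -1000 is spared)
          echo 1 > /sys/fs/cgroup/job/memory.oom.group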
      
      Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  28. 02 August 2018, 1 commit
  29. 18 July 2018, 1 commit
  30. 09 July 2018, 1 commit