1. 13 August 2021, 1 commit
    • cgroup/cpuset: Enable memory migration for cpuset v2 · ee9707e8
      Committed by Waiman Long
      When a user changes cpuset.cpus, each task in a v2 cpuset will be moved
      to one of the new cpus if it is not there already. For memory, however,
      they won't be migrated to the new nodes when cpuset.mems changes. This is
      an inconsistency in behavior.
      
      In cpuset v1, there is a memory_migrate control file to enable such
      behavior by setting the CS_MEMORY_MIGRATE flag. Make it the default
      for cpuset v2 so that we have a consistent set of behavior for both
      cpus and memory.
      
      There is certainly a cost to making memory migration the default, but it
      is a one-time cost that shouldn't really matter as long as cpuset.mems
      isn't changed frequently.  Update the cgroup-v2.rst file to document the
      new behavior and recommend against changing cpuset.mems frequently.
      
      Since there won't be any concurrent access to the newly allocated cpuset
      structure in cpuset_css_alloc(), we can use the cheaper non-atomic
      __set_bit() instead of the more expensive atomic set_bit().
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ee9707e8
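      A minimal userspace sketch of the behavior this commit makes the default:
      writing a new node list to a v2 cpuset's cpuset.mems now migrates the
      member tasks' pages as well. The cgroup path and node number below are
      illustrative assumptions, not part of the commit.

        /* Move cpuset "myjob" to NUMA node 1; its tasks' memory follows. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                const char *path = "/sys/fs/cgroup/myjob/cpuset.mems";
                int fd = open(path, O_WRONLY);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                /* With CS_MEMORY_MIGRATE on by default, this triggers page migration. */
                if (write(fd, "1", 1) < 0)
                        perror("write");
                close(fd);
                return 0;
        }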
  2. 22 June 2021, 1 commit
    • block: Introduce the ioprio rq-qos policy · 556910e3
      Committed by Bart Van Assche
      Introduce an rq-qos policy that assigns an I/O priority to requests based
      on blk-cgroup configuration settings. This policy has the following
      advantages over the ioprio_set() system call:
      - This policy is cgroup based so it has all the advantages of cgroups.
      - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
        controller affects page cache writeback I/O for filesystems that support
        associating a cgroup with writeback I/O. See also
        Documentation/admin-guide/cgroup-v2.rst.
      
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/20210618004456.7280-5-bvanassche@acm.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      556910e3
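      A hedged usage sketch: the rq-qos policy is configured through a blk-cgroup
      attribute rather than per-process ioprio_set(). The attribute name and value
      are taken as assumptions from the cgroup-v2.rst documentation referenced
      above, and the cgroup path is illustrative.

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                /* Clamp all I/O from this cgroup (including writeback) to best-effort. */
                const char *path = "/sys/fs/cgroup/background/io.prio.class";
                const char *class = "restrict-to-be";
                int fd = open(path, O_WRONLY);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                if (write(fd, class, strlen(class)) < 0)
                        perror("write");
                close(fd);
                return 0;
        }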
  3. 10 May 2021, 1 commit
  4. 05 April 2021, 1 commit
  5. 26 February 2021, 1 commit
  6. 25 February 2021, 1 commit
    • mm: memcg: add swapcache stat for memcg v2 · b6038942
      Committed by Shakeel Butt
      This patch adds a swapcache stat for cgroup v2.  The swapcache
      represents the memory that is accounted against both the memory and the
      swap limit of the cgroup.  The main motivation behind exposing the
      swapcache stat is for enabling users to gracefully migrate from cgroup
      v1's memsw counter to cgroup v2's memory and swap counters.
      
      Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
      workload but without control on the exact proportion of memory and swap.
      Cgroup v2 provides separate limits for memory and swap which enables more
      control on the exact usage of memory and swap individually for the
      workload.
      
      With some subtleties, v1's memsw limit can be replaced with the
      sum of v2's memory and swap limits.  However, the alternative for memsw
      usage is not yet available in cgroup v2.  Exposing per-cgroup swapcache
      stat enables that alternative.  Adding the memory usage and swap usage and
      subtracting the swapcache will approximate the memsw usage.  This will
      help in the transparent migration of workloads that depend on the memsw
      usage and limit to v2's memory and swap counters.
      
      The reasons these applications are still interested in this approximate
      memsw usage are: (1) these applications are not really interested in two
      separate memory and swap usage metrics.  A single usage metric is
      simpler for them to use and reason about.
      
      (2) The memsw usage metric hides the underlying system's swap setup from
      the applications.  Applications with multiple instances running in a
      datacenter with heterogeneous systems (some have swap and some don't) will
      keep seeing a consistent view of their usage.
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      
      Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6038942
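      A sketch of the memsw approximation described above, assuming the new stat
      appears as the "swapcached" key in memory.stat and using an illustrative
      cgroup path; both are assumptions to verify against the actual interface.

        #include <stdio.h>
        #include <string.h>

        /* Read a single-value cgroup file such as memory.current. */
        static unsigned long long read_ull(const char *path)
        {
                unsigned long long v = 0;
                FILE *f = fopen(path, "r");

                if (f) {
                        if (fscanf(f, "%llu", &v) != 1)
                                v = 0;
                        fclose(f);
                }
                return v;
        }

        /* Look up one key in a flat-keyed file such as memory.stat. */
        static unsigned long long read_key(const char *path, const char *want)
        {
                char key[64];
                unsigned long long v = 0, found = 0;
                FILE *f = fopen(path, "r");

                if (!f)
                        return 0;
                while (fscanf(f, "%63s %llu", key, &v) == 2)
                        if (!strcmp(key, want))
                                found = v;
                fclose(f);
                return found;
        }

        int main(void)
        {
                unsigned long long mem, swap, swapcache;

                mem = read_ull("/sys/fs/cgroup/myjob/memory.current");
                swap = read_ull("/sys/fs/cgroup/myjob/memory.swap.current");
                swapcache = read_key("/sys/fs/cgroup/myjob/memory.stat", "swapcached");

                /* memsw ~= memory usage + swap usage - swapcache, per the commit. */
                printf("approx memsw: %llu bytes\n", mem + swap - swapcache);
                return 0;
        }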
  7. 22 January 2021, 6 commits
  8. 20 January 2021, 1 commit
  9. 12 January 2021, 1 commit
  10. 16 December 2020, 2 commits
  11. 14 October 2020, 1 commit
  12. 27 September 2020, 1 commit
  13. 13 August 2020, 1 commit
  14. 18 July 2020, 1 commit
    • blk-cgroup: show global disk stats in root cgroup io.stat · ef45fe47
      Committed by Boris Burkov
      In order to improve consistency and usability in cgroup stat accounting,
      we would like to support the root cgroup's io.stat.
      
      Since the root cgroup has processes doing io even if the system has no
      explicitly created cgroups, we need to be careful to avoid overhead in
      that case.  For that reason, the rstat algorithms don't handle the root
      cgroup, so just turning the file on wouldn't give correct statistics.
      
      To get around this, we simulate flushing the iostat struct by filling it
      out directly from global disk stats. The result is a root cgroup io.stat
      file consistent with both /proc/diskstats and io.stat.
      
      Note that in order to collect the disk stats, we needed to iterate over
      devices. To facilitate that, we had to change the linkage of a disk_type
      to external so that it can be used from blk-cgroup.c to iterate over
      disks.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ef45fe47
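      A small sketch that reads the root cgroup's io.stat made available by this
      commit; the "MAJ:MIN rbytes=... wbytes=..." layout follows the documented
      io.stat format, and only the first two counters per device are parsed here.

        #include <stdio.h>

        int main(void)
        {
                char dev[32];
                unsigned long long rbytes, wbytes;
                FILE *f = fopen("/sys/fs/cgroup/io.stat", "r");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                /* One line per device, consistent with /proc/diskstats totals. */
                while (fscanf(f, "%31s rbytes=%llu wbytes=%llu%*[^\n]",
                              dev, &rbytes, &wbytes) == 3)
                        printf("%s read=%llu write=%llu\n", dev, rbytes, wbytes);
                fclose(f);
                return 0;
        }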
  15. 06 July 2020, 2 commits
  16. 26 June 2020, 1 commit
  17. 09 June 2020, 1 commit
  18. 03 June 2020, 2 commits
    • mm/memcg: automatically penalize tasks with high swap use · 4b82ab4f
      Committed by Jakub Kicinski
      Add a memory.swap.high knob, which can be used to protect the system
      from SWAP exhaustion.  The mechanism used for penalizing is similar to
      memory.high penalty (sleep on return to user space).
      
      That is not to say that the knob itself is equivalent to memory.high.
      The objective is more to protect the system from potentially buggy tasks
      consuming a lot of swap and impacting other tasks, or even bringing the
      whole system to a standstill through complete swap exhaustion, hopefully
      without the need to find per-task hard limits.
      
      Slowing misbehaving tasks down gradually allows user space oom killers
      or other protection mechanisms to react.  oomd and earlyoom already do
      killing based on swap exhaustion, and memory.swap.high protection will
      help implement such userspace oom policies more reliably.
      
      We can use one counter for number of pages allocated under pressure to
      save struct task space and avoid two separate hierarchy walks on the hot
      path.  The exact overage is calculated on return to user space, anyway.
      
      Take the new high limit into account when determining if swap is "full".
      Borrowing the explanation from Johannes:
      
        The idea behind "swap full" is that as long as the workload has plenty
        of swap space available and it's not changing its memory contents, it
        makes sense to generously hold on to copies of data in the swap device,
        even after the swapin.  A later reclaim cycle can drop the page without
        any IO.  Trading disk space for IO.
      
        But the only two ways to reclaim a swap slot are when it is faulted
        in and the references go away, or by scanning the virtual address space
        like swapoff does - which is very expensive (one could argue it's too
        expensive even for swapoff, it's often more practical to just reboot).
      
        So at some point in the fill level, we have to start freeing up swap
        slots on fault/swapin.  Otherwise we could eventually run out of swap
        slots while they're filled with copies of data that is also in RAM.
      
        We don't want to OOM a workload because its available swap space is
        filled with redundant cache.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200527195846.102707-5-kuba@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b82ab4f
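      A hedged sketch of using the new knob: set memory.swap.high on a cgroup so
      that tasks exceeding it are slowed down on return to user space, giving a
      userspace OOM handler time to react. The cgroup path and the threshold
      value are illustrative assumptions.

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/sys/fs/cgroup/myjob/memory.swap.high", "w");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                /* Throttle the group once it has pushed roughly 2G out to swap. */
                fprintf(f, "2G\n");
                fclose(f);
                return 0;
        }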
    • mm, memcg: add workingset_restore in memory.stat · a6f5576b
      Committed by Yafang Shao
      There's a new workingset counter introduced in commit 1899ad18 ("mm:
      workingset: tell cache transitions from workingset thrashing").  With
      the help of this counter we can tell whether the workingset is
      transitioning or thrashing.  To bring the benefit of this counter to
      memcg, introduce it into memory.stat so that we can better understand
      the workingset of the workload inside a memcg.
      
      Below is a verification of this new counter in memory.stat.  Read a
      file into memory and then read it again to make these pages
      active.  The size of this file is 1G (memory.max is greater than the
      file size).  The counters in memory.stat will be
      
      	inactive_file 0
      	active_file 1073639424
      
      	workingset_refault 0
      	workingset_activate 0
      	workingset_restore 0
      	workingset_nodereclaim 0
      
      Trigger the memcg reclaim by setting a lower value to memory.high, and
      then some pages will be demoted into inactive list, and then some pages
      in the inactive list will be evicted into the storage.
      
      	inactive_file 498094080
      	active_file 310063104
      
      	workingset_refault 0
      	workingset_activate 0
      	workingset_restore 0
      	workingset_nodereclaim 0
      
      Then restore memory.high and read the file into memory again.  As a
      result, the workingset transition will occur.  Below is the result of
      this transition:
      
      	inactive_file 498094080
      	active_file 575397888
      
      	workingset_refault 64746
      	workingset_activate 64746
      	workingset_restore 64746
      	workingset_nodereclaim 0
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Link: http://lkml.kernel.org/r/20200504153522.11553-1-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6f5576b
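      A sketch of how the new counter might be consumed: compare
      workingset_restore with workingset_refault to tell thrashing of an
      established workingset from ordinary cold refaults. The cgroup path is an
      assumption; the counter names match the example output above.

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                char key[64];
                unsigned long long val, refault = 0, restore = 0;
                FILE *f = fopen("/sys/fs/cgroup/myjob/memory.stat", "r");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                while (fscanf(f, "%63s %llu", key, &val) == 2) {
                        if (!strcmp(key, "workingset_refault"))
                                refault = val;
                        else if (!strcmp(key, "workingset_restore"))
                                restore = val;
                }
                fclose(f);
                /* A restore count close to the refault count suggests thrashing. */
                printf("refault=%llu restore=%llu\n", refault, restore);
                return 0;
        }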
  19. 28 May 2020, 1 commit
    • cgroup: add cpu.stat file to root cgroup · 936f2a70
      Committed by Boris Burkov
      Currently, the root cgroup does not have a cpu.stat file. Add one which
      is consistent with /proc/stat to capture global cpu statistics that
      might not fall under cgroup accounting.
      
      We haven't done this in the past because the data are already presented
      in /proc/stat and we didn't want to add overhead from collecting root
      cgroup stats when cgroups are configured, but no cgroups have been
      created.
      
      By keeping the data consistent with /proc/stat, I think we avoid the
      first problem, while improving the usability of cgroups stats.
      We avoid the second problem by computing the contents of cpu.stat from
      existing data collected for /proc/stat anyway.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      936f2a70
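      A minimal sketch reading the root cpu.stat added by this commit; it simply
      prints the file's flat key/value entries (e.g. usage_usec, user_usec,
      system_usec), which are now populated from the same data as /proc/stat.

        #include <stdio.h>

        int main(void)
        {
                char key[64];
                unsigned long long val;
                FILE *f = fopen("/sys/fs/cgroup/cpu.stat", "r");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                while (fscanf(f, "%63s %llu", key, &val) == 2)
                        printf("%s = %llu\n", key, val);
                fclose(f);
                return 0;
        }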
  20. 03 April 2020, 1 commit
    • mm: memcontrol: recursive memory.low protection · 8a931f80
      Committed by Johannes Weiner
      Right now, the effective protection of any given cgroup is capped by its
      own explicit memory.low setting, regardless of what the parent says.  The
      reasons for this are mostly historical and ease of implementation: to make
      delegation of memory.low safe, effective protection is the min() of all
      memory.low up the tree.
      
      Unfortunately, this limitation makes it impossible to protect an entire
      subtree from another without forcing the user to make explicit protection
      allocations all the way to the leaf cgroups - something that is highly
      undesirable in real life scenarios.
      
      Consider memory in a data center host.  At the cgroup top level, we have a
      distinction between system management software and the actual workload the
      system is executing.  Both branches are further subdivided into individual
      services, job components etc.
      
      We want to protect the workload as a whole from the system management
      software, but that doesn't mean we want to protect and prioritize
      individual workloads with respect to each other.  Their memory demand can vary over
      time, and we'd want the VM to simply cache the hottest data within the
      workload subtree.  Yet, the current memory.low limitations force us to
      allocate a fixed amount of protection to each workload component in order
      to get protection from system management software in general.  This
      results in very inefficient resource distribution.
      
      Another concern with mandating downward allocation is that, as the
      complexity of the cgroup tree grows, it gets harder for the lower levels
      to be informed about decisions made at the host-level.  Consider a
      container inside a namespace that in turn creates its own nested tree of
      cgroups to run multiple workloads.  It'd be extremely difficult to
      configure memory.low parameters in those leaf cgroups that on one hand
      balance pressure among siblings as the container desires, while also
      reflecting the host-level protection from e.g.  rpm upgrades, that lie
      beyond one or more delegation and namespacing points in the tree.
      
      It's highly unusual from a cgroup interface POV that nested levels have to
      be aware of and reflect decisions made at higher levels for them to be
      effective.
      
      To enable such use cases and scale configurability for complex trees, this
      patch implements a resource inheritance model for memory that is similar
      to how the CPU and the IO controller implement work-conserving resource
      allocations: a share of a resource allocated to a subtree always applies to
      the entire subtree recursively, while allowing, but not mandating,
      children to further specify distribution rules.
      
      That means that if protection is explicitly allocated among siblings,
      those configured shares are being followed during page reclaim just like
      they are now.  However, if the memory.low set at a higher level is not
      fully claimed by the children in that subtree, the "floating" remainder is
      applied to each cgroup in the tree in proportion to its size.  Since
      reclaim pressure is applied in proportion to size as well, each child in
      that tree gets the same boost, and the effect is neutral among siblings -
      with respect to each other, they behave as if no memory control was
      enabled at all, and the VM simply balances the memory demands optimally
      within the subtree.  But collectively those cgroups enjoy a boost over the
      cgroups in neighboring trees.
      
      E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
      it's not getting a share of the hierarchically assigned resource, just
      that it doesn't claim a fixed amount of it to protect from its siblings.
      
      This allows us to recursively protect one subtree (workload) from another
      (system management), while letting subgroups compete freely among each
      other - without having to assign fixed shares to each leaf, and without
      nested groups having to echo higher-level settings.
      
      The floating protection composes naturally with fixed protection.
      Consider the following example tree:
      
                      A              A: low = 2G
                     / \             A1: low = 1G
                    A1  A2           A2: low = 0G
      
      As outside pressure is applied to this tree, A1 will enjoy a fixed
      protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
      evenly among A1 and A2, coming out to 1.5G and 0.5G.
      
      There is a slight risk of regressing theoretical setups where the
      top-level cgroups don't know about the true budgeting and set bogusly high
      "bypass" values that are meaningfully allocated down the tree.  Such
      setups would rely on unclaimed protection to be discarded, and
      distributing it would change the intended behavior.  Be safe and hide the
      new behavior behind a mount option, 'memory_recursiveprot'.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a931f80
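      A hedged sketch of opting in to the new behavior: pass the
      'memory_recursiveprot' mount option when mounting cgroup2, then set
      memory.low only on the subtree root. The mount point is an illustrative
      assumption; the option name comes from the commit.

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
                /* Enable recursive memory.low distribution for the whole hierarchy. */
                if (mount("none", "/sys/fs/cgroup", "cgroup2", 0,
                          "memory_recursiveprot") < 0) {
                        perror("mount");
                        return 1;
                }
                /*
                 * From here, a single memory.low on the workload's top-level
                 * cgroup protects the entire subtree; leaf cgroups may stay at 0.
                 */
                return 0;
        }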
  21. 03 March 2020, 5 commits
  22. 17 December 2019, 1 commit
    • mm: hugetlb controller for cgroups v2 · faced7e0
      Committed by Giuseppe Scrivano
      In the effort of supporting cgroups v2 in Kubernetes, I stumbled on
      the lack of a hugetlb controller.
      
      When the controller is enabled, it exposes four new files for each
      hugetlb size on non-root cgroups:
      
      - hugetlb.<hugepagesize>.current
      - hugetlb.<hugepagesize>.max
      - hugetlb.<hugepagesize>.events
      - hugetlb.<hugepagesize>.events.local
      
      The differences with the legacy hierarchy are in the file names and
      using the value "max" instead of "-1" to disable a limit.
      
      The file .limit_in_bytes is renamed to .max.
      
      The file .usage_in_bytes is renamed to .current.
      
      .failcnt is not provided as a single file anymore, but its value can
      be read through the new flat-keyed files .events and .events.local,
      through the "max" key.
      Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      faced7e0
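      A hedged sketch of the v2 interface described above: enable the hugetlb
      controller for children and cap one huge page size. The paths, the "2MB"
      size string, and the byte value are illustrative assumptions.

        #include <stdio.h>

        int main(void)
        {
                FILE *f;

                /* Make the hugetlb controller available to child cgroups. */
                f = fopen("/sys/fs/cgroup/cgroup.subtree_control", "w");
                if (!f) {
                        perror("fopen subtree_control");
                        return 1;
                }
                fprintf(f, "+hugetlb\n");
                fclose(f);

                /* Limit 2MB huge pages to 1 GiB; writing "max" removes the limit. */
                f = fopen("/sys/fs/cgroup/myjob/hugetlb.2MB.max", "w");
                if (!f) {
                        perror("fopen hugetlb max");
                        return 1;
                }
                fprintf(f, "1073741824\n");
                fclose(f);
                return 0;
        }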
  23. 01 December 2019, 1 commit
  24. 18 November 2019, 1 commit
  25. 08 October 2019, 1 commit
    • mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Committed by Chris Down
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: that cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too much
      worry.  Any undesirable behaviour, such as too much or too little reclaim
      pressure on the workload or system will be proportional to how far our
      estimation is off.  This means we can set memory.low much more
      conservatively and thus waste less resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented with stepping further and further down
         memory.low protection on a workload that floats around a 19G workingset
         when under memory.low protection, watching page scan rates for the
         workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (with a kernel containing this
         patch) ramps up page scanning significantly more gradually than the
         control kernel (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9783aa99
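      A sketch of the proportional idea in isolation, not the exact kernel
      expression: once usage exceeds the protection threshold, the scan target
      grows in proportion to the overage instead of jumping from zero to the
      full lruvec size. The helper below is hypothetical and only illustrates
      the shape of the ramp.

        /* Hypothetical helper illustrating proportional memory.{low,min} reclaim. */
        static unsigned long proportional_scan(unsigned long lruvec_size,
                                               unsigned long usage,
                                               unsigned long protection)
        {
                if (usage <= protection)
                        return 0;       /* still within protection: no pressure */
                /* Pressure ramps up smoothly as usage exceeds the protection. */
                return lruvec_size -
                       (unsigned long)((unsigned long long)lruvec_size *
                                       protection / usage);
        }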
  26. 01 October 2019, 1 commit
  27. 03 September 2019, 1 commit
    • sched/uclamp: Extend CPU's cgroup controller · 2480c093
      Committed by Patrick Bellasi
      The cgroup CPU bandwidth controller allows assigning a specified
      (maximum) bandwidth to the tasks of a group. However, this bandwidth is
      defined and enforced only on a temporal basis, without considering the
      actual frequency a CPU is running at. Thus, the amount of computation
      completed by a task within an allocated bandwidth can be very different
      depending on the actual frequency at which the CPU is running that task.
      The amount of computation can also be affected by the specific CPU a
      task is running on, especially when running on asymmetric capacity
      systems like Arm's big.LITTLE.
      
      With the availability of schedutil, the scheduler is now able
      to drive frequency selections based on actual task utilization.
      Moreover, the utilization clamping support provides a mechanism to
      bias the frequency selection operated by schedutil depending on
      constraints assigned to the tasks currently RUNNABLE on a CPU.
      
      Given the mechanisms described above, it is now possible to extend the
      cpu controller to specify the minimum (or maximum) utilization which
      should be considered for tasks RUNNABLE on a cpu.
      This makes it possible to better define the actual computational
      power assigned to task groups, thus improving the cgroup CPU bandwidth
      controller, which is currently based just on time constraints.
      
      Extend the CPU controller with a couple of new attributes, uclamp.{min,max},
      which allow enforcing utilization boosting and capping for all the
      tasks in a group.
      
      Specifically:
      
      - uclamp.min: defines the minimum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run at least at a
      	      minimum frequency which corresponds to the uclamp.min
      	      utilization
      
      - uclamp.max: defines the maximum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run up to a
      	      maximum frequency which corresponds to the uclamp.max
      	      utilization
      
      These attributes:
      
      a) are available only for non-root nodes, both on default and legacy
         hierarchies, while system-wide clamps are defined by a generic
         interface which does not depend on cgroups. This system-wide
         interface enforces constraints on tasks in the root node.
      
      b) enforce effective constraints at each level of the hierarchy which
         are a restriction of the group requests considering its parent's
         effective constraints. Root group effective constraints are defined
         by the system wide interface.
         This mechanism allows each (non-root) level of the hierarchy to:
         - request whatever clamp values it would like to get
         - effectively get only up to the maximum amount allowed by its parent
      
      c) have higher priority than task-specific clamps, defined via
         sched_setattr(), thus allowing to control and restrict task requests.
      
      Add two new attributes to the cpu controller to collect "requested"
      clamp values. Allow that at each non-root level of the hierarchy.
      Keep it simple by not caring now about "effective" values computation
      and propagation along the hierarchy.
      
      Update sysctl_sched_uclamp_handler() to use the newly introduced
      uclamp_mutex so that we serialize system default updates with
      cgroup-related updates.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2480c093
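      A hedged usage sketch of the new attributes: cpu.uclamp.min and
      cpu.uclamp.max take a utilization percentage (or "max"). The cgroup path
      and the 20% value below are illustrative assumptions.

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/sys/fs/cgroup/myjob/cpu.uclamp.min", "w");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                /*
                 * Boost: RUNNABLE tasks in this group are treated as at least
                 * ~20% utilized, biasing schedutil toward a higher frequency.
                 */
                fprintf(f, "20\n");
                fclose(f);
                return 0;
        }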
  28. 29 August 2019, 1 commit