1. 13 August 2021, 1 commit
    • cgroup/cpuset: Enable memory migration for cpuset v2 · ee9707e8
      Committed by Waiman Long
      When a user changes cpuset.cpus, each task in a v2 cpuset will be moved
      to one of the new cpus if it is not there already. For memory, however,
      they won't be migrated to the new nodes when cpuset.mems changes. This is
      an inconsistency in behavior.
      
      In cpuset v1, there is a memory_migrate control file to enable such
      behavior by setting the CS_MEMORY_MIGRATE flag. Make it the default
      for cpuset v2 so that we have a consistent set of behavior for both
      cpus and memory.
      
      There is certainly a cost to making memory migration the default, but it
      is a one-time cost that shouldn't really matter as long as cpuset.mems
      isn't changed frequently.  Update the cgroup-v2.rst file to document the
      new behavior and recommend against changing cpuset.mems frequently.
      
      Since there won't be any concurrent access to the newly allocated cpuset
      structure in cpuset_css_alloc(), we can use the cheaper non-atomic
      __set_bit() instead of the more expensive atomic set_bit().
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ee9707e8
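      A minimal userspace sketch of the behavior this commit makes the default:
      writing a new node list to a v2 cpuset's cpuset.mems now migrates the
      member tasks' pages as well. The cgroup path and node number below are
      illustrative assumptions, not part of the commit.

        /* Move cpuset "myjob" to NUMA node 1; its tasks' memory follows. */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                const char *path = "/sys/fs/cgroup/myjob/cpuset.mems";
                int fd = open(path, O_WRONLY);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                /* With CS_MEMORY_MIGRATE on by default, this triggers page migration. */
                if (write(fd, "1", 1) < 0)
                        perror("write");
                close(fd);
                return 0;
        }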
  2. 22 June 2021, 1 commit
    • block: Introduce the ioprio rq-qos policy · 556910e3
      Committed by Bart Van Assche
      Introduce an rq-qos policy that assigns an I/O priority to requests based
      on blk-cgroup configuration settings. This policy has the following
      advantages over the ioprio_set() system call:
      - This policy is cgroup based so it has all the advantages of cgroups.
      - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
        controller affects page cache writeback I/O for filesystems that support
        associating a cgroup with writeback I/O. See also
        Documentation/admin-guide/cgroup-v2.rst.
      
      Cc: Damien Le Moal <damien.lemoal@wdc.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/20210618004456.7280-5-bvanassche@acm.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      556910e3
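      A hedged usage sketch: the rq-qos policy is configured through a blk-cgroup
      attribute rather than per-process ioprio_set(). The attribute name and value
      are taken as assumptions from the cgroup-v2.rst documentation referenced
      above, and the cgroup path is illustrative.

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                /* Clamp all I/O from this cgroup (including writeback) to best-effort. */
                const char *path = "/sys/fs/cgroup/background/io.prio.class";
                const char *class = "restrict-to-be";
                int fd = open(path, O_WRONLY);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                if (write(fd, class, strlen(class)) < 0)
                        perror("write");
                close(fd);
                return 0;
        }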
  3. 10 May 2021, 1 commit
  4. 05 April 2021, 1 commit
  5. 26 February 2021, 1 commit
  6. 25 February 2021, 1 commit
    • mm: memcg: add swapcache stat for memcg v2 · b6038942
      Committed by Shakeel Butt
      This patch adds a swapcache stat for cgroup v2.  The swapcache
      represents the memory that is accounted against both the memory and the
      swap limit of the cgroup.  The main motivation behind exposing the
      swapcache stat is for enabling users to gracefully migrate from cgroup
      v1's memsw counter to cgroup v2's memory and swap counters.
      
      Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
      workload but without control on the exact proportion of memory and swap.
      Cgroup v2 provides separate limits for memory and swap which enables more
      control on the exact usage of memory and swap individually for the
      workload.
      
      With some subtleties, v1's memsw limit can be replaced with the
      sum of v2's memory and swap limits.  However, the alternative for memsw
      usage is not yet available in cgroup v2.  Exposing per-cgroup swapcache
      stat enables that alternative.  Adding the memory usage and swap usage and
      subtracting the swapcache will approximate the memsw usage.  This will
      help in the transparent migration of workloads that depend on the memsw
      usage and limit to v2's memory and swap counters.
      
      The reasons these applications are still interested in this approximate
      memsw usage are: (1) these applications are not really interested in two
      separate memory and swap usage metrics.  A single usage metric is
      simpler for them to use and reason about.
      
      (2) The memsw usage metric hides the underlying system's swap setup from
      the applications.  Applications with multiple instances running in a
      datacenter with heterogeneous systems (some have swap and some don't) will
      keep seeing a consistent view of their usage.
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      
      Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6038942
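      A sketch of the memsw approximation described above, assuming the new stat
      appears as the "swapcached" key in memory.stat and using an illustrative
      cgroup path; both are assumptions to verify against the actual interface.

        #include <stdio.h>
        #include <string.h>

        /* Read a single-value cgroup file such as memory.current. */
        static unsigned long long read_ull(const char *path)
        {
                unsigned long long v = 0;
                FILE *f = fopen(path, "r");

                if (f) {
                        if (fscanf(f, "%llu", &v) != 1)
                                v = 0;
                        fclose(f);
                }
                return v;
        }

        /* Look up one key in a flat-keyed file such as memory.stat. */
        static unsigned long long read_key(const char *path, const char *want)
        {
                char key[64];
                unsigned long long v = 0, found = 0;
                FILE *f = fopen(path, "r");

                if (!f)
                        return 0;
                while (fscanf(f, "%63s %llu", key, &v) == 2)
                        if (!strcmp(key, want))
                                found = v;
                fclose(f);
                return found;
        }

        int main(void)
        {
                unsigned long long mem, swap, swapcache;

                mem = read_ull("/sys/fs/cgroup/myjob/memory.current");
                swap = read_ull("/sys/fs/cgroup/myjob/memory.swap.current");
                swapcache = read_key("/sys/fs/cgroup/myjob/memory.stat", "swapcached");

                /* memsw ~= memory usage + swap usage - swapcache, per the commit. */
                printf("approx memsw: %llu bytes\n", mem + swap - swapcache);
                return 0;
        }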
  7. 22 January 2021, 6 commits
  8. 20 January 2021, 1 commit
  9. 12 January 2021, 1 commit
  10. 16 December 2020, 2 commits
  11. 14 October 2020, 1 commit
  12. 27 September 2020, 1 commit
  13. 13 August 2020, 1 commit
  14. 18 July 2020, 1 commit
    • blk-cgroup: show global disk stats in root cgroup io.stat · ef45fe47
      Committed by Boris Burkov
      In order to improve consistency and usability in cgroup stat accounting,
      we would like to support the root cgroup's io.stat.
      
      Since the root cgroup has processes doing io even if the system has no
      explicitly created cgroups, we need to be careful to avoid overhead in
      that case.  For that reason, the rstat algorithms don't handle the root
      cgroup, so just turning the file on wouldn't give correct statistics.
      
      To get around this, we simulate flushing the iostat struct by filling it
      out directly from global disk stats. The result is a root cgroup io.stat
      file consistent with both /proc/diskstats and io.stat.
      
      Note that in order to collect the disk stats, we needed to iterate over
      devices. To facilitate that, we had to change the linkage of a disk_type
      to external so that it can be used from blk-cgroup.c to iterate over
      disks.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ef45fe47
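      A small sketch that reads the root cgroup's io.stat made available by this
      commit; the "MAJ:MIN rbytes=... wbytes=..." layout follows the documented
      io.stat format, and only the first two counters per device are parsed here.

        #include <stdio.h>

        int main(void)
        {
                char dev[32];
                unsigned long long rbytes, wbytes;
                FILE *f = fopen("/sys/fs/cgroup/io.stat", "r");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                /* One line per device, consistent with /proc/diskstats totals. */
                while (fscanf(f, "%31s rbytes=%llu wbytes=%llu%*[^\n]",
                              dev, &rbytes, &wbytes) == 3)
                        printf("%s read=%llu write=%llu\n", dev, rbytes, wbytes);
                fclose(f);
                return 0;
        }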
  15. 06 July 2020, 2 commits
  16. 26 June 2020, 1 commit
  17. 09 June 2020, 1 commit
  18. 03 June 2020, 2 commits
    • mm/memcg: automatically penalize tasks with high swap use · 4b82ab4f
      Committed by Jakub Kicinski
      Add a memory.swap.high knob, which can be used to protect the system
      from SWAP exhaustion.  The mechanism used for penalizing is similar to
      memory.high penalty (sleep on return to user space).
      
      That is not to say that the knob itself is equivalent to memory.high.
      The objective is more to protect the system from potentially buggy tasks
      consuming a lot of swap and impacting other tasks, or even bringing the
      whole system to a standstill through complete swap exhaustion, hopefully
      without the need to find per-task hard limits.
      
      Slowing misbehaving tasks down gradually allows user space oom killers
      or other protection mechanisms to react.  oomd and earlyoom already do
      killing based on swap exhaustion, and memory.swap.high protection will
      help implement such userspace oom policies more reliably.
      
      We can use one counter for number of pages allocated under pressure to
      save struct task space and avoid two separate hierarchy walks on the hot
      path.  The exact overage is calculated on return to user space, anyway.
      
      Take the new high limit into account when determining if swap is "full".
      Borrowing the explanation from Johannes:
      
        The idea behind "swap full" is that as long as the workload has plenty
        of swap space available and it's not changing its memory contents, it
        makes sense to generously hold on to copies of data in the swap device,
        even after the swapin.  A later reclaim cycle can drop the page without
        any IO.  Trading disk space for IO.
      
        But the only two ways to reclaim a swap slot are when it is faulted
        in and the references go away, or by scanning the virtual address space
        like swapoff does - which is very expensive (one could argue it's too
        expensive even for swapoff, it's often more practical to just reboot).
      
        So at some point in the fill level, we have to start freeing up swap
        slots on fault/swapin.  Otherwise we could eventually run out of swap
        slots while they're filled with copies of data that is also in RAM.
      
        We don't want to OOM a workload because its available swap space is
        filled with redundant cache.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200527195846.102707-5-kuba@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b82ab4f
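      A hedged sketch of using the new knob: set memory.swap.high on a cgroup so
      that tasks exceeding it are slowed down on return to user space, giving a
      userspace OOM handler time to react. The cgroup path and the threshold
      value are illustrative assumptions.

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/sys/fs/cgroup/myjob/memory.swap.high", "w");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                /* Throttle the group once it has pushed roughly 2G out to swap. */
                fprintf(f, "2G\n");
                fclose(f);
                return 0;
        }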
    • mm, memcg: add workingset_restore in memory.stat · a6f5576b
      Committed by Yafang Shao
      There's a new workingset counter introduced in commit 1899ad18 ("mm:
      workingset: tell cache transitions from workingset thrashing").  With
      the help of this counter we can tell whether the workingset is
      transitioning or thrashing.  To bring the benefit of this counter to
      memcg, introduce it into memory.stat so that we can better understand
      the workingset of the workload inside a memcg.
      
      Below is a verification of this new counter in memory.stat.  Read a
      file into memory and then read it again to make these pages
      active.  The size of this file is 1G (memory.max is greater than the
      file size).  The counters in memory.stat will be
      
      	inactive_file 0
      	active_file 1073639424
      
      	workingset_refault 0
      	workingset_activate 0
      	workingset_restore 0
      	workingset_nodereclaim 0
      
      Trigger the memcg reclaim by setting a lower value to memory.high, and
      then some pages will be demoted into inactive list, and then some pages
      in the inactive list will be evicted into the storage.
      
      	inactive_file 498094080
      	active_file 310063104
      
      	workingset_refault 0
      	workingset_activate 0
      	workingset_restore 0
      	workingset_nodereclaim 0
      
      Then restore memory.high and read the file into memory again.  As a
      result, the workingset transition will occur.  Below is the result of
      this transition:
      
      	inactive_file 498094080
      	active_file 575397888
      
      	workingset_refault 64746
      	workingset_activate 64746
      	workingset_restore 64746
      	workingset_nodereclaim 0
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Link: http://lkml.kernel.org/r/20200504153522.11553-1-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6f5576b
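      A sketch of how the new counter might be consumed: compare
      workingset_restore with workingset_refault to tell thrashing of an
      established workingset from ordinary cold refaults. The cgroup path is an
      assumption; the counter names match the example output above.

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                char key[64];
                unsigned long long val, refault = 0, restore = 0;
                FILE *f = fopen("/sys/fs/cgroup/myjob/memory.stat", "r");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                while (fscanf(f, "%63s %llu", key, &val) == 2) {
                        if (!strcmp(key, "workingset_refault"))
                                refault = val;
                        else if (!strcmp(key, "workingset_restore"))
                                restore = val;
                }
                fclose(f);
                /* A restore count close to the refault count suggests thrashing. */
                printf("refault=%llu restore=%llu\n", refault, restore);
                return 0;
        }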
  19. 28 May 2020, 1 commit
    • cgroup: add cpu.stat file to root cgroup · 936f2a70
      Committed by Boris Burkov
      Currently, the root cgroup does not have a cpu.stat file. Add one which
      is consistent with /proc/stat to capture global cpu statistics that
      might not fall under cgroup accounting.
      
      We haven't done this in the past because the data are already presented
      in /proc/stat and we didn't want to add overhead from collecting root
      cgroup stats when cgroups are configured, but no cgroups have been
      created.
      
      By keeping the data consistent with /proc/stat, I think we avoid the
      first problem, while improving the usability of cgroups stats.
      We avoid the second problem by computing the contents of cpu.stat from
      existing data collected for /proc/stat anyway.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      936f2a70
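      A minimal sketch reading the root cpu.stat added by this commit; it simply
      prints the file's flat key/value entries (e.g. usage_usec, user_usec,
      system_usec), which are now populated from the same data as /proc/stat.

        #include <stdio.h>

        int main(void)
        {
                char key[64];
                unsigned long long val;
                FILE *f = fopen("/sys/fs/cgroup/cpu.stat", "r");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                while (fscanf(f, "%63s %llu", key, &val) == 2)
                        printf("%s = %llu\n", key, val);
                fclose(f);
                return 0;
        }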
  20. 03 April 2020, 1 commit
    • mm: memcontrol: recursive memory.low protection · 8a931f80
      Committed by Johannes Weiner
      Right now, the effective protection of any given cgroup is capped by its
      own explicit memory.low setting, regardless of what the parent says.  The
      reasons for this are mostly historical and ease of implementation: to make
      delegation of memory.low safe, effective protection is the min() of all
      memory.low up the tree.
      
      Unfortunately, this limitation makes it impossible to protect an entire
      subtree from another without forcing the user to make explicit protection
      allocations all the way to the leaf cgroups - something that is highly
      undesirable in real life scenarios.
      
      Consider memory in a data center host.  At the cgroup top level, we have a
      distinction between system management software and the actual workload the
      system is executing.  Both branches are further subdivided into individual
      services, job components etc.
      
      We want to protect the workload as a whole from the system management
      software, but that doesn't mean we want to protect and prioritize
      individual workloads with respect to each other.  Their memory demand can vary over
      time, and we'd want the VM to simply cache the hottest data within the
      workload subtree.  Yet, the current memory.low limitations force us to
      allocate a fixed amount of protection to each workload component in order
      to get protection from system management software in general.  This
      results in very inefficient resource distribution.
      
      Another concern with mandating downward allocation is that, as the
      complexity of the cgroup tree grows, it gets harder for the lower levels
      to be informed about decisions made at the host-level.  Consider a
      container inside a namespace that in turn creates its own nested tree of
      cgroups to run multiple workloads.  It'd be extremely difficult to
      configure memory.low parameters in those leaf cgroups that on one hand
      balance pressure among siblings as the container desires, while also
      reflecting the host-level protection from e.g.  rpm upgrades, that lie
      beyond one or more delegation and namespacing points in the tree.
      
      It's highly unusual from a cgroup interface POV that nested levels have to
      be aware of and reflect decisions made at higher levels for them to be
      effective.
      
      To enable such use cases and scale configurability for complex trees, this
      patch implements a resource inheritance model for memory that is similar
      to how the CPU and the IO controller implement work-conserving resource
      allocations: a share of a resource allocated to a subtree always applies to
      the entire subtree recursively, while allowing, but not mandating,
      children to further specify distribution rules.
      
      That means that if protection is explicitly allocated among siblings,
      those configured shares are being followed during page reclaim just like
      they are now.  However, if the memory.low set at a higher level is not
      fully claimed by the children in that subtree, the "floating" remainder is
      applied to each cgroup in the tree in proportion to its size.  Since
      reclaim pressure is applied in proportion to size as well, each child in
      that tree gets the same boost, and the effect is neutral among siblings -
      with respect to each other, they behave as if no memory control was
      enabled at all, and the VM simply balances the memory demands optimally
      within the subtree.  But collectively those cgroups enjoy a boost over the
      cgroups in neighboring trees.
      
      E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
      it's not getting a share of the hierarchically assigned resource, just
      that it doesn't claim a fixed amount of it to protect from its siblings.
      
      This allows us to recursively protect one subtree (workload) from another
      (system management), while letting subgroups compete freely among each
      other - without having to assign fixed shares to each leaf, and without
      nested groups having to echo higher-level settings.
      
      The floating protection composes naturally with fixed protection.
      Consider the following example tree:
      
                      A              A: low = 2G
                     / \             A1: low = 1G
                    A1  A2           A2: low = 0G
      
      As outside pressure is applied to this tree, A1 will enjoy a fixed
      protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
      evenly among A1 and A2, coming out to 1.5G and 0.5G.
      
      There is a slight risk of regressing theoretical setups where the
      top-level cgroups don't know about the true budgeting and set bogusly high
      "bypass" values that are meaningfully allocated down the tree.  Such
      setups would rely on unclaimed protection to be discarded, and
      distributing it would change the intended behavior.  Be safe and hide the
      new behavior behind a mount option, 'memory_recursiveprot'.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a931f80
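      A hedged sketch of opting in to the new behavior: pass the
      'memory_recursiveprot' mount option when mounting cgroup2, then set
      memory.low only on the subtree root. The mount point is an illustrative
      assumption; the option name comes from the commit.

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
                /* Enable recursive memory.low distribution for the whole hierarchy. */
                if (mount("none", "/sys/fs/cgroup", "cgroup2", 0,
                          "memory_recursiveprot") < 0) {
                        perror("mount");
                        return 1;
                }
                /*
                 * From here, a single memory.low on the workload's top-level
                 * cgroup protects the entire subtree; leaf cgroups may stay at 0.
                 */
                return 0;
        }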
  21. 03 March 2020, 5 commits
  22. 17 December 2019, 1 commit
    • mm: hugetlb controller for cgroups v2 · faced7e0
      Committed by Giuseppe Scrivano
      In the effort of supporting cgroups v2 in Kubernetes, I stumbled on
      the lack of a hugetlb controller.
      
      When the controller is enabled, it exposes four new files for each
      hugetlb size on non-root cgroups:
      
      - hugetlb.<hugepagesize>.current
      - hugetlb.<hugepagesize>.max
      - hugetlb.<hugepagesize>.events
      - hugetlb.<hugepagesize>.events.local
      
      The differences with the legacy hierarchy are in the file names and
      using the value "max" instead of "-1" to disable a limit.
      
      The file .limit_in_bytes is renamed to .max.
      
      The file .usage_in_bytes is renamed to .current.
      
      .failcnt is not provided as a single file anymore, but its value can
      be read through the new flat-keyed files .events and .events.local,
      through the "max" key.
      Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      faced7e0
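      A hedged sketch of the v2 interface described above: enable the hugetlb
      controller for children and cap one huge page size. The paths, the "2MB"
      size string, and the byte value are illustrative assumptions.

        #include <stdio.h>

        int main(void)
        {
                FILE *f;

                /* Make the hugetlb controller available to child cgroups. */
                f = fopen("/sys/fs/cgroup/cgroup.subtree_control", "w");
                if (!f) {
                        perror("fopen subtree_control");
                        return 1;
                }
                fprintf(f, "+hugetlb\n");
                fclose(f);

                /* Limit 2MB huge pages to 1 GiB; writing "max" removes the limit. */
                f = fopen("/sys/fs/cgroup/myjob/hugetlb.2MB.max", "w");
                if (!f) {
                        perror("fopen hugetlb max");
                        return 1;
                }
                fprintf(f, "1073741824\n");
                fclose(f);
                return 0;
        }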
  23. 01 December 2019, 1 commit
  24. 18 November 2019, 1 commit
  25. 08 October 2019, 1 commit
    • mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Committed by Chris Down
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: that cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too much
      worry.  Any undesirable behaviour, such as too much or too little reclaim
      pressure on the workload or system will be proportional to how far our
      estimation is off.  This means we can set memory.low much more
      conservatively and thus waste less resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented with stepping further and further down
         memory.low protection on a workload that floats around a 19G workingset
         when under memory.low protection, watching page scan rates for the
         workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (with a kernel containing this
         patch) ramps up page scanning significantly more gradually than the
         control kernel (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9783aa99
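      A sketch of the proportional idea in isolation, not the exact kernel
      expression: once usage exceeds the protection threshold, the scan target
      grows in proportion to the overage instead of jumping from zero to the
      full lruvec size. The helper below is hypothetical and only illustrates
      the shape of the ramp.

        /* Hypothetical helper illustrating proportional memory.{low,min} reclaim. */
        static unsigned long proportional_scan(unsigned long lruvec_size,
                                               unsigned long usage,
                                               unsigned long protection)
        {
                if (usage <= protection)
                        return 0;       /* still within protection: no pressure */
                /* Pressure ramps up smoothly as usage exceeds the protection. */
                return lruvec_size -
                       (unsigned long)((unsigned long long)lruvec_size *
                                       protection / usage);
        }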
  26. 01 October 2019, 1 commit
  27. 03 September 2019, 1 commit
    • sched/uclamp: Extend CPU's cgroup controller · 2480c093
      Committed by Patrick Bellasi
      The cgroup CPU bandwidth controller allows assigning a specified
      (maximum) bandwidth to the tasks of a group. However, this bandwidth is
      defined and enforced only on a temporal basis, without considering the
      actual frequency a CPU is running at. Thus, the amount of computation
      completed by a task within an allocated bandwidth can be very different
      depending on the actual frequency at which the CPU is running that task.
      The amount of computation can also be affected by the specific CPU a
      task is running on, especially when running on asymmetric capacity
      systems like Arm's big.LITTLE.
      
      With the availability of schedutil, the scheduler is now able
      to drive frequency selections based on actual task utilization.
      Moreover, the utilization clamping support provides a mechanism to
      bias the frequency selection operated by schedutil depending on
      constraints assigned to the tasks currently RUNNABLE on a CPU.
      
      Given the mechanisms described above, it is now possible to extend the
      cpu controller to specify the minimum (or maximum) utilization which
      should be considered for tasks RUNNABLE on a cpu.
      This makes it possible to better define the actual computational
      power assigned to task groups, thus improving the cgroup CPU bandwidth
      controller, which is currently based just on time constraints.
      
      Extend the CPU controller with a couple of new attributes, uclamp.{min,max},
      which allow enforcing utilization boosting and capping for all the
      tasks in a group.
      
      Specifically:
      
      - uclamp.min: defines the minimum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run at least at a
      	      minimum frequency which corresponds to the uclamp.min
      	      utilization
      
      - uclamp.max: defines the maximum utilization which should be considered
      	      i.e. the RUNNABLE tasks of this group will run up to a
      	      maximum frequency which corresponds to the uclamp.max
      	      utilization
      
      These attributes:
      
      a) are available only for non-root nodes, both on default and legacy
         hierarchies, while system-wide clamps are defined by a generic
         interface which does not depend on cgroups. This system-wide
         interface enforces constraints on tasks in the root node.
      
      b) enforce effective constraints at each level of the hierarchy which
         are a restriction of the group requests considering its parent's
         effective constraints. Root group effective constraints are defined
         by the system wide interface.
         This mechanism allows each (non-root) level of the hierarchy to:
         - request whatever clamp values it would like to get
         - effectively get only up to the maximum amount allowed by its parent
      
      c) have higher priority than task-specific clamps, defined via
         sched_setattr(), thus allowing to control and restrict task requests.
      
      Add two new attributes to the cpu controller to collect "requested"
      clamp values. Allow that at each non-root level of the hierarchy.
      Keep it simple by not caring now about "effective" values computation
      and propagation along the hierarchy.
      
      Update sysctl_sched_uclamp_handler() to use the newly introduced
      uclamp_mutex so that we serialize system default updates with
      cgroup-related updates.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutny <mkoutny@suse.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2480c093
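      A hedged usage sketch of the new attributes: cpu.uclamp.min and
      cpu.uclamp.max take a utilization percentage (or "max"). The cgroup path
      and the 20% value below are illustrative assumptions.

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/sys/fs/cgroup/myjob/cpu.uclamp.min", "w");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                /*
                 * Boost: RUNNABLE tasks in this group are treated as at least
                 * ~20% utilized, biasing schedutil toward a higher frequency.
                 */
                fprintf(f, "20\n");
                fclose(f);
                return 0;
        }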
  28. 29 August 2019, 1 commit