提交 · 055ed63be9b6cf1119cc475fd78d18a263117692 · openanolis / cloud-kernel

17 4月, 2020 1 次提交

alinux: mm, memcg: rework memsli interfaces · 055ed63b

由 Xu Yu 提交于 1月 13, 2020

to #26424368

This reworks memsli "start", "end", "update" interfaces to make it more
clear and symmetrical, by merging "update" action into "end", just like
what psi_memstall_{enter, leave} does.

Now the latency probe pattern of memsli is as follows:

memcg_lat_stat_start(&start);
/* kernel codes being probed */
memcg_lat_stat_end(MEM_LAT_XXX, start);

This also formats the codes and fixes the warning(s) produced when
CONFIG_MEMSLI is not set.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>

055ed63b

16 4月, 2020 10 次提交

alinux: mm, memcg: add kconfig MEMSLI · 9e58d704

由 Xu Yu 提交于 1月 08, 2020

to #26424368

This introduces the new bool kconfig MEMSLI, determining whether the
memsli (memory latency histogram) feature should be built-in or not.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

9e58d704

alinux: mm, memcg: add memsli procfs switch interface · 892970b7

由 Xu Yu 提交于 12月 25, 2019

to #26424368

Since memsli also records latency histogram for swapout and swapin,
which are NOT in the slow memory path, the overhead of memsli could
be nonnegligible in some specific scenarios.

For example, in scenarios with frequent swapping out and in, memsli
could introduce overhead of ~1% of total run time of the synthetic
testcase.

This adds procfs interface for memsli switch. The memsli feature is
enabled by default, and you can now disable it by:

$ echo 0 > /proc/memsli/enabled

Apparently, you can check current memsli switch status by:

$ cat /proc/memsli/enabled

Note that disabling memsli at runtime will NOT clear the existing
latency histogram. You still need to manually reset the specified
latency histogram(s) by echo 0 into the corresponding cgroup control
file(s).
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

892970b7

alinux: mm, memcg: gather memsli/exstat from all possible cpus · 77663a9d

由 Xu Yu 提交于 12月 30, 2019

to #26424368

CPU hotplug may occur in some business scenarios, resulting in
unavailable per-cpu memsli/exstat data on those non-present or
offline CPU(s).

This fixes the potential problem by using for_each_possible_cpu
macro when gathering per-cpu memsli/exstat data.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>

77663a9d

alinux: mm, memcg: account throttle over memory.high for memcg direct reclaim · 6c4e1cc3

由 Xu Yu 提交于 11月 12, 2019

to #26424368

Commit 6202ab24 ("mm, memcg: throttle
allocators when failing reclaim over memory.high") introduces explicit
throttling when reclaim is failing to keep memcg size contained at the
memory.high setting.

Just account this latency on memcg direct reclaim latency histogram.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>

6c4e1cc3

alinux: mm, memcg: record latency of swapout and swapin in every memcg · ddfd4d5e

由 Xu Yu 提交于 11月 04, 2019

to #26424368

Probe and calculate the latency of global swapout, memcg swapout and
swapin respectively, and then group into the latency histogram in struct
mem_cgroup.

Note that the latency in each memcg is aggregated from all child memcgs.

Usage:

$ cat memory.direct_swapout_global_latency
0-1ms:  98313
1-5ms:  0
5-10ms:         0
10-100ms:       0
100-500ms:      0
500-1000ms:     0
>=1000ms:       0
total(ms):      52

Each line is the count of global swapout within the appropriate latency
range.

To clear the latency histogram:

$ echo 0 > memory.direct_swapout_global_latency
$ cat memory.direct_swapout_global_latency
0-1ms:  0
1-5ms:  0
5-10ms:         0
10-100ms:       0
100-500ms:      0
500-1000ms:     0
>=1000ms:       0
total(ms):      0

The usage of memory.direct_swapout_memcg_latency and
memory.direct_swapin_latency is the same as
memory.direct_swapout_global_latency.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

ddfd4d5e

alinux: mm, memcg: account reclaim_high for memcg direct reclaim · 39071079

由 Xu Yu 提交于 11月 12, 2019

to #26424368

We account reclaim_high in mem_cgroup_handle_over_high into memcg direct
reclaim latency histogram, due to possible future use of memory.high.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

39071079

alinux: mm, memcg: adjust the latency probe point for memcg direct reclaim · fe673ccf

由 Xu Yu 提交于 11月 04, 2019

to #26424368

Since there are features other than memcg direct reclaim which also
invoke try_to_free_mem_cgroup_pages, such as zombie memcg reaper, memcg
kswapd, etc,.  Move the latency probe point for memcg direct reclaim
from function try_to_free_mem_cgroup_pages to function try_charge, in
order to distinguish memcg direct reclaim.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

fe673ccf

alinux: mm, memcg: rework memory latency histogram interfaces · 837e53ab

由 Xu Yu 提交于 11月 04, 2019

to #26424368

There are some duplicate codes in the original implementation of memory
latency histogram, such as {x, y, z}_show, and {x, y, z}_write, where x,
y, z represents various types of memory latency.

This reworks common codes of memory latency histogram to make it easier
to add more types of memory latency later.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

837e53ab

alinux: mm, memcg: record latency of direct compact in every memcg · 4bec5cfe

由 Xu Yu 提交于 10月 31, 2019

to #26424368

Probe and calculate the latency of direct compact, and then group into
the latency histogram in struct mem_cgroup.

Note that the latency in each memcg is aggregated from all child memcgs.

Usage:

$ cat memory.direct_compact_latency
0-1ms:  1176
1-5ms:  259
5-10ms:         17
10-100ms:       10
100-500ms:      0
500-1000ms:     0
>=1000ms:       0
total(ms):      921

Each line is the count of direct compact within the appropriate latency
range.

To clear the latency histogram:

$ echo 0 > memory.direct_compact_latency
$ cat memory.direct_compact_latency
0-1ms:  0
1-5ms:  0
5-10ms:         0
10-100ms:       0
100-500ms:      0
500-1000ms:     0
>=1000ms:       0
total(ms):      0
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

4bec5cfe

alinux: mm, memcg: record latency of direct reclaim in every memcg · 83058e75

由 Xu Yu 提交于 10月 17, 2019

to #26424368

Probe and calculate the latency of global direct reclaim and memcg
direct reclaim, respectively, and then group into the latency histogram
in struct mem_cgroup. Besides, the total latency is accumulated each
time the histogram is updated.

Note that the latency in each memcg is aggregated from all child memcgs.

Usage:

$ cat memory.direct_reclaim_global_latency
0-1ms:  228
1-5ms:  283
5-10ms:         0
10-100ms:       0
100-500ms:      0
500-1000ms:     0
>=1000ms:       0
total(ms):      539

Each line is the count of global direct reclaim within the appropriate
latency range.

To clear the latency histogram:

$ echo 0 > memory.direct_reclaim_global_latency
$ cat memory.direct_reclaim_global_latency
0-1ms:  0
1-5ms:  0
5-10ms:         0
10-100ms:       0
100-500ms:      0
500-1000ms:     0
>=1000ms:       0
total(ms):      0

The usage of memory.direct_reclaim_memcg_latency is the same as
memory.direct_reclaim_global_latency.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

83058e75

13 4月, 2020 7 次提交

mm/page_reporting: add budget limit on how many pages can be reported per pass · d2932bdb

由 Alexander Duyck 提交于 2月 15, 2020

to #26589565

In order to keep ourselves from reporting pages that are just going to be
reused again in the case of heavy churn we can put a limit on how many
total pages we will process per pass.  Doing this will allow the worker
thread to go into idle much more quickly so that we avoid competing with
other threads that might be allocating or freeing pages.

The logic added here will limit the worker thread to no more than one
sixteenth of the total free pages in a given area per list.  Once that
limit is reached it will update the state so that at the end of the pass
we will reschedule the worker to try again in 2 seconds when the memory
churn has hopefully settled down.

Again this optimization doesn't show much of a benefit in the standard
case as the memory churn is minmal.  However with page allocator shuffling
enabled the gain is quite noticeable.  Below are the results with a THP
enabled version of the will-it-scale page_fault1 test showing the
improvement in iterations for 16 processes or threads.

Without:
tasks   processes       processes_idle  threads         threads_idle
16      8283274.75      0.17            5594261.00      38.15

With:
tasks   processes       processes_idle  threads         threads_idle
16      8767010.50      0.21            5791312.75      36.98

Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>

d2932bdb

mm/page_reporting: rotate reported pages to the tail of the list · 05c415bc

由 Alexander Duyck 提交于 2月 15, 2020

to #26589565

Rather than walking over the same pages again and again to get to the
pages that have yet to be reported we can save ourselves a significant
amount of time by simply rotating the list so that when we have a full
list of reported pages the head of the list is pointing to the next
non-reported page.  Doing this should save us some significant time when
processing each free list.

This doesn't gain us much in the standard case as all of the non-reported
pages should be near the top of the list already.  However in the case of
page shuffling this results in a noticeable improvement.  Below are the
will-it-scale page_fault1 w/ THP numbers for 16 tasks with and without
this patch.

Without:
tasks   processes       processes_idle  threads         threads_idle
16      8093776.25      0.17            5393242.00      38.20

With:
tasks   processes       processes_idle  threads         threads_idle
16      8283274.75      0.17            5594261.00      38.15

Link: http://lkml.kernel.org/r/20200211224708.29318.16862.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>

05c415bc

mm: introduce Reported pages · 2b197730

由 Alexander Duyck 提交于 2月 22, 2020

to #26589565

In order to pave the way for free page reporting in virtualized
environments we will need a way to get pages out of the free lists and
identify those pages after they have been returned.  To accomplish this,
this patch adds the concept of a Reported Buddy, which is essentially
meant to just be the Uptodate flag used in conjunction with the Buddy page
type.

To prevent the reported pages from leaking outside of the buddy lists I
added a check to clear the PageReported bit in the del_page_from_free_list
function.  As a result any reported page that is split, merged, or
allocated will have the flag cleared prior to the PageBuddy value being
cleared.

The process for reporting pages is fairly simple.  Once we free a page
that meets the minimum order for page reporting we will schedule a
worker thread to start 2s or more in the future.  That worker thread will begin
working from the lowest supported page reporting order up to MAX_ORDER - 1
pulling unreported pages from the free list and storing them in the
scatterlist.

When processing each individual free list it is necessary for the worker
thread to release the zone lock when it needs to stop and report the full
scatterlist of pages.  To reduce the work of the next iteration the worker
thread will rotate the free list so that the first unreported page in
the free list becomes the first entry in the list.

It will then call a reporting function providing information on how many
entries are in the scatterlist.  Once the function completes it will
return the pages to the free area from which they were allocated and start
over pulling more pages from the free areas until there are no longer
enough pages to report on to keep the worker busy, or we have processed as
many pages as were contained in the free area when we started processing
the list.

The worker thread will work in a round-robin fashion making its way though
each zone requesting reporting, and through each reportable free list
within that zone.  Once all free areas within the zone have been processed
it will check to see if there have been any requests for reporting while
it was processing.  If so it will reschedule the worker thread to start up
again in roughly 2s and exit.

Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>

2b197730

mm: add function __putback_isolated_page · 06df03ed

由 Alexander Duyck 提交于 2月 22, 2020

to #26589565

commit 624f58d8f4639676d2fa1238425ab0148d501c4a upstream linux-next.

There are cases where we would benefit from avoiding having to go
through
the allocation and free cycle to return an isolated page.

Examples for this might include page poisoning in which we isolate a page
and then put it back in the free list without ever having actually
allocated it.

This will enable us to also avoid notifiers for the future free page
reporting which will need to avoid retriggering page reporting when
returning pages that have been reported on.

Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: NDavid Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>

06df03ed

mm: use zone and order instead of free area in free_list manipulators · 999c6c60

由 Alexander Duyck 提交于 2月 21, 2020

to #26589565

In order to enable the use of the zone from the list manipulator
functions
I will need access to the zone pointer.  As it turns out most of the
accessors were always just being directly passed &zone->free_area[order]
anyway so it would make sense to just fold that into the function itself
and pass the zone and order as arguments instead of the free area.

In order to be able to reference the zone we need to move the declaration
of the functions down so that we have the zone defined before we define
the list manipulation functions.  Since the functions are only used in the
file mm/page_alloc.c we can just move them there to reduce noise in the
header.

Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: NMel Gorman <mgorman@techsingularity.net>
Reviewed-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Reviewed-by: NPankaj Gupta <pagupta@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>

999c6c60

mm: move buddy list manipulations into helpers · 54c14b30

由 Dan Williams 提交于 2月 21, 2020

to #26589565

commit b03641af680959df57c275a80ff0dc116627c7ae upstream

In preparation for runtime randomization of the zone lists, take all
(well, most of) the list_*() functions in the buddy allocator and put
them in helper functions.  Provide a common control point for injecting
additional behavior when freeing pages.

[dan.j.williams@intel.com: fix buddy list helpers]
  Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
[vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
  Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
Tested-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>

54c14b30

mm/page_poison: expose page_poisoning_enabled to kernel modules · 1b7f4f07

由 Wei Wang 提交于 8月 27, 2018

to #26589565

commit d95f58f4a6ca002126c36c530fb096519c94baef upstream

In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.
Signed-off-by: NWei Wang <wei.w.wang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>

1b7f4f07

09 4月, 2020 1 次提交

mm/page_alloc.c: fix regression with deferred struct page init · 8b98dc27

由 Juergen Gross 提交于 7月 04, 2019

fix #26417771

commit b9705d8778e7adc97de38f405f835a2426e14d84 upstream.

Commit 0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time
instead of doing larger sections") is causing a regression on some
systems when the kernel is booted as Xen dom0.

The system will just hang in early boot.

Reason is an endless loop in get_page_from_freelist() in case the first
zone looked at has no free memory.  deferred_grow_zone() is always
returning true due to the following code snipplet:

  /* If the zone is empty somebody else may have cleared out the zone */
  if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
                                           first_deferred_pfn)) {
          pgdat->first_deferred_pfn = ULONG_MAX;
          pgdat_resize_unlock(pgdat, &flags);
          return true;
  }

This in turn results in the loop as get_page_from_freelist() is assuming
forward progress can be made by doing some more struct page
initialization.

Link: http://lkml.kernel.org/r/20190620160821.4210-1-jgross@suse.com
Fixes: 0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections")
Signed-off-by: NJuergen Gross <jgross@suse.com>
Suggested-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

8b98dc27

25 3月, 2020 1 次提交

alinux: mm, memcg: optimize division operation with memcg counters · b37fe7db

由 Xu Yu 提交于 3月 18, 2020

fix #25926771

Specifically, replace `val / 1000000` with `val >> 20` to do the
optimization.

This also fixes the possible compiling error when building with
ARCH=i386, which reports undefined reference to `__udivdi3`.

Fixes: 40969475 ("alinux: mm, memcg: record latency of memcg wmark reclaim")
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>

b37fe7db

18 3月, 2020 20 次提交

mm: fix tick timer stall during deferred page init · bdfadace

由 Shile Zhang 提交于 3月 10, 2020

commit 07447453db3aebb6a0917592f411a7122d12a8b9 upstream linux-next.

When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdatinit' kthread will
initialise the deferred pages with local interrupts disabled. It is
introduced by commit 3a2d7fa8 ("mm: disable interrupts while
initializing deferred pages").

On machine with NCPUS <= 2, the 'pgdatinit' kthread could be bound to
the boot CPU, which could caused the tick timer long time stall, system
jiffies not be updated in time.

The dmesg shown that:

    [    0.197975] node 0 initialised, 32170688 pages in 1ms

Obviously, 1ms is unreasonable.

Now, fix it by restore in the pending interrupts for every 32*1204 pages
(128MB) initialized, give the chance to update the systemd jiffies.
The reasonable demsg shown likes:

    [    1.069306] node 0 initialised, 32203456 pages in 894ms

Link: http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com
Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Co-developed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

bdfadace

alinux: mm, memcg: export workingset counters on memcg v1 · 27393b9b

由 Xu Yu 提交于 2月 22, 2020

This exports the workingset counters, i.e., workingset_refault,
workingset_activate, workingset_restore, and workingset_nodereclaim, to
memory cgroup v1.

The stat collection of these counters is shared between memory cgroup v1
and v2.  What this patch does is just to export them on memory cgroup v1.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

27393b9b

alinux: mm, memcg: abort priority oom if with oom victim · ec661706

由 Xu Yu 提交于 3月 13, 2020

Explicitly abort mem_cgroup_select_bad_process in priority oom if there
is already a task as oom victim without MMF_OOM_SKIP flag set.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>

ec661706

alinux: mm, memcg: account number of processes in the css · 2061acd6

由 Xu Yu 提交于 3月 13, 2020

Since commit e0205ae40f12 ("mm: memcontrol: use CSS_TASK_ITER_PROCS at
mem_cgroup_scan_tasks()") made mem_cgroup_scan_tasks() to check only one
thread from each thread group, we can make cgroup_subsys_state::nr_tasks
to record only the thread group leader, i.e., process, instead of
thread(s). Furthermore, this renames cgroup_subsys_state::nr_tasks to
cgroup_subsys_state::nr_procs.

Fixes: f061cd88 ("alinux: kernel: cgroup: account number of tasks in
the css and its descendants")
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

2061acd6

mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() · 7bf04cbb

由 Tetsuo Handa 提交于 7月 11, 2019

commit f168a9a54ec39b3f832c353733898b713b6b5c1f upstream.

Since commit c03cd7738a83 ("cgroup: Include dying leaders with live
threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
only one thread from each thread group.

[penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
  Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

7bf04cbb

alinux: mm, memcg: fix soft lockup in priority oom · fb4c5ea6

由 Xu Yu 提交于 3月 10, 2020

Assuming that there is a memory cgroup tree as follows:

        A (use_priority_oom=1, limit=2.5G)
       / \
      /   C (priority=3, usage=1.5G)
     B (priority=0, usage=1G)

As task in C (task-c) invokes oom-killer, task in B (task-b) is chosen
and killed, and then task-c returns from mem_cgroup_oom and retries in
try_charge.

If memory page_counter of B has not been reset yet, leading to task-c
invokes oom-killer again, the soft lockup may happen. In this situation,
task-c keeps selecting bad process in B, while the only task-b in B has
already been set PF_EXITING flag, which makes task-b skipped in
css_task_iter_advance.

Finally, task-c selected no bad process in B and keeps retrying, and
task-b is stalled in synchronize_rcu when do_exit, exit_task_namespaces
specifically.

In a nutshell, the new behavior of css_task_iter_advance, i.e., commit
c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS
iterations"), causes priority oom to misbehave.

This fixes the soft lockup by accounting num_oom_skip of the victim
memcg and its parents (sift up to oc->memcg), if no bad process is
chosen from it.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

fb4c5ea6

alinux: mm, memcg: record latency of memcg wmark reclaim · 40969475

由 Xu Yu 提交于 12月 28, 2019

The memcg background async page reclaim, a.k.a, memcg kswapd, is
implemented with a dedicated unbound workqueue currently.

However, memcg kswapd will run too frequently, resulting in high
overhead, page cache thrashing, frequent dirty page writeback, etc., due
to improper memcg memory.wmark_ratio, unreasonable memcg memor capacity,
or even abnormal memcg memory usage.

We need to find out the problematic memcg(s) where memcg kswapd
introduces significant overhead.

This records the latency of each run of memcg kswapd work, and then
aggregates into the exstat of per memcg.
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

40969475

mm: fix trying to reclaim unevictable lru page when calling madvise_pageout · 88ec97ec

由 zhong jiang 提交于 11月 15, 2019

commit 82072962973008201b817fae1896512977dd5083 upstream

Recently, I hit the following issue when running upstream.

  kernel BUG at mm/vmscan.c:1521!
  invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
  RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
  Call Trace:
   reclaim_pages+0x499/0x800 mm/vmscan.c:2188
   madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
   walk_pmd_range mm/pagewalk.c:53 [inline]
   walk_pud_range mm/pagewalk.c:112 [inline]
   walk_p4d_range mm/pagewalk.c:139 [inline]
   walk_pgd_range mm/pagewalk.c:166 [inline]
   __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
   walk_page_range+0x179/0x310 mm/pagewalk.c:349
   madvise_pageout_page_range mm/madvise.c:506 [inline]
   madvise_pageout+0x1f0/0x330 mm/madvise.c:542
   madvise_vma mm/madvise.c:931 [inline]
   __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
   do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

madvise_pageout() accesses the specified range of the vma and isolates
them, then runs shrink_page_list() to reclaim its memory.  But it also
isolates the unevictable pages to reclaim.  Hence, we can catch the
cases in shrink_page_list().

The root cause is that we scan the page tables instead of specific LRU
list.  and so we need to filter out the unevictable lru pages from our
end.

Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
Signed-off-by: Nzhong jiang <zhongjiang@huawei.com>
Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMinchan Kim <minchan@kernel.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

88ec97ec

mm: factor out common parts between MADV_COLD and MADV_PAGEOUT · b6a18a3c

由 Minchan Kim 提交于 9月 25, 2019

commit d616d5126503967bf365db0711ee3c78b356efe9 upstream

There are many common parts between MADV_COLD and MADV_PAGEOUT.
This patch factor them out to save code duplication.

Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: kbuild test robot <lkp@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

b6a18a3c

mm: introduce MADV_PAGEOUT · 23757dcc

由 Minchan Kim 提交于 9月 25, 2019

commit 1a4e58cce84ee88129d5d49c064bd2852b481357 upstream

When a process expects no accesses to a certain memory range for a long
time, it could hint kernel that the pages can be reclaimed instantly but
data should be preserved for future use.  This could reduce workingset
eviction so it ends up increasing performance.

This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
MADV_PAGEOUT can be used by a process to mark a memory range as not
expected to be used for a long time so that kernel reclaims *any LRU*
pages instantly.  The hint can help kernel in deciding which pages to
evict proactively.

A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
intentionally because it's automatically bounded by PMD size.  If PMD
size(e.g., 256) makes some trouble, we could fix it later by limit it to
SWAP_CLUSTER_MAX[1].

- man-page material

MADV_PAGEOUT (since Linux x.x)

Do not expect access in the near future so pages in the specified
regions could be reclaimed instantly regardless of memory pressure.
Thus, access in the range after successful operation could cause
major page fault but never lose the up-to-date contents unlike
MADV_DONTNEED. Pages belonging to a shared mapping are only processed
if a write access is allowed for the calling process.

MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
VM_PFNMAP pages.

[1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/

[minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
  Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
Reported-by: Nkbuild test robot <lkp@intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

23757dcc

mm: introduce MADV_COLD · 1af766e8

由 Minchan Kim 提交于 9月 25, 2019

commit 9c276cc65a58faf98be8e56962745ec99ab87636 upstream

Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

- Background

The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot
start.  While we continually try to improve the performance of cold
starts, hot starts will always be significantly less power hungry as well
as faster so we are trying to make hot start more likely than cold start.

To increase hot start, Android userspace manages the order that apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked
list for lmkd(low memory killer daemon).  They are likely to be killed by
lmkd if the system has to reclaim memory.  In that sense they are similar
to entries in any other cache.  Those apps are kept alive for
opportunistic performance improvements but those performance improvements
will vary based on the memory requirements of individual workloads.

- Problem

Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap.  Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed(we measured performance swapping out vs.
zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
times faster even though we use zram which is much faster than real
storage) so kill from lmkd will often satisfy the high zone watermark,
resulting in very few pages actually being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd by
reclaiming apps as soon as they entered the cached state.  Additionally,
it could provide many chances for platform to use much information to
optimize memory efficiency.

To achieve the goal, the patchset introduce two new options for madvise.
One is MADV_COLD which will deactivate activated pages and the other is
MADV_PAGEOUT which will reclaim private pages instantly.  These new
options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
ways to gain some free memory space.  MADV_PAGEOUT is similar to
MADV_DONTNEED in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed immediately; MADV_COLD is similar
to MADV_FREE in a way that it hints the kernel that memory region is not
currently needed and should be reclaimed when memory pressure rises.

This patch (of 5):

When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use.  This could reduce
workingset eviction so it ends up increasing performance.

This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future.  The hint can help kernel in deciding which
pages to evict early during memory pressure.

It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves

	active file page -> inactive file LRU
	active anon page -> inacdtive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
LRU's head because MADV_COLD is a little bit different symantic.
MADV_FREE means it's okay to discard when the memory pressure because the
content of the page is *garbage* so freeing such pages is almost zero
overhead since we don't need to swap out and access afterward causes just
minor fault.  Thus, it would make sense to put those freeable pages in
inactive file LRU to compete other used-once pages.  It makes sense for
implmentaion point of view, too because it's not swapbacked memory any
longer until it would be re-dirtied.  Even, it could give a bonus to make
them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
garbage so reclaiming them requires swap-out/in in the end so it's bigger
cost.  Since we have designed VM LRU aging based on cost-model, anonymous
cold pages would be better to position inactive anon's LRU list, not file
LRU.  Furthermore, it would help to avoid unnecessary scanning if system
doesn't have a swap device.  Let's start simpler way without adding
complexity at this moment.  However, keep in mind, too that it's a caveat
that workloads with a lot of pages cache are likely to ignore MADV_COLD on
anonymous memory because we rarely age anonymous LRU lists.

* man-page material

MADV_COLD (since Linux x.x)

Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies.  In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.

MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.

[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
Reported-by: Nkbuild test robot <lkp@intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

1af766e8

mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM · a0747c91

由 Minchan Kim 提交于 9月 25, 2019

commit 8940b34a4e082ae11498ddae8432f2ac07685d1c upstream

The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
as default.  It is for preventing to reclaim dirty pages when CMA try to
migrate pages.  Strictly speaking, we don't need it because CMA didn't
allow to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.

Moreover, it has a problem to prevent anonymous pages's swap out even
though force_reclaim = true in shrink_page_list on upcoming patch.  So
this patch makes references's default value to PAGEREF_RECLAIM and rename
force_reclaim with ignore_references to make it more clear.

This is a preparatory work for next patch.

Link: http://lkml.kernel.org/r/20190726023435.214162-3-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: kbuild test robot <lkp@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

a0747c91

alinux: mm: add proc interface to control context readahead · 2e38a0f2

由 Xiaoguang Wang 提交于 3月 09, 2020

For some workloads whose io activities are mostly random, context
readahead feature can introduce unnecessary io read operations, which
will impact app's performance. Context readahead's algorithm is
straightforward and not that smart.

This patch adds "/proc/sys/vm/enable_context_readahead" to control
whether to disable or enable this feature. Currently we enable context
readahead default, user can echo 0 to /proc/sys/vm/enable_context_readahead
to disable context readahead.

We also have tested mongodb's performance in 'random point select' case,
With context readahead enabled:
  mongodb eps 12409
With context readahead disabled:
  mongodb eps 14443
About 16% performance improvement.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

2e38a0f2

alinux: doc: use unified official project name Cloud Kernel · a60721b9

由 Caspar Zhang 提交于 2月 22, 2020

Cloud Kernel is the official name of our project, this patch unitizes
the project names used in docs and comments.
Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

a60721b9

alinux: mm: oom_kill: show killed task's cgroup info in global oom · 5028e358

由 Wenwei Tao 提交于 9月 20, 2019

Some users want to know the killed task's cgroup info in global
oom, this message would help them to make upper decision.
Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

5028e358

alinux: mm: memcontrol: enable oom.group on cgroup-v1 · 7d41295c

由 Wenwei Tao 提交于 9月 10, 2019

Enable oom.group on cgroup-v1.
Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

7d41295c

alinux: mm: memcontrol: introduce memcg priority oom · 52e375fc

由 Wenwei Tao 提交于 8月 23, 2019

Under memory pressure reclaim and oom would happen, with multiple
cgroups exist in one system, we might want some of their memory
or tasks survived the reclaim and oom while there are other
candidates.

The @memory.low and @memory.min have make that happen during reclaim,
this patch introduces memcg priority oom to meet above requirement in
the oom.

The priority is from 0 to 12, the higher number the higher priority.
When oom happens it always choose victim from low priority memcg.
And it works both for memcg oom and global oom, it can be enabled/disabled
through @memory.use_priority_oom, for global oom through the root
memcg's @memory.use_priority_oom, it is disabled by default.
Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

52e375fc

alinux: memcg: Account throttled time due to memory.wmark_min_adj · ef467b9d

由 Xunlei Pang 提交于 9月 01, 2019

Accessing original memory.stat turned out to be one heavy
operation which has been caused many real product problems.

Introduce new cgroup memory.exstat, memory.exstat stands
for "extra/extended memory.stat", which contains dedicated
statistics from Alibaba Clould Kernel.

memory.exstat is supposed to provide hierarchical statistics.

Export its first "wmark_min_throttled_ms", and will add more
like direct reclaim, direct compaction, etc.
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

ef467b9d

alinux: memcg: Introduce memory.wmark_min_adj · 60be0f54

由 Xunlei Pang 提交于 8月 27, 2019

In co-location environment, there are more or less some memory
overcommitment, then BATCH tasks may break the shared global min
watermark resulting in all types of applications falling into
the direct reclaim slow path hurting the RT of LS tasks.
(NOTE: BATCH tasks tolerate big latency spike even in seconds
as long as doesn't hurt its overal throughput. While LS tasks
are very Latency-Sensitive, they may time out or fail in case
of sudden latency spike lasts like hundreds of ms typically.)

Actually BATCH tasks are not sensitive to memory latency, they
can be assigned a strict min watermark which is different from
that of LS tasks(which can be aissgned a lenient min watermark
accordingly), thus isolating each other in case of global memory
allocation. This is kind of like the idea behind ALLOC_HARDER
for rt_task(), see gfp_to_alloc_flags().

memory.wmark_min_adj stands for memcg global WMARK_MIN adjustment,
it is used to realize separate min watermarks above-mentioned for
memcgs, its valid value is within [-25, 50], specifically:
negative value means to be relative to [0, WMARK_MIN],
positive value means to be relative to [WMARK_MIN, WMARK_LOW].
For examples,
  -25 means "WMARK_MIN + (WMARK_MIN - 0) * (-25%)"
   50 means "WMARK_MIN + (WMARK_LOW - WMARK_MIN) * 50%"

Note that the minimum -25 is what ALLOC_HARDER uses which is safe
for us to adopt, and the maximum 50 is one experienced value.

Negative memory.wmark_min_adj means high QoS requirements, it can
allocate below the global WMARK_MIN, which is kind of like the idea
behind ALLOC_HARDER, see gfp_to_alloc_flags().

Positive memory.wmark_min_adj means low QoS requirements, thus when
allocation broke memcg min watermark, it should trigger direct reclaim
traditionally, and we trigger throttle instead to further prevent
them from disturbing others.

With this interface, we can assign positive values for BATCH memcgs
and negative values for LS memcgs.

memory.wmark_min_adj default value is 0, and inherit from its parent,
Note that the final effective wmark_min_adj will consider all the
hierarchical values, its value is the maximal(most conservative)
wmark_min_adj along the hierarchy but excluding intermediate default
values(zero).
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

60be0f54

alinux: memcg: Provide users the ability to reap zombie memcgs · 63442ea9

由 Xunlei Pang 提交于 5月 06, 2019

After memcg was deleted, page caches still reference to this memcg
causing large number of dead(zombie) memcgs in the system. Then it
slows down access to "/sys/fs/cgroup/cpu/memory.stat", etc due to
tons of iterations, further causing various latencies.

This patch introduces two ways to reclaim these zombie memcgs.
1) Background kthread reaper
Introduce a kernel thread "memcg_zombie_reaper" to reclaim zombie
memcgs at background periodically.

Several knobs are also added to control the reaper scan frequency:
- /sys/kernel/mm/memcg_reaper/scan_interval
  The scan period in second. Default 5s.
- /sys/kernel/mm/memcg_reaper/pages_scan
  The scan rate of pages per scan. Default 1310720(5GiB for 4KiB page).
- /sys/kernel/mm/memcg_reaper/verbose
  Output some zombie memcg information for debug purpose. Default off.
- /sys/kernel/mm/memcg_reaper/reap_background
  "on/off" switch. Default "0" means off. Write "1" to switch it on.

2) One-shot trigger by users
- /sys/kernel/mm/memcg_reaper/reap
  Write "1" to trigger one round of zombie memcg reaping, but without
  any guarantee, you may need to launch multiple rounds as needed.
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

63442ea9

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功