- 27 July 2023, 4 commits
-
-
Submitted by Xu Qiang
hulk inclusion
category: other
bugzilla: https://gitee.com/openeuler/kernel/issues/I6GI0X

----------------------------------------------

In sp_update_process_stat(), node_list is traversed, so the traversal must be protected by the corresponding lock.

Signed-off-by: Xu Qiang <xuqiang36@huawei.com>
-
Submitted by Xu Qiang
hulk inclusion
category: other
bugzilla: https://gitee.com/openeuler/kernel/issues/I6GI0X

----------------------------------------------

SPG_FLAG_NON_DVPP is no longer used in downstream systems.

Signed-off-by: Xu Qiang <xuqiang36@huawei.com>
-
Submitted by Xu Qiang
hulk inclusion
category: other
bugzilla: https://gitee.com/openeuler/kernel/issues/I6GI0X

----------------------------------------------

The members of sp_spa_stat are changed to atomic64_t to solve the concurrency problem. spa_stat no longer needs sp_area_lock protection.

Signed-off-by: Xu Qiang <xuqiang36@huawei.com>
-
Submitted by Chen Jun
hulk inclusion
category: feature
bugzilla: N/A

--------------------------------

Support allocating memory from a set of nodes.

mg_sp_alloc allows allocating memory from a single node. If that node does not have enough memory, the caller has to pick the next node and retry, which incurs a lot of overhead. To improve performance, add a new interface that allocates memory from several nodes (see the sketch below).

Signed-off-by: Chen Jun <chenjun102@huawei.com>
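The new share_pool entry point itself is not reproduced in this log; the following is a rough sketch of the idea — one call walking a caller-supplied nodemask instead of per-node retries in the caller — using standard NUMA allocation primitives. The helper name and shape are assumptions, not the actual patch.

    #include <linux/gfp.h>
    #include <linux/nodemask.h>

    /* Illustrative sketch only; not the actual share_pool interface. */
    static struct page *sp_alloc_from_nodes(gfp_t gfp, unsigned int order,
                                            nodemask_t *nodes)
    {
        int nid;

        for_each_node_mask(nid, *nodes) {
            /* __GFP_THISNODE keeps each attempt on the requested node. */
            struct page *page = alloc_pages_node(nid, gfp | __GFP_THISNODE,
                                                 order);
            if (page)
                return page;
        }

        return NULL;    /* no node in the mask had enough memory */
    }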
-
- 25 July 2023, 1 commit
-
-
Submitted by Johannes Weiner
stable inclusion
from stable-v5.10.157
commit d925dd3e444cb7f0fab0208fed82673fd61f9765
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7MU59
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=d925dd3e444cb7f0fab0208fed82673fd61f9765

--------------------------------

commit f53af428 upstream.

During proactive reclaim, we sometimes observe severe overreclaim, with several thousand times more pages reclaimed than requested.

This trace was obtained from shrink_lruvec() during such an instance:

    prio:0 anon_cost:1141521 file_cost:7767
    nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
    nr=[7161123 345 578 1111]

While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it by swapping. These requests take over a minute, during which the write() to memory.reclaim is unkillably stuck inside the kernel.

Digging into the source, this is caused by the proportional reclaim bailout logic. This code tries to resolve a fundamental conflict: to reclaim roughly what was requested, while also aging all LRUs fairly and in accordance with their size, swappiness, refault rates etc. The way it attempts fairness is that once the reclaim goal has been reached, it stops scanning the LRUs with the smaller remaining scan targets, and adjusts the remainder of the bigger LRUs according to how much of the smaller LRUs was scanned. It then finishes scanning that remainder regardless of the reclaim goal.

This works fine if priority levels are low and the LRU lists are comparable in size. However, in this instance, the cgroup that is targeted by proactive reclaim has almost no files left - they've already been squeezed out by proactive reclaim earlier - and the remaining anon pages are hot. Anon rotations cause the priority level to drop to 0, which results in reclaim targeting all of anon (a lot) and all of file (almost nothing). By the time reclaim decides to bail, it has scanned most or all of the file target, and therefore must also scan most or all of the enormous anon target. This target is thousands of times larger than the reclaim goal, thus causing the overreclaim.

The bailout code hasn't changed in years, why is this failing now? The most likely explanations are two other recent changes in anon reclaim:

1. Before the series starting with commit 5df74196 ("mm: fix LRU balancing effect of new transparent huge pages"), the VM was overall relatively reluctant to swap at all, even if swap was configured. This means the LRU balancing code didn't come into play as often as it does now, and mostly in high pressure situations where pronounced swap activity wouldn't be as surprising.

2. For historic reasons, shrink_lruvec() loops on the scan targets of all LRU lists except the active anon one, meaning it would bail if the only remaining pages to scan were active anon - even if there were a lot of them. Before the series starting with commit ccc5dc67 ("mm/vmscan: make active/inactive ratio as 1:1 for anon lru"), most anon pages would live on the active LRU; the inactive one would contain only a handful of preselected reclaim candidates. After the series, anon gets aged similarly to file, and the inactive list is the default for new anon pages as well, making it often the much bigger list.

As a result, the VM is now more likely to actually finish large anon targets than before.

Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the larger LRU lists is made before bailing out on a met reclaim goal.

This fixes the extreme overreclaim problem.

Fairness is more subtle and harder to evaluate. No obvious misbehavior was observed on the test workload, in any case. Conceptually, fairness should primarily be a cumulative effect from regular, lower priority scans. Once the VM is in trouble and needs to escalate scan targets to make forward progress, fairness needs to take a backseat. This is also acknowledged by the myriad exceptions in get_scan_count(). This patch makes fairness decrease gradually, as it keeps fairness work static over increasing priority levels with growing scan targets. This should make more sense - although we may have to re-visit the exact values.

Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: sanglipeng <sanglipeng1@jd.com>
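As a rough illustration of the described change — a simplified sketch of the "one SWAP_CLUSTER_MAX-sized nudge" idea, not the upstream diff:

    #include <linux/mmzone.h>
    #include <linux/swap.h>     /* SWAP_CLUSTER_MAX */

    /*
     * Simplified sketch: once the reclaim goal is met, the remaining scan
     * targets get at most one SWAP_CLUSTER_MAX-sized batch instead of
     * being finished in full.
     */
    static void cap_scan_targets_after_goal(unsigned long nr[NR_LRU_LISTS])
    {
        enum lru_list lru;

        for_each_evictable_lru(lru)
            nr[lru] = min_t(unsigned long, nr[lru], SWAP_CLUSTER_MAX);
    }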
-
- 20 July 2023, 2 commits
-
-
Submitted by Alexander Potapenko
stable inclusion from stable-v5.10.156 commit 294ef12dccc6de01de3322b21a0c235474952b63 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7MCG1 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=294ef12dccc6de01de3322b21a0c235474952b63 -------------------------------- commit 1468c6f4 upstream. Functions implementing the a_ops->write_end() interface accept the `void *fsdata` parameter that is supposed to be initialized by the corresponding a_ops->write_begin() (which accepts `void **fsdata`). However not all a_ops->write_begin() implementations initialize `fsdata` unconditionally, so it may get passed uninitialized to a_ops->write_end(), resulting in undefined behavior. Fix this by initializing fsdata with NULL before the call to write_begin(), rather than doing so in all possible a_ops implementations. This patch covers only the following cases found by running x86 KMSAN under syzkaller: - generic_perform_write() - cont_expand_zero() and generic_cont_expand_simple() - page_symlink() Other cases of passing uninitialized fsdata may persist in the codebase. Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.comSigned-off-by: NAlexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Nsanglipeng <sanglipeng1@jd.com>
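A minimal sketch of the pattern the fix applies in the listed callers, assuming the 5.10-era address_space_operations prototypes (a fragment, not a complete function):

    /* Initialize fsdata before ->write_begin() so that ->write_end() never
     * sees an uninitialized pointer, even if a ->write_begin()
     * implementation does not set it. */
    struct page *page;
    void *fsdata = NULL;    /* previously left uninitialized on the stack */
    int err;

    err = aops->write_begin(file, mapping, pos, len, flags, &page, &fsdata);
    if (err < 0)
        return err;

    /* ... copy `copied` bytes into the page ... */

    err = aops->write_end(file, mapping, pos, len, copied, page, fsdata);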
-
Submitted by Alban Crequy
stable inclusion
from stable-v5.10.156
commit db744288af730abb66312f40b087d1dbf794c5f4
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7MCG1
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=db744288af730abb66312f40b087d1dbf794c5f4

--------------------------------

commit 8678ea06 upstream.

If a page fault occurs while copying the first byte, this function resets one byte before dst. As a consequence, an address could be modified, leading to kernel crashes in case the modified address was accessed later.

Fixes: b58294ea ("maccess: allow architectures to provide kernel probing directly")
Signed-off-by: Alban Crequy <albancrequy@linux.microsoft.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Francis Laniel <flaniel@linux.microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org> [5.8]
Link: https://lore.kernel.org/bpf/20221110085614.111213-2-albancrequy@linux.microsoft.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: sanglipeng <sanglipeng1@jd.com>
-
- 18 July 2023, 1 commit
-
-
Submitted by Pankaj Gupta
stable inclusion
from stable-v5.10.155
commit 0b692d41ee5c88097ecf5dbb37c59083044c996a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7M5F4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0b692d41ee5c88097ecf5dbb37c59083044c996a

--------------------------------

commit 867400af upstream.

virtio_pmem uses devm_memremap_pages() to map the device memory. By default this memory is mapped as encrypted with SEV. A guest reboot changes the current encryption key, so the guest no longer properly decrypts the FSDAX device metadata.

Mark the corresponding device memory region for FSDAX devices (mapped with memremap_pages) as decrypted to retain the persistent memory property.

Link: https://lkml.kernel.org/r/20221102160728.3184016-1-pankaj.gupta@amd.com
Fixes: b7b3c01b ("mm/memremap_pages: support multiple ranges per invocation")
Signed-off-by: Pankaj Gupta <pankaj.gupta@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: sanglipeng <sanglipeng1@jd.com>
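A hedged sketch of the direction of the fix; the helper name is hypothetical, and the exact upstream hunk adjusts the protection inside memremap_pages() itself:

    #include <linux/memory_hotplug.h>
    #include <linux/memremap.h>
    #include <linux/pgtable.h>

    /* Sketch: FS_DAX device ranges get a decrypted protection so an SEV
     * guest can still read the pmem metadata after a reboot changes the
     * encryption key. */
    static void mark_fsdax_range_decrypted(struct dev_pagemap *pgmap,
                                           struct mhp_params *params)
    {
        if (pgmap->type == MEMORY_DEVICE_FS_DAX)
            params->pgprot = pgprot_decrypted(params->pgprot);
    }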
-
- 10 July 2023, 1 commit
-
-
Submitted by liubo
euleros inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7JI6K
CVE: NA

----------------------------------------------------

In the swapcache recycling process, the number of pages to be reclaimed on each node is obtained as follows:

    nr_to_reclaim[nid_num] = (swapcache_to_reclaim /
                              (swapcache_total_reclaimable / nr[nid_num]));

However, nr[nid_num] is obtained by traversing the number of swapcache pages on each node. If there are multiple nodes in the environment and no swapping has occurred on a node, that node has no swapcache pages and nr[nid_num] may be 0. Therefore, a division-by-zero error may occur (see the sketch below).

Signed-off-by: liubo <liubo254@huawei.com>
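A minimal sketch of such a guard, reusing the variable names from the commit text; the surrounding loop and data structures are assumptions, not the actual openEuler source:

    /* Skip nodes with no reclaimable swapcache instead of dividing by 0. */
    for (nid = 0; nid < nid_num; nid++) {
        if (!nr[nid] || !swapcache_total_reclaimable) {
            nr_to_reclaim[nid] = 0;     /* no swapcache on this node */
            continue;
        }
        nr_to_reclaim[nid] = swapcache_to_reclaim /
                             (swapcache_total_reclaimable / nr[nid]);
    }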
-
- 26 June 2023, 3 commits
-
-
Submitted by Joao Martins
mainline inclusion
from mainline-v6.2-rc1
commit 11aad263
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6SROX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11aad2631bf74b3c811dee76154702aab855a323

--------------------------------

Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed back to the page allocator is as follows: for a 2M hugetlb page it will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially, that means that it breaks the first 4K of a potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For this reason the memory that is freed back to the page allocator cannot be used for hugetlb to allocate huge pages of the same size, but rather only of a smaller huge page size:

Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node having 64G):

  * Before allocation:

    Free pages count per migrate type at order   0   1   2   3   4   5   6   7   8   9  10
    ...
    Node 0, zone Normal, type Movable          340 100  32  15   1   2   0   0   0   1 15558

    $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    31987

  * After:

    Node 0, zone Normal, type Movable        30893 32006 31515   7   0   0   0   0   0   0   0

Notice how the memory freed back is put back into 4K / 8K / 16K page pools. And it allocates a total of 31987 pages (63974M).

To fix this behaviour, rather than remapping the second vmemmap page (thus breaking the contiguous block of memory backing the struct pages), repopulate the first vmemmap page with a new one. We allocate and copy from the currently mapped vmemmap page, and then remap it later on. The same algorithm works if there's a pre initialized walk::reuse_page and the head page doesn't need to be skipped and instead we remap it when the @addr being changed is the @reuse_addr. The new head page is allocated in vmemmap_remap_free() given that on restore there's no need for functional change.

Note that, because right now one hugepage is remapped at a time, only one free 4K page at a time is needed to remap the head page. Should it fail to allocate said new page, it reuses the one that's already mapped just like before. As a result, for every 64G of contiguous hugepages it can give back 1G more of contiguous memory per 64G, while needing in total 128M new 4K pages (for 2M hugetlb) or 256k (for 1G hugetlb).

After the changes, try to assign a 64G node to hugetlb (on a 128G 2node guest, each node with 64G):

  * Before allocation:

    Free pages count per migrate type at order   0   1   2   3   4   5   6   7   8   9  10
    ...
    Node 0, zone Normal, type Movable            1   1   1   0   0   1   0   0   1   1 15564

    $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    32394

  * After:

    Node 0, zone Normal, type Movable            0  50  97 108  96  81  70  46  18   0   0

In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M out of the 32394 (64788M) allocated. So the memory freed back is indeed being used back in hugetlb and there's no massive amount of order-0..order-2 pages accumulated unused.

[joao.m.martins@oracle.com: v3]
  Link: https://lkml.kernel.org/r/20221109200623.96867-1-joao.m.martins@oracle.com
[joao.m.martins@oracle.com: add smp_wmb() to ensure page contents are visible prior to PTE write]
  Link: https://lkml.kernel.org/r/20221110121214.6297-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20221107153922.77094-1-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
  mm/hugetlb_vmemmap.c
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Peng Liu
mainline inclusion
from mainline-v5.19-rc1
commit f87442f4
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OWV4
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f87442f407af80dac4dc81c8a7772b71b36b2e09

--------------------------------

Hugepages can be specified per node since "hugetlbfs: extend the definition of hugepages parameter to support node allocation", but the following problem is observed.

Confusing behavior is observed when both 1G and 2M hugepages are set after "numa=off".

cmdline hugepage settings:

    hugepagesz=1G hugepages=0:3,1:3
    hugepagesz=2M hugepages=0:1024,1:1024

results:

    HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
    HugeTLB registered 2.00 MiB page size, pre-allocated 1024 pages

Furthermore, confusing behavior can also be observed when an invalid node follows a valid node.

To fix this, never allocate any typical hugepage when an invalid parameter is received.

Link: https://lkml.kernel.org/r/20220413032915.251254-3-liupeng256@huawei.com
Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Yuntao <liuyuntao10@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Peng Liu
mainline inclusion from mainline-v5.19-rc1 commit 0a7a0f6f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6OWV4 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a7a0f6f7f3679c906fc55e3805c1d5e2c566f55 -------------------------------- Patch series "hugetlb: Fix some incorrect behavior", v3. This series fix three bugs of hugetlb: 1) Invalid use of nr_online_nodes; 2) Inconsistency between 1G hugepage and 2M hugepage; 3) Useless information in dmesg. This patch (of 4): Certain systems are designed to have sparse/discontiguous nodes. In this case, nr_online_nodes can not be used to walk through numa node. Also, a valid node may be greater than nr_online_nodes. However, in hugetlb, it is assumed that nodes are contiguous. For sparse/discontiguous nodes, the current code may treat a valid node as invalid, and will fail to allocate all hugepages on a valid node that "nid >= nr_online_nodes". As David suggested: if (tmp >= nr_online_nodes) goto invalid; Just imagine node 0 and node 2 are online, and node 1 is offline. Assuming that "node < 2" is valid is wrong. Recheck all the places that use nr_online_nodes, and repair them one by one. [liupeng256@huawei.com: v4] Link: https://lkml.kernel.org/r/20220416103526.3287348-1-liupeng256@huawei.com Link: https://lkml.kernel.org/r/20220413032915.251254-1-liupeng256@huawei.com Link: https://lkml.kernel.org/r/20220413032915.251254-2-liupeng256@huawei.com Fixes: 4178158e ("hugetlbfs: fix issue of preallocation of gigantic pages can't work") Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation") Fixes: e79ce983 ("hugetlbfs: fix a truncation issue in hugepages parameter") Fixes: f9317f77 ("hugetlb: clean up potential spectre issue warnings") Signed-off-by: NPeng Liu <liupeng256@huawei.com> Suggested-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NDavidlohr Bueso <dave@stgolabs.net> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Liu Yuntao <liuyuntao10@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: mm/hugetlb.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
- 25 June 2023, 8 commits
-
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

Since we support per-memcg swapfile control, we need a per-type slot cache to optimize performance. To reduce memory waste, allocate the per-type slot cache only when the feature is enabled or when the corresponding swap device is onlined.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

With the memory.swapfile interface, the swap devices available to a memcg can be limited. The acceptable parameters are 'all', 'none' and a valid swap device.

Usage:
    echo /dev/zram0 > memory.swapfile

If the swap device goes offline, the setting falls back to 'none'.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

memsw can't limit the usage of swap space on its own. Add the memory.swap.max interface to limit the difference between memsw.usage and memory.usage. Since a page may occupy both a swap entry and a swap cache page, this value is not exactly equal to swap.usage.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

Add a new per-memcg swapin interface to load data into memory in advance to improve access efficiency.

Usage:
    # echo 0 > memory.force_swapin

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

Introduce memcg swap QoS, which covers the subsequent sub-features. Add CONFIG_MEMCG_SWAP_QOS and the static key memcg_swap_qos_key.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

Add an anon/file selector to the memory.reclaim interface so that reclaim can be limited to one type of pages. The LRU algorithm reclaims cold pages and balances between file and anon, but it doesn't consider the speed of the backing device. For example, with a zram device, reclaiming anon pages may have less impact on performance. So extend the memory.reclaim interface to reclaim only one type of pages.

Usage:
    "echo <size> type=anon > memory.reclaim"
    "echo <size> type=file > memory.reclaim"

The previous format remains supported.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
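A sketch of how the extended "<size> [type=anon|file]" argument could be parsed; the flag and helper names are illustrative, not the actual openEuler identifiers:

    #include <linux/bits.h>
    #include <linux/errno.h>
    #include <linux/page_counter.h>
    #include <linux/string.h>

    #define RECLAIM_TYPE_ANON   BIT(0)  /* illustrative flag names */
    #define RECLAIM_TYPE_FILE   BIT(1)

    static int parse_reclaim_arg(char *buf, unsigned long *nr_pages,
                                 unsigned int *type)
    {
        char *opt = strchr(buf, ' ');

        *type = RECLAIM_TYPE_ANON | RECLAIM_TYPE_FILE;  /* default: both */
        if (opt) {
            *opt++ = '\0';
            if (!strcmp(strstrip(opt), "type=anon"))
                *type = RECLAIM_TYPE_ANON;
            else if (!strcmp(strstrip(opt), "type=file"))
                *type = RECLAIM_TYPE_FILE;
            else
                return -EINVAL;
        }

        /* Size part is parsed exactly as before, in pages. */
        return page_counter_memparse(strstrip(buf), "", nr_pages);
    }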
-
Submitted by Yosry Ahmed
mainline inclusion from mainline-v6.0-rc1 commit 73b73bac category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=73b73bac90d97400e29e585c678c4d0ebfd2680d -------------------------------- memory.reclaim is a cgroup v2 interface that allows users to proactively reclaim memory from a memcg, without real memory pressure. Reclaim operations invoke vmpressure, which is used: (a) To notify userspace of reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being under memory pressure for networking (see mem_cgroup_under_socket_pressure()). For (a), vmpressure notifications in v1 are not affected by this change since memory.reclaim is a v2 feature. For (b), the effects of the vmpressure signal (according to Shakeel [1]) are as follows: 1. Reducing send and receive buffers of the current socket. 2. May drop packets on the rx path. 3. May throttle current thread on the tx path. Since proactive reclaim is invoked directly by userspace, not by memory pressure, it makes sense not to throttle networking. Hence, this change makes sure that proactive reclaim caused by memory.reclaim does not trigger vmpressure. [1] https://lore.kernel.org/lkml/CALvZod68WdrXEmBpOkadhB5GPYmCXaDZzXH=yyGOCAjFRn4NDQ@mail.gmail.com/ [yosryahmed@google.com: update documentation] Link: https://lkml.kernel.org/r/20220721173015.2643248-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220714064918.2576464-1-yosryahmed@google.comSigned-off-by: NYosry Ahmed <yosryahmed@google.com> Acked-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
Submitted by David Hildenbrand
mainline inclusion from mainline-v5.11-rc1 commit 8dc4bb58 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7F3HQ CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8dc4bb58a146655eb057247d7c9d19e73928715b -------------------------------- virtio-mem soon wants to use offline_and_remove_memory() memory that exceeds a single Linux memory block (memory_block_size_bytes()). Let's remove that restriction. Let's remember the old state and try to restore that if anything goes wrong. While re-onlining can, in general, fail, it's highly unlikely to happen (usually only when a notifier fails to allocate memory, and these are rather rare). This will be used by virtio-mem to offline+remove memory ranges that are bigger than a single memory block - for example, with a device block size of 1 GiB (e.g., gigantic pages in the hypervisor) and a Linux memory block size of 128MB. While we could compress the state into 2 bit, using 8 bit is much easier. This handling is similar, but different to acpi_scan_try_to_offline(): a) We don't try to offline twice. I am not sure if this CONFIG_MEMCG optimization is still relevant - it should only apply to ZONE_NORMAL (where we have no guarantees). If relevant, we can always add it. b) acpi_scan_try_to_offline() simply onlines all memory in case something goes wrong. It doesn't restore previous online type. Let's do that, so we won't overwrite what e.g., user space configured. Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NDavid Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20201112133815.13332-28-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
-
- 21 June 2023, 1 commit
-
-
Submitted by Ma Wupeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I77BDW
CVE: NA

--------------------------------

During copy_present_pte(), the rss counter is increased but the corresponding reliable page counter is not updated. This leads to a reliable page counter mismatch. Fix this by updating the reliable page counter there as well.

Fixes: d81e9624 ("proc: Count reliable memory usage of reliable tasks")
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Nanyong Sun <sunnanyong@huawei.com>
-
- 16 June 2023, 1 commit
-
-
Submitted by Kang Chen
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NYW4
CVE: NA

--------------------------------

Raw call flow:

    oom_kill_process
      -> mem_cgroup_scan_tasks(.., .., message)
        -> memcg_print_bad_task(message, ..)

message is of type "const char *" and is incorrectly cast to "struct oom_control *" in memcg_print_bad_task. Fix it by moving memcg_print_bad_task out of mem_cgroup_scan_tasks and calling it in select_bad_process and dump_tasks. Furthermore, use struct oom_control * directly and remove the useless parameter `ret`.

Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Kang Chen <void0red@hust.edu.cn>
(cherry picked from commit 789038c7)
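For illustration, the hazard looks roughly like this; the callback signature matches the mem_cgroup_scan_tasks() callback type, while the body is a stand-in:

    #include <linux/oom.h>
    #include <linux/printk.h>
    #include <linux/sched.h>

    static int dump_task_cb(struct task_struct *task, void *arg)
    {
        /*
         * Bug pattern: 'arg' is whatever the caller passed in. When the
         * caller hands over the "const char *message", interpreting it
         * as an oom_control reads string bytes as struct fields.
         */
        struct oom_control *oc = arg;

        pr_info("pid %d, oom order %d\n", task->pid, oc->order);
        return 0;
    }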
-
- 13 June 2023, 1 commit
-
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

If the swapped-out memory is large, such as tens of gigabytes, we allocate a large management structure, which may be tens or hundreds of megabytes. So allocating the management structures with kmalloc may fail. Fix this by changing kmalloc to kvzalloc and kfree to kvfree.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
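The pattern applied by the fix, as a sketch (`table` and `nr_entries` are illustrative names):

    #include <linux/mm.h>
    #include <linux/overflow.h>

    /* kvzalloc() falls back to vmalloc() when the request is too large for
     * kmalloc(); kvfree() releases either kind of allocation. */
    table = kvzalloc(array_size(nr_entries, sizeof(*table)), GFP_KERNEL);
    if (!table)
        return -ENOMEM;

    /* ... use the management structure ... */

    kvfree(table);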
-
- 07 June 2023, 6 commits
-
-
Submitted by Liu Shixin
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE
CVE: NA

--------------------------------

The type of pfn is int, which can result in truncation. Change its type to unsigned long to fix the problem.

Fixes: eef7b4fd ("mm/dynamic_hugetlb: use pfn to traverse subpages")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
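For illustration (the loop and helper below are hypothetical): with 4 KiB pages, PFNs at or above 2^31 describe physical addresses at or above 8 TiB, which no longer fit in a signed 32-bit int.

    unsigned long pfn;      /* was: int pfn */

    for (pfn = start_pfn; pfn < start_pfn + nr_subpages; pfn++)
        clear_page_pool_flag(pfn_to_page(pfn));     /* hypothetical helper */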
-
Submitted by Liu Shixin
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE
CVE: NA

--------------------------------

Before discarding the bad page, set the PagePool flag to distinguish it from a free page, and increase used_pages to guarantee used + freed = total.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6MH03
CVE: NA

--------------------------------

When memory is fragmented, update_reserve_pages() may call migrate_pages() to collect contiguous memory. This function can sleep, so we should use a mutex instead of a spinlock. Use KABI_EXTEND to fix the KABI breakage.

Fixes: 0c06a1c0 ("mm/dynamic_hugetlb: add interface to configure the count of hugepages")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE
CVE: NA

--------------------------------

Memory hotplug and memory failure dissolve freed hugepages back to the buddy system, which is not the expected behavior for dynamic hugetlb. Skip the dissolve operation for hugepages belonging to dynamic hugetlb. For memory hotplug, the hotplug operation is not allowed if a dhugetlb pool exists. For memory failure, the hugepage is discarded directly.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE
CVE: NA

--------------------------------

To support dynamic hugetlb on arm64, two more things are needed. The first is to fix the KABI breakage in mem_cgroup; kabi_reserve_5 was used for that in the previous patch. The second is to reject cont-bit hugetlb, since this feature only supports PMD-size and PUD-size hugepages. The feature also only supports a 4KB page size, not 16KB or 64KB.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Shakeel Butt
mainline inclusion from mainline-v6.1-rc1 commit cfdab60b category: perf bugzilla: https://gitee.com/openeuler/kernel/issues/I7BHGR CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cfdab60bfa66b2dc0391c9e405b8af6039924cd4 -------------------------------- Patch series "memcg: optimize charge codepath", v2. Recently Linux networking stack has moved from a very old per socket pre-charge caching to per-cpu caching to avoid pre-charge fragmentation and unwarranted OOMs. One impact of this change is that for network traffic workloads, memcg charging codepath can become a bottleneck. The kernel test robot has also reported this regression[1]. This patch series tries to improve the memcg charging for such workloads. This patch series implement three optimizations: (A) Reduce atomic ops in page counter update path. (B) Change layout of struct page_counter to eliminate false sharing between usage and high. (C) Increase the memcg charge batch to 64. To evaluate the impact of these optimizations, on a 72 CPUs machine, we ran the following workload in root memcg and then compared with scenario where the workload is run in a three level of cgroup hierarchy with top level having min and low setup appropriately. $ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): 1. root memcg 21694.8 Mbps 2. 6.0-rc1 10482.7 Mbps (-51.6%) 3. 6.0-rc1 + (A) 14542.5 Mbps (-32.9%) 4. 6.0-rc1 + (B) 12413.7 Mbps (-42.7%) 5. 6.0-rc1 + (C) 17063.7 Mbps (-21.3%) 6. 6.0-rc1 + (A+B+C) 20120.3 Mbps (-7.2%) With all three optimizations, the memcg overhead of this workload has been reduced from 51.6% to just 7.2%. [1] https://lore.kernel.org/linux-mm/20220619150456.GB34471@xsang-OptiPlex-9020/ This patch (of 3): For cgroups using low or min protections, the function propagate_protected_usage() was doing an atomic xchg() operation irrespectively. We can optimize out this atomic operation for one specific scenario where the workload is using the protection (i.e. min > 0) and the usage is above the protection (i.e. usage > min). This scenario is actually very common where the users want a part of their workload to be protected against the external reclaim. Though this optimization does introduce a race when the usage is around the protection and concurrent charges and uncharged trip it over or under the protection. In such cases, we might see lower effective protection but the subsequent charge/uncharge will correct it. To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy with top level having min and low setup appropriately to see if this optimization is effective for the mentioned case. 
$ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): Without (6.0-rc1) 10482.7 Mbps With patch 14542.5 Mbps (38.7% improvement) With the patch, the throughput improved by 38.7% Link: https://lkml.kernel.org/r/20220825000506.239406-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20220825000506.239406-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Reported-by: Nkernel test robot <oliver.sang@intel.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Reviewed-by: NFeng Tang <feng.tang@intel.com> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oliver Sang <oliver.sang@intel.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
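A sketch of the first optimization for the "min" counter (the "low" counter is handled the same way); this is close to, but not guaranteed to be verbatim, the upstream hunk:

    #include <linux/page_counter.h>

    static void propagate_protected_usage(struct page_counter *c,
                                          unsigned long usage)
    {
        unsigned long protected, old_protected;
        long delta;

        if (!c->parent)
            return;

        protected = min(usage, READ_ONCE(c->min));
        old_protected = atomic_long_read(&c->min_usage);
        /* Skip the atomic xchg entirely when nothing would change. */
        if (protected != old_protected) {
            old_protected = atomic_long_xchg(&c->min_usage, protected);
            delta = protected - old_protected;
            if (delta)
                atomic_long_add(delta, &c->parent->children_min_usage);
        }
    }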
-
- 30 May 2023, 7 commits
-
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

When the pte is none and old_pte is not NULL, _old_pte is uninitialized and will be passed out of uswap_unmap_anon_page(). To fix this, add a return value to uswap_unmap_anon_page to indicate whether the pte is none before unmapping.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Add a VM_WRITE check for the swap-out buffer. If the swap-out buffer VMA contains VM_WRITE, the PTE should be marked as writable.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Add a VMA check for uswap registration to make sure that the swap-in VA is of the same type as the swap-out VA.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

We call follow_page() with FOLL_DUMP to handle ZERO_PAGE. Although FOLL_DUMP is intended for get_dump_page(), it just so happens that its special treatment of the ZERO_PAGE (returning an error instead of doing get_page) suits uswap very well. If somehow an abnormal page has sneaked into the range, we won't oops here.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
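A minimal sketch of the check described above; the error handling is illustrative:

    /* With FOLL_DUMP, follow_page() returns an error pointer for the zero
     * page instead of taking a reference on it. */
    page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
    if (IS_ERR_OR_NULL(page))
        return -EFAULT;     /* zero page or unexpected page state: bail out */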
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Add a page_count() check for the swap-out VA to make sure that no other kernel mechanism is using the physical page. Call lru_add_drain_all() before swap-out to correct the page_count() of swap-out pages.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Add a VMA check for the swap-in and swap-out buffers to make sure that they are of the same type as the swap-out VA.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

When the swap-in buffer contains no physical pages, the errno in mfill_atomic_pte_nocopy() will be ENOENT. A BUG_ON will trigger because the userswap feature does not use struct page *page and page is set to NULL. To fix this issue, change the errno from ENOENT to EINVAL.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
- 19 May 2023, 4 commits
-
-
Submitted by Nanyong Sun
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
CVE: NA

----------------------------------------------------------------------

Add the control file "memory.ksm" to enable KSM per cgroup. Writing 1 sets all tasks currently in the cgroup to KSM merge-any mode, which means KSM gets enabled for all VMAs of a process. Writing 0 disables KSM for them and unmerges the merged pages. Reading the file shows the above state and the KSM-related profits of this cgroup.

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
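A hedged sketch of the write-side semantics; the css task iterator is the real cgroup API, while the KSM enable/disable helpers stand in for whatever this tree provides:

    #include <linux/cgroup.h>
    #include <linux/ksm.h>
    #include <linux/sched/mm.h>

    static int memory_ksm_apply(struct cgroup_subsys_state *css, bool enable)
    {
        struct css_task_iter it;
        struct task_struct *task;
        int ret = 0;

        css_task_iter_start(css, 0, &it);
        while (!ret && (task = css_task_iter_next(&it))) {
            struct mm_struct *mm = get_task_mm(task);

            if (!mm)
                continue;
            /* Stand-in helpers for per-mm merge-any enable/disable. */
            ret = enable ? ksm_enable_merge_any(mm)
                         : ksm_disable_merge_any(mm);
            mmput(mm);
        }
        css_task_iter_end(&it);

        return ret;
    }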
-
Submitted by David Hildenbrand
mainline inclusion from mainline-v6.4-rc1 commit 24139c07 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24139c07f413ef4b555482c758343d71392a19bc ---------------------------------------------------------------------- Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup disabling KSM", v2. (1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE does, (2) add a selftest for it and (3) factor out disabling of KSM from s390/gmap code. This patch (of 3): Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear the VM_MERGEABLE flag from all VMAs -- just like KSM would. Of course, only do that if we previously set PR_SET_MEMORY_MERGE=1. Link: https://lkml.kernel.org/r/20230422205420.30372-1-david@redhat.com Link: https://lkml.kernel.org/r/20230422205420.30372-2-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NStefan Roesch <shr@devkernel.io> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: mm/ksm.c Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by Stefan Roesch
mainline inclusion from mainline-v6.4-rc1 commit d21077fb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21077fbc2fc987c2e593c34dc3b4d84e546dc9f ---------------------------------------------------------------------- This adds the general_profit KSM sysfs knob and the process profit metric knobs to ksm_stat. 1) expose general_profit metric The documentation mentions a general profit metric, however this metric is not calculated. In addition the formula depends on the size of internal structures, which makes it more difficult for an administrator to make the calculation. Adding the metric for a better user experience. 2) document general_profit sysfs knob 3) calculate ksm process profit metric The ksm documentation mentions the process profit metric and how to calculate it. This adds the calculation of the metric. 4) mm: expose ksm process profit metric in ksm_stat This exposes the ksm process profit metric in /proc/<pid>/ksm_stat. The documentation mentions the formula for the ksm process profit metric, however it does not calculate it. In addition the formula depends on the size of internal structures. So it makes sense to expose it. 5) document new procfs ksm knobs Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.ioSigned-off-by: NStefan Roesch <shr@devkernel.io> Reviewed-by: NBagas Sanjaya <bagasdotme@gmail.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by Stefan Roesch
mainline inclusion from mainline-v6.4-rc1 commit d7597f59 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7597f59d1d33e9efbffa7060deb9ee5bd119e62 ---------------------------------------------------------------------- Patch series "mm: process/cgroup ksm support", v9. So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads. Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis. Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls. Security concerns: In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable. Performance: Experiments with using UKSM have shown a capacity increase of around 20%. Here are the metrics from an instagram workload (taken from a machine with 64GB main memory): full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0 After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM. Detailed changes: 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 3. 
Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm. 4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat. 5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes. This patch (of 3): So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 1) Introduce new MMF_VM_MERGE_ANY flag This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process. 2) Setting VM_MERGEABLE on VMA creation When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA. 3) support disabling of ksm for a process This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl. 4) add new prctl option to get and set ksm for a process This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process. 3. Disabling MMF_VM_MERGE_ANY for storage keys in s390 In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled. Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.ioSigned-off-by: NStefan Roesch <shr@devkernel.io> Acked-by: NDavid Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: kernel/sys.c mm/ksm.c mm/mmap.c Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
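A small userspace example of the new prctl options described above; the constant values are the ones merged upstream and should be checked against this tree's uapi headers:

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_MEMORY_MERGE
    #define PR_SET_MEMORY_MERGE 67  /* upstream value; verify locally */
    #endif
    #ifndef PR_GET_MEMORY_MERGE
    #define PR_GET_MEMORY_MERGE 68
    #endif

    int main(void)
    {
        /* Opt the whole process (and future children) into KSM merging. */
        if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
            perror("PR_SET_MEMORY_MERGE");

        printf("ksm merge-any enabled: %d\n",
               prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
        return 0;
    }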
-