- 26 June 2023 (3 commits)
-
-
Submitted by Joao Martins
mainline inclusion
from mainline-v6.2-rc1
commit 11aad263
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6SROX
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11aad2631bf74b3c811dee76154702aab855a323

--------------------------------

Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed back to page allocator is as following: for a 2M hugetlb page it will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially, that means that it breaks the first 4K of a potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For this reason the memory that it's free back to page allocator cannot be used for hugetlb to allocate huge pages of the same size, but rather only of a smaller huge page size:

Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node having 64G):

* Before allocation:

  Free pages count per migrate type at order     0      1      2      3      4      5      6      7      8      9     10
  ...
  Node 0, zone Normal, type Movable            340    100     32     15      1      2      0      0      0      1  15558

  $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  31987

* After:

  Node 0, zone Normal, type Movable          30893  32006  31515      7      0      0      0      0      0      0      0

Notice how the memory freed back are put back into 4K / 8K / 16K page pools. And it allocates a total of 31987 pages (63974M).

To fix this behaviour rather than remapping second vmemmap page (thus breaking the contiguous block of memory backing the struct pages) repopulate the first vmemmap page with a new one. We allocate and copy from the currently mapped vmemmap page, and then remap it later on. The same algorithm works if there's a pre initialized walk::reuse_page and the head page doesn't need to be skipped and instead we remap it when the @addr being changed is the @reuse_addr.

The new head page is allocated in vmemmap_remap_free() given that on restore there's no need for functional change. Note that, because right now one hugepage is remapped at a time, thus only one free 4K page at a time is needed to remap the head page. Should it fail to allocate said new page, it reuses the one that's already mapped just like before. As a result, for every 64G of contiguous hugepages it can give back 1G more of contiguous memory per 64G, while needing in total 128M new 4K pages (for 2M hugetlb) or 256k (for 1G hugetlb).

After the changes, try to assign a 64G node to hugetlb (on a 128G 2node guest, each node with 64G):

* Before allocation:

  Free pages count per migrate type at order     0      1      2      3      4      5      6      7      8      9     10
  ...
  Node 0, zone Normal, type Movable              1      1      1      0      0      1      0      0      1      1  15564

  $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  32394

* After:

  Node 0, zone Normal, type Movable              0     50     97    108     96     81     70     46     18      0      0

In the example above, 407 more hugetlb 2M pages are allocated i.e. 814M out of the 32394 (64788M) allocated. So the memory freed back is indeed being used back in hugetlb and there's no massive order-0..order-2 pages accumulated unused.

[joao.m.martins@oracle.com: v3]
  Link: https://lkml.kernel.org/r/20221109200623.96867-1-joao.m.martins@oracle.com
[joao.m.martins@oracle.com: add smp_wmb() to ensure page contents are visible prior to PTE write]
  Link: https://lkml.kernel.org/r/20221110121214.6297-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20221107153922.77094-1-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
    mm/hugetlb_vmemmap.c

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
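For reference, the core of the idea (copy the currently mapped head vmemmap page into a freshly allocated page, make the copy visible, then install the new PTE) can be pictured with a short kernel-style sketch. This is an illustration only, not the actual mm/hugetlb_vmemmap.c code; the function name is hypothetical and TLB maintenance is omitted. Only alloc_pages(), page_to_virt(), copy_page(), smp_wmb(), mk_pte() and set_pte_at() are real kernel interfaces.

    /*
     * Illustrative sketch (not the real implementation): remap the head
     * vmemmap page to a newly allocated copy instead of reusing it in
     * place, so the original contiguous block can be freed as a whole.
     */
    static int remap_head_vmemmap_page(unsigned long addr, pte_t *ptep)
    {
            struct page *old = pte_page(*ptep);
            struct page *new = alloc_pages(GFP_KERNEL | __GFP_NORETRY, 0);

            if (!new)
                    return -ENOMEM; /* fall back to reusing the mapped page */

            copy_page(page_to_virt(new), page_to_virt(old));
            /* Ensure the copied contents are visible before the new PTE is. */
            smp_wmb();
            set_pte_at(&init_mm, addr, ptep, mk_pte(new, PAGE_KERNEL));
            return 0;
    }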
-
Submitted by Peng Liu
mainline inclusion
from mainline-v5.19-rc1
commit f87442f4
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OWV4
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f87442f407af80dac4dc81c8a7772b71b36b2e09

--------------------------------

Hugepages can be specified to pernode since "hugetlbfs: extend the definition of hugepages parameter to support node allocation", but the following problem is observed.

Confusing behavior is observed when both 1G and 2M hugepage is set after "numa=off".

cmdline hugepage settings:
  hugepagesz=1G hugepages=0:3,1:3
  hugepagesz=2M hugepages=0:1024,1:1024

results:
  HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
  HugeTLB registered 2.00 MiB page size, pre-allocated 1024 pages

Furthermore, confusing behavior can also be observed when an invalid node follows a valid node.

To fix this, never allocate any typical hugepage when an invalid parameter is received.

Link: https://lkml.kernel.org/r/20220413032915.251254-3-liupeng256@huawei.com
Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Yuntao <liuyuntao10@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Peng Liu
mainline inclusion
from mainline-v5.19-rc1
commit 0a7a0f6f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OWV4
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a7a0f6f7f3679c906fc55e3805c1d5e2c566f55

--------------------------------

Patch series "hugetlb: Fix some incorrect behavior", v3.

This series fix three bugs of hugetlb:
1) Invalid use of nr_online_nodes;
2) Inconsistency between 1G hugepage and 2M hugepage;
3) Useless information in dmesg.

This patch (of 4):

Certain systems are designed to have sparse/discontiguous nodes. In this case, nr_online_nodes can not be used to walk through numa node. Also, a valid node may be greater than nr_online_nodes.

However, in hugetlb, it is assumed that nodes are contiguous. For sparse/discontiguous nodes, the current code may treat a valid node as invalid, and will fail to allocate all hugepages on a valid node that "nid >= nr_online_nodes".

As David suggested:

  if (tmp >= nr_online_nodes)
          goto invalid;

Just imagine node 0 and node 2 are online, and node 1 is offline. Assuming that "node < 2" is valid is wrong.

Recheck all the places that use nr_online_nodes, and repair them one by one.

[liupeng256@huawei.com: v4]
  Link: https://lkml.kernel.org/r/20220416103526.3287348-1-liupeng256@huawei.com
Link: https://lkml.kernel.org/r/20220413032915.251254-1-liupeng256@huawei.com
Link: https://lkml.kernel.org/r/20220413032915.251254-2-liupeng256@huawei.com
Fixes: 4178158e ("hugetlbfs: fix issue of preallocation of gigantic pages can't work")
Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
Fixes: e79ce983 ("hugetlbfs: fix a truncation issue in hugepages parameter")
Fixes: f9317f77 ("hugetlb: clean up potential spectre issue warnings")
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Liu Yuntao <liuyuntao10@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
    mm/hugetlb.c

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
- 25 June 2023 (8 commits)
-
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- Since we support per-memcg swapfile control, we need per-type slot cache to optimize performance. To reduce memory waste, allocate per-type slot cache when enable feature or online the corresponding swap device. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- With memory.swapfile interface, the avail swap device can be limit for memcg. The acceptable parameters are 'all', 'none' and valid swap device. Usage: echo /dev/zram0 > memory.swapfile If the swap device is offline, the swapfile will be fallback to 'none'. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- The memsw can't limit the usage of swap space. Add memory.swap.max interface to limit the difference value of memsw.usage and memory.usage. Since a page may occupy both swap entry and a swap cache page, this value is not exactly equal to swap.usage. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
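A minimal user-space sketch of driving the new file is below. The cgroup path ("/sys/fs/cgroup/memory/test") and the accepted value format ("1G") are assumptions made for illustration; they depend on where the memcg controller is mounted, the cgroup name, and how this interface parses its input.

    /* Sketch: write a swap limit into the per-memcg memory.swap.max file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /* Path and value format are assumptions for this example. */
            const char *path = "/sys/fs/cgroup/memory/test/memory.swap.max";
            const char *limit = "1G";
            int fd = open(path, O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (write(fd, limit, strlen(limit)) < 0)
                    perror("write");
            close(fd);
            return 0;
    }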
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- Add a new per-memcg swapin interface to load data into memory in advance to improve access efficiency. Usage: # echo 0 > memory.force_swapin Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- Introduce memcg swap qos including subsequent sub-features. Add CONFIG_MEMCG_SWAP_QOS and static key memcg_swap_qos_key. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- Add anon/file to memory.reclaim interface to limit only reclaim one type pages. The lru algorithm can reclaim cold pages and balance between file and anon. But it didn't consider the speed of backend device. For example, if there is zram device, reclaim anon pages might has less impact on performance. So extend memory.reclaim interface to reclaim one type pages. Usage: "echo <size> type=anon > memory.reclaim" "echo <size> type=file > memory.reclaim" Also compatible with the previous format. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Yosry Ahmed
mainline inclusion from mainline-v6.0-rc1 commit 73b73bac category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=73b73bac90d97400e29e585c678c4d0ebfd2680d -------------------------------- memory.reclaim is a cgroup v2 interface that allows users to proactively reclaim memory from a memcg, without real memory pressure. Reclaim operations invoke vmpressure, which is used: (a) To notify userspace of reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being under memory pressure for networking (see mem_cgroup_under_socket_pressure()). For (a), vmpressure notifications in v1 are not affected by this change since memory.reclaim is a v2 feature. For (b), the effects of the vmpressure signal (according to Shakeel [1]) are as follows: 1. Reducing send and receive buffers of the current socket. 2. May drop packets on the rx path. 3. May throttle current thread on the tx path. Since proactive reclaim is invoked directly by userspace, not by memory pressure, it makes sense not to throttle networking. Hence, this change makes sure that proactive reclaim caused by memory.reclaim does not trigger vmpressure. [1] https://lore.kernel.org/lkml/CALvZod68WdrXEmBpOkadhB5GPYmCXaDZzXH=yyGOCAjFRn4NDQ@mail.gmail.com/ [yosryahmed@google.com: update documentation] Link: https://lkml.kernel.org/r/20220721173015.2643248-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220714064918.2576464-1-yosryahmed@google.comSigned-off-by: NYosry Ahmed <yosryahmed@google.com> Acked-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
Submitted by David Hildenbrand
mainline inclusion from mainline-v5.11-rc1 commit 8dc4bb58 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7F3HQ CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8dc4bb58a146655eb057247d7c9d19e73928715b -------------------------------- virtio-mem soon wants to use offline_and_remove_memory() memory that exceeds a single Linux memory block (memory_block_size_bytes()). Let's remove that restriction. Let's remember the old state and try to restore that if anything goes wrong. While re-onlining can, in general, fail, it's highly unlikely to happen (usually only when a notifier fails to allocate memory, and these are rather rare). This will be used by virtio-mem to offline+remove memory ranges that are bigger than a single memory block - for example, with a device block size of 1 GiB (e.g., gigantic pages in the hypervisor) and a Linux memory block size of 128MB. While we could compress the state into 2 bit, using 8 bit is much easier. This handling is similar, but different to acpi_scan_try_to_offline(): a) We don't try to offline twice. I am not sure if this CONFIG_MEMCG optimization is still relevant - it should only apply to ZONE_NORMAL (where we have no guarantees). If relevant, we can always add it. b) acpi_scan_try_to_offline() simply onlines all memory in case something goes wrong. It doesn't restore previous online type. Let's do that, so we won't overwrite what e.g., user space configured. Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NDavid Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20201112133815.13332-28-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
-
- 21 June 2023 (1 commit)
-
-
Submitted by Ma Wupeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I77BDW CVE: NA -------------------------------- During copy_present_pte, rss counter is increased but the corresponding reliable page counter is not updated. This will lead to reliable page counter mismatch. Fix this by adding reliable page counter. Fixes: d81e9624 ("proc: Count reliable memory usage of reliable tasks") Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NNanyong Sun <sunnanyong@huawei.com>
-
- 16 June 2023 (1 commit)
-
-
Submitted by Kang Chen
hulk inclusion category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NYW4 CVE: NA -------------------------------- raw call flow: oom_kill_process -> mem_cgroup_scan_tasks(.., .., message) -> memcg_print_bad_task(message, ..) message is "const char*" type, and incorrectly cast to "oom_control*" type in memcg_print_bad_task. Fix it by moving memcg_print_bad_task out of mem_cgroup_scan_tasks and call it in select_bad_process and dump_tasks. Furthermore, use struct oom_control* directly and remove the useless parm `ret`. Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NKang Chen <void0red@hust.edu.cn> (cherry picked from commit 789038c7)
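A reduced, self-contained sketch of the bug pattern and the shape of the fix is below. The struct and helper names are simplified stand-ins for illustration, not the real memcontrol/oom-killer code.

    /*
     * Illustration: a callback that receives an opaque "void *arg" must not
     * assume its type.  The buggy path handed a "const char *" message
     * through the same argument that the callback cast to
     * "struct oom_control *"; the fix is to pass the correctly typed
     * pointer directly.
     */
    #include <stdio.h>

    struct oom_control {
            unsigned long totalpages;       /* simplified stand-in */
    };

    /* Buggy shape: arg may actually point at a message string. */
    static void print_bad_task_buggy(void *arg)
    {
            struct oom_control *oc = arg;   /* wrong: arg can be a string */
            printf("totalpages=%lu\n", oc->totalpages);
    }

    /* Fixed shape: the caller passes a real struct oom_control *. */
    static void print_bad_task_fixed(struct oom_control *oc)
    {
            printf("totalpages=%lu\n", oc->totalpages);
    }

    int main(void)
    {
            struct oom_control oc = { .totalpages = 1024 };

            print_bad_task_fixed(&oc);
            /* print_bad_task_buggy("out of memory") would misread the string. */
            return 0;
    }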
-
- 13 June 2023 (1 commit)
-
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- If the swapped-out memory is large, such as tens of gigabytes, we will allocate a large management structure, which may be tens of megabytes or hundreds of megabytes. So if we use kmalloc to allocate management structures it may fail. Fix this by changing kmalloc to kvzalloc and kfree to kvfree. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
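The change follows the usual kernel pattern for allocations that may be too large for kmalloc(). A sketch with placeholder names (not the actual userswap data structures) is below.

    /*
     * Kernel-style sketch: for a management table whose size scales with the
     * amount of swapped-out memory, kmalloc() can fail on large physically
     * contiguous requests, while kvzalloc() falls back to vmalloc.
     * The element type and function names are placeholders.
     */
    struct swapped_page_info {
            unsigned long pfn;      /* placeholder fields */
            unsigned int flags;
    };

    static struct swapped_page_info *alloc_page_info(unsigned long nr_pages)
    {
            /* Before the fix: kmalloc(nr_pages * sizeof(...), GFP_KERNEL). */
            return kvzalloc(array_size(nr_pages, sizeof(struct swapped_page_info)),
                            GFP_KERNEL);
    }

    static void free_page_info(struct swapped_page_info *info)
    {
            kvfree(info);   /* handles both kmalloc'ed and vmalloc'ed memory */
    }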
-
- 07 June 2023 (6 commits)
-
-
Submitted by Liu Shixin
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE CVE: NA -------------------------------- The type of pfn is int, which can result in truncation. Change its type to unsigned long to fix the problem. Fixes: eef7b4fd ("mm/dynamic_hugetlb: use pfn to traverse subpages") Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
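A generic illustration of the truncation (not the dynamic_hugetlb code itself): with 4K pages, any physical address at or above 8TB yields a pfn larger than INT_MAX, so an int pfn overflows.

    #include <stdio.h>

    #define EXAMPLE_PAGE_SHIFT 12   /* 4K pages; illustrative constant */

    int main(void)
    {
            unsigned long phys = 0x80000000000UL;            /* an 8TB physical address */
            int bad_pfn = phys >> EXAMPLE_PAGE_SHIFT;        /* 0x80000000 > INT_MAX: overflows */
            unsigned long pfn = phys >> EXAMPLE_PAGE_SHIFT;  /* correct value */

            printf("int pfn = %d, unsigned long pfn = %lu\n", bad_pfn, pfn);
            return 0;
    }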
-
Submitted by Liu Shixin
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE CVE: NA -------------------------------- Before discard the bad page, set PagePool flag to distinguish from free page. And increase used_pages to guarantee used + freed = total. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6MH03 CVE: NA -------------------------------- When memory is fragmented, update_reserve_pages() may call migrate_pages() to collect continuous memory. This function can sleep, so we should use mutex lock instead of spin lock. Use KABI_EXTEND to fix kabi broken. Fixes: 0c06a1c0 ("mm/dynamic_hugetlb: add interface to configure the count of hugepages") Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
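A reduced sketch of the locking change, with placeholder names rather than the real dynamic_hugetlb structures; only the general "may sleep under lock" rule is illustrated.

    /*
     * Kernel-style sketch: migrate_pages() can sleep, so the section that
     * may call it must be protected by a mutex rather than a spinlock.
     * Sleeping with a spinlock held is a bug ("scheduling while atomic").
     */
    static DEFINE_MUTEX(example_pool_lock);

    static void example_update_reserve_pages(void)
    {
            mutex_lock(&example_pool_lock);   /* was: spin_lock(&pool->lock) */
            /*
             * ... gather fragmented memory; this path may call
             * migrate_pages(), which can sleep while isolating and
             * copying pages ...
             */
            mutex_unlock(&example_pool_lock);
    }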
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE CVE: NA -------------------------------- The memory hotplug and memory failure will dissolve freed hugepages to buddy system, this is not the expected behavior for dynamic hugetlb. Skip the dissolve operation for hugepages belonging to dynamic hugetlb. For memory hotplug, the hotplug operation is not allowed, if dhugetlb pool existed. For memory failure, the hugepage will be discard directly. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE CVE: NA -------------------------------- To support dynamic hugetlb on arm64, we need to do two more things. The first one is to fix kabi broken in mem_cgroup, we use kabi_reserve_5 to fix it in previous patch. The second one is to check cont-bit hugetlb since this feature only support for PMD-size and PUD-size hugepage. This feature only support for 4KB pagesize, not support for 16KB and 64KB. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Shakeel Butt
mainline inclusion
from mainline-v6.1-rc1
commit cfdab60b
category: perf
bugzilla: https://gitee.com/openeuler/kernel/issues/I7BHGR
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cfdab60bfa66b2dc0391c9e405b8af6039924cd4

--------------------------------

Patch series "memcg: optimize charge codepath", v2.

Recently Linux networking stack has moved from a very old per socket pre-charge caching to per-cpu caching to avoid pre-charge fragmentation and unwarranted OOMs. One impact of this change is that for network traffic workloads, memcg charging codepath can become a bottleneck. The kernel test robot has also reported this regression[1]. This patch series tries to improve the memcg charging for such workloads.

This patch series implement three optimizations:
(A) Reduce atomic ops in page counter update path.
(B) Change layout of struct page_counter to eliminate false sharing between usage and high.
(C) Increase the memcg charge batch to 64.

To evaluate the impact of these optimizations, on a 72 CPUs machine, we ran the following workload in root memcg and then compared with scenario where the workload is run in a three level of cgroup hierarchy with top level having min and low setup appropriately.

  $ netserver -6
  # 36 instances of netperf with following params
  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
  1. root memcg          21694.8 Mbps
  2. 6.0-rc1             10482.7 Mbps (-51.6%)
  3. 6.0-rc1 + (A)       14542.5 Mbps (-32.9%)
  4. 6.0-rc1 + (B)       12413.7 Mbps (-42.7%)
  5. 6.0-rc1 + (C)       17063.7 Mbps (-21.3%)
  6. 6.0-rc1 + (A+B+C)   20120.3 Mbps (-7.2%)

With all three optimizations, the memcg overhead of this workload has been reduced from 51.6% to just 7.2%.

[1] https://lore.kernel.org/linux-mm/20220619150456.GB34471@xsang-OptiPlex-9020/

This patch (of 3):

For cgroups using low or min protections, the function propagate_protected_usage() was doing an atomic xchg() operation irrespectively. We can optimize out this atomic operation for one specific scenario where the workload is using the protection (i.e. min > 0) and the usage is above the protection (i.e. usage > min).

This scenario is actually very common where the users want a part of their workload to be protected against the external reclaim. Though this optimization does introduce a race when the usage is around the protection and concurrent charges and uncharged trip it over or under the protection. In such cases, we might see lower effective protection but the subsequent charge/uncharge will correct it.

To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy with top level having min and low setup appropriately to see if this optimization is effective for the mentioned case.

  $ netserver -6
  # 36 instances of netperf with following params
  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
  Without (6.0-rc1)   10482.7 Mbps
  With patch          14542.5 Mbps (38.7% improvement)

With the patch, the throughput improved by 38.7%.

Link: https://lkml.kernel.org/r/20220825000506.239406-1-shakeelb@google.com
Link: https://lkml.kernel.org/r/20220825000506.239406-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oliver Sang <oliver.sang@intel.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
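Optimization (B) above relies on cache-line placement. A generic kernel-style sketch of the technique (not the actual struct page_counter layout) is:

    /*
     * Sketch: keep the hot, frequently written counter on its own cache
     * line so that readers of the rarely written limits do not suffer
     * false sharing.  Field and struct names are illustrative only.
     */
    struct example_counter {
            /* hot: written on every charge/uncharge */
            atomic_long_t usage;

            /* read-mostly limits, pushed onto a separate cache line */
            unsigned long high ____cacheline_aligned_in_smp;
            unsigned long max;
    };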
-
- 30 May 2023 (7 commits)
-
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- When the pte is none and old_pte is not NULL, _old_pte is uninitialized and will be passed out of uswap_unmap_anon_page(). To fix this, add a return value to uswap_unmap_anon_page to indicate whether the pte is none before unmapping. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add a VM_WRITE check for the swap-out buffer. If the swap-out buffer VMA contains VM_WRITE, the PTE should be marked as writable. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add VMA check for uswap registration to make sure that swap-in VA is of the same type as swap-out VA. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- We call follow_page() with FOLL_DUMP to handle ZERO_PAGE. Although FOLL_DUMP is intended for get_dump_page(), it just so happens that its special treatment of the ZERO_PAGE (returning an error instead of doing get_page) suits uswap very well. If somehow an abnormal page has sneaked into the range, we won't oops here. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
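A kernel-style sketch of the described check follows. It is not the actual userswap code; in particular, combining FOLL_GET with FOLL_DUMP here is an assumption of this sketch.

    /*
     * Sketch: follow_page() with FOLL_DUMP returns an error pointer for the
     * zero page instead of taking a reference, so NULL and error pointers
     * can both be rejected in one place.
     */
    static struct page *uswap_get_page_sketch(struct vm_area_struct *vma,
                                              unsigned long addr)
    {
            struct page *page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);

            if (IS_ERR_OR_NULL(page))
                    return NULL;    /* zero page or otherwise abnormal mapping */
            return page;
    }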
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add page_count() check for swap-out VA to make sure that no other kernel mechanism is using this physical page. Add lru_add_drain_all() before swap-out VA to correct page_count() of swap-out pages. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add VMA check for swap-in and swap-out buffer to make sure that swap-in and swap-out buffer is of the same type as swap-out VA. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- When the swap-in buffer contains no physical pages, the errno in mfill_atomic_pte_nocopy() will be ENOENT. A BUG_ON error will occur because the userswap feature does not use the struct page *page and page is set to NULL. To fix this issue, the errno should be changed from ENOENT to EINVAL. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
- 19 May 2023 (6 commits)
-
-
Submitted by Nanyong Sun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA ---------------------------------------------------------------------- Add control file "memory.ksm" to enable ksm per cgroup. Echo to 1 will set all tasks currently in the cgroup to ksm merge any mode, which means ksm gets enabled for all vma's of a process. Meanwhile echo to 0 will disable ksm for them and unmerge the merged pages. Cat the file will show the above state and ksm related profits of this cgroup. Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from mainline-v6.4-rc1 commit 24139c07 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24139c07f413ef4b555482c758343d71392a19bc ---------------------------------------------------------------------- Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup disabling KSM", v2. (1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE does, (2) add a selftest for it and (3) factor out disabling of KSM from s390/gmap code. This patch (of 3): Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear the VM_MERGEABLE flag from all VMAs -- just like KSM would. Of course, only do that if we previously set PR_SET_MEMORY_MERGE=1. Link: https://lkml.kernel.org/r/20230422205420.30372-1-david@redhat.com Link: https://lkml.kernel.org/r/20230422205420.30372-2-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NStefan Roesch <shr@devkernel.io> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: mm/ksm.c Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by Stefan Roesch
mainline inclusion from mainline-v6.4-rc1 commit d21077fb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21077fbc2fc987c2e593c34dc3b4d84e546dc9f ---------------------------------------------------------------------- This adds the general_profit KSM sysfs knob and the process profit metric knobs to ksm_stat. 1) expose general_profit metric The documentation mentions a general profit metric, however this metric is not calculated. In addition the formula depends on the size of internal structures, which makes it more difficult for an administrator to make the calculation. Adding the metric for a better user experience. 2) document general_profit sysfs knob 3) calculate ksm process profit metric The ksm documentation mentions the process profit metric and how to calculate it. This adds the calculation of the metric. 4) mm: expose ksm process profit metric in ksm_stat This exposes the ksm process profit metric in /proc/<pid>/ksm_stat. The documentation mentions the formula for the ksm process profit metric, however it does not calculate it. In addition the formula depends on the size of internal structures. So it makes sense to expose it. 5) document new procfs ksm knobs Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.ioSigned-off-by: NStefan Roesch <shr@devkernel.io> Reviewed-by: NBagas Sanjaya <bagasdotme@gmail.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by Stefan Roesch
mainline inclusion
from mainline-v6.4-rc1
commit d7597f59
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7597f59d1d33e9efbffa7060deb9ee5bd119e62

----------------------------------------------------------------------

Patch series "mm: process/cgroup ksm support", v9.

So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.

Use case 1:
  The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these types of workloads.

Use case 2:
  The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis.

Use case 3:
  With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higher level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls.

Security concerns:
  In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable.

Performance:
  Experiments with using UKSM have shown a capacity increase of around 20%.

  Here are the metrics from an instagram workload (taken from a machine with 64GB main memory):

    full_scans: 445
    general_profit: 20158298048
    max_page_sharing: 256
    merge_across_nodes: 1
    pages_shared: 129547
    pages_sharing: 5119146
    pages_to_scan: 4000
    pages_unshared: 1760924
    pages_volatile: 10761341
    run: 1
    sleep_millisecs: 20
    stable_node_chains: 167
    stable_node_chains_prune_millisecs: 2000
    stable_node_dups: 2751
    use_zero_pages: 0
    zero_pages_sharing: 0

  After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM.

Detailed changes:

1. New options for prctl system command
   This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes.
   With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
   When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's.
   When forking a process that has KSM enabled, the setting will be inherited by the new child process.

3. Add general_profit metric
   The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm.

4. Add more metrics to ksm_stat
   This adds the process profit metric to /proc/<pid>/ksm_stat.

5. Add more tests to ksm_tests and ksm_functional_tests
   This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes.

This patch (of 3):

So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.

1. New options for prctl system command
   This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes.
   With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
   When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's.
   When forking a process that has KSM enabled, the setting will be inherited by the new child process.

   1) Introduce new MMF_VM_MERGE_ANY flag
      This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process.

   2) Setting VM_MERGEABLE on VMA creation
      When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA.

   3) support disabling of ksm for a process
      This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl.

   4) add new prctl option to get and set ksm for a process
      This adds two new options to the prctl system call
      - enable ksm for all vmas of a process (if the vmas support it).
      - query if ksm has been enabled for a process.

3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
   In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled.

Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
    kernel/sys.c
    mm/ksm.c
    mm/mmap.c

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
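From user space the new interface is exercised via prctl(); a minimal example follows. The numeric fallbacks (67/68) match the upstream uapi definitions; a kernel carrying this series is required, and the set operation may also require CAP_SYS_RESOURCE.

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_MEMORY_MERGE
    #define PR_SET_MEMORY_MERGE 67
    #define PR_GET_MEMORY_MERGE 68
    #endif

    int main(void)
    {
            /* Opt the whole process (and future children) into KSM merging. */
            if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0) != 0) {
                    perror("PR_SET_MEMORY_MERGE");
                    return 1;
            }
            printf("KSM merge-any enabled: %d\n",
                   (int)prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
            return 0;
    }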
-
Submitted by xu xin
mainline inclusion from mainline-v6.1-rc1 commit cb4df4ca category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cb4df4cae4f2bd8cf7a32eff81178fce31600f7c ---------------------------------------------------------------------- Patch series "ksm: count allocated rmap_items and update documentation", v5. KSM can save memory by merging identical pages, but also can consume additional memory, because it needs to generate rmap_items to save each scanned page's brief rmap information. To determine how beneficial the ksm-policy (like madvise), they are using brings, so we add a new interface /proc/<pid>/ksm_stat for each process The value "ksm_rmap_items" in it indicates the total allocated ksm rmap_items of this process. The detailed description can be seen in the following patches' commit message. This patch (of 2): KSM can save memory by merging identical pages, but also can consume additional memory, because it needs to generate rmap_items to save each scanned page's brief rmap information. Some of these pages may be merged, but some may not be abled to be merged after being checked several times, which are unprofitable memory consumed. The information about whether KSM save memory or consume memory in system-wide range can be determined by the comprehensive calculation of pages_sharing, pages_shared, pages_unshared and pages_volatile. A simple approximate calculation: profit =~ pages_sharing * sizeof(page) - (all_rmap_items) * sizeof(rmap_item); where all_rmap_items equals to the sum of pages_sharing, pages_shared, pages_unshared and pages_volatile. But we cannot calculate this kind of ksm profit inner single-process wide because the information of ksm rmap_item's number of a process is lacked. For user applications, if this kind of information could be obtained, it helps upper users know how beneficial the ksm-policy (like madvise) they are using brings, and then optimize their app code. For example, one application madvise 1000 pages as MERGEABLE, while only a few pages are really merged, then it's not cost-efficient. So we add a new interface /proc/<pid>/ksm_stat for each process in which the value of ksm_rmap_itmes is only shown now and so more values can be added in future. So similarly, we can calculate the ksm profit approximately for a single process by: profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items * sizeof(rmap_item); where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and ksm_rmap_items is shown in /proc/<pid>/ksm_stat. Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cnSigned-off-by: Nxu xin <xu.xin16@zte.com.cn> Reviewed-by: NXiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: NYang Yang <yang.yang29@zte.com.cn> Signed-off-by: NCGEL ZTE <cgel.zte@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: include/linux/mm_types.h Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by xu xin
mainline inclusion from mainline-v5.19-rc1 commit 76093853 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7609385337a4feb6236e42dcd0df2185683ce839 ---------------------------------------------------------------------- Some applications or containers want to use KSM by calling madvise() to advise areas of address space to be MERGEABLE. But they may not know which applications are more likely to cause real merges in the deployment. If this patch is applied, it helps them know their corresponding number of merged pages, and then optimize their app code. As current KSM only counts the number of KSM merging pages(e.g. ksm_pages_sharing and ksm_pages_shared) of the whole system, we cannot see the more fine-grained KSM merging, for the upper application optimization, the merging area cannot be set easily according to the KSM page merging probability of each process. Therefore, it is necessary to add extra statistical means so that the upper level users can know the detailed KSM merging information of each process. We add a new proc file named as ksm_merging_pages under /proc/<pid>/ to indicate the involved ksm merging pages of this process. [akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s] Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cnSigned-off-by: Nxu xin <xu.xin16@zte.com.cn> Reported-by: Nkernel test robot <lkp@intel.com> Reviewed-by: NYang Yang <yang.yang29@zte.com.cn> Reviewed-by: NRan Xiaokai <ran.xiaokai@zte.com.cn> Reported-by: NZeal Robot <zealci@zte.com.cn> Cc: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Ohhoon Kwon <ohoono.kwon@samsung.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Feng Tang <feng.tang@intel.com> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: include/linux/mm_types.h Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
- 18 May 2023 (7 commits)
-
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add user mode check for swap-out VA to make sure that swap-out VA is user mode address. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Check the VM_READ and VM_WRITE flags of vma->vm_flags to determine whether the read and write permissions of the swap-out page VA are consistent with those of the swap-out buffer VA. If they are inconsistent, the swap operation will fail. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add VMA checks for register address to make sure that register address has the corresponding VMA. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add checks for new_addr in uswap_mremap() and src_addr in uswap_check_copy_mode(), including user mode checks, overlapping checks, etc. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Fix some type and logical bugs, as follows:
1) The type of the index variable is changed from int to unsigned long to support large memory registration.
2) Fix the bug that USWAP_PAGES_DIRTY does not take effect.
3) Take the mmap_read_lock() when using the VMA in uswap_adjust_uffd_range().
4) Do some code refactoring and cleanup.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Split uswap_register() into uswap_register() and uswap_adjust_uffd_range(). Before validate_range(), use uswap_register() to handle uswap mode. After validate_range(), use uswap_adjust_uffd_range() to change address range to VMA range, which could reduce fragmentation caused by VMA splitting. By splitting uswap_register(), we could prevent the userswap registration of invalid input address ranges. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- If old_pte is NULL, *old_pte will result in a null pointer dereference. Fix this by adding a NULL check for old_pte. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
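The shape of the fix, reduced to a generic kernel-style pattern (illustrative only, not the actual uswap code; the helper name is hypothetical):

    /*
     * Callers may legitimately pass old_pte == NULL when they do not care
     * about the previous PTE value, so check before writing through it.
     */
    static void report_old_pte(pte_t val, pte_t *old_pte)
    {
            if (old_pte)            /* before the fix: unconditional *old_pte = val; */
                    *old_pte = val;
    }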
-