- 27 July 2023, 4 commits
-
-
Submitted by Xu Qiang
hulk inclusion
category: other
bugzilla: https://gitee.com/openeuler/kernel/issues/I6GI0X

----------------------------------------------

In sp_update_process_stat(), node_list is traversed, so the traversal must be protected by the corresponding lock.

Signed-off-by: Xu Qiang <xuqiang36@huawei.com>
-
Submitted by Xu Qiang
hulk inclusion
category: other
bugzilla: https://gitee.com/openeuler/kernel/issues/I6GI0X

----------------------------------------------

SPG_FLAG_NON_DVPP is no longer used in downstream systems.

Signed-off-by: Xu Qiang <xuqiang36@huawei.com>
-
Submitted by Xu Qiang
hulk inclusion
category: other
bugzilla: https://gitee.com/openeuler/kernel/issues/I6GI0X

----------------------------------------------

The members of sp_spa_stat are changed to atomic64_t to solve the concurrency problem. spa_stat no longer needs sp_area_lock protection.

Signed-off-by: Xu Qiang <xuqiang36@huawei.com>
-
Submitted by Chen Jun
hulk inclusion
category: feature
bugzilla: N/A

--------------------------------

Support allocating memory from a set of nodes.

mg_sp_alloc allows allocating memory from a single node. If that node does not have enough memory, the caller has to pick the next node and retry, which incurs a lot of overhead. To improve performance, add a new interface that allocates memory from several nodes (see the sketch below).

Signed-off-by: Chen Jun <chenjun102@huawei.com>
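The new share_pool entry point itself is not reproduced in this log; the following is a rough sketch of the idea — one call walking a caller-supplied nodemask instead of per-node retries in the caller — using standard NUMA allocation primitives. The helper name and shape are assumptions, not the actual patch.

    #include <linux/gfp.h>
    #include <linux/nodemask.h>

    /* Illustrative sketch only; not the actual share_pool interface. */
    static struct page *sp_alloc_from_nodes(gfp_t gfp, unsigned int order,
                                            nodemask_t *nodes)
    {
        int nid;

        for_each_node_mask(nid, *nodes) {
            /* __GFP_THISNODE keeps each attempt on the requested node. */
            struct page *page = alloc_pages_node(nid, gfp | __GFP_THISNODE,
                                                 order);
            if (page)
                return page;
        }

        return NULL;    /* no node in the mask had enough memory */
    }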
-
- 25 July 2023, 1 commit
-
-
Submitted by Johannes Weiner
stable inclusion
from stable-v5.10.157
commit d925dd3e444cb7f0fab0208fed82673fd61f9765
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7MU59
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=d925dd3e444cb7f0fab0208fed82673fd61f9765

--------------------------------

commit f53af428 upstream.

During proactive reclaim, we sometimes observe severe overreclaim, with several thousand times more pages reclaimed than requested.

This trace was obtained from shrink_lruvec() during such an instance:

    prio:0 anon_cost:1141521 file_cost:7767
    nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
    nr=[7161123 345 578 1111]

While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it by swapping. These requests take over a minute, during which the write() to memory.reclaim is unkillably stuck inside the kernel.

Digging into the source, this is caused by the proportional reclaim bailout logic. This code tries to resolve a fundamental conflict: to reclaim roughly what was requested, while also aging all LRUs fairly and in accordance with their size, swappiness, refault rates etc. The way it attempts fairness is that once the reclaim goal has been reached, it stops scanning the LRUs with the smaller remaining scan targets, and adjusts the remainder of the bigger LRUs according to how much of the smaller LRUs was scanned. It then finishes scanning that remainder regardless of the reclaim goal.

This works fine if priority levels are low and the LRU lists are comparable in size. However, in this instance, the cgroup that is targeted by proactive reclaim has almost no files left - they've already been squeezed out by proactive reclaim earlier - and the remaining anon pages are hot. Anon rotations cause the priority level to drop to 0, which results in reclaim targeting all of anon (a lot) and all of file (almost nothing). By the time reclaim decides to bail, it has scanned most or all of the file target, and therefore must also scan most or all of the enormous anon target. This target is thousands of times larger than the reclaim goal, thus causing the overreclaim.

The bailout code hasn't changed in years, why is this failing now? The most likely explanations are two other recent changes in anon reclaim:

1. Before the series starting with commit 5df74196 ("mm: fix LRU balancing effect of new transparent huge pages"), the VM was overall relatively reluctant to swap at all, even if swap was configured. This means the LRU balancing code didn't come into play as often as it does now, and mostly in high pressure situations where pronounced swap activity wouldn't be as surprising.

2. For historic reasons, shrink_lruvec() loops on the scan targets of all LRU lists except the active anon one, meaning it would bail if the only remaining pages to scan were active anon - even if there were a lot of them. Before the series starting with commit ccc5dc67 ("mm/vmscan: make active/inactive ratio as 1:1 for anon lru"), most anon pages would live on the active LRU; the inactive one would contain only a handful of preselected reclaim candidates. After the series, anon gets aged similarly to file, and the inactive list is the default for new anon pages as well, making it often the much bigger list.

As a result, the VM is now more likely to actually finish large anon targets than before.

Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the larger LRU lists is made before bailing out on a met reclaim goal.

This fixes the extreme overreclaim problem.

Fairness is more subtle and harder to evaluate. No obvious misbehavior was observed on the test workload, in any case. Conceptually, fairness should primarily be a cumulative effect from regular, lower priority scans. Once the VM is in trouble and needs to escalate scan targets to make forward progress, fairness needs to take a backseat. This is also acknowledged by the myriad exceptions in get_scan_count(). This patch makes fairness decrease gradually, as it keeps fairness work static over increasing priority levels with growing scan targets. This should make more sense - although we may have to re-visit the exact values.

Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: sanglipeng <sanglipeng1@jd.com>
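As a rough illustration of the described change — a simplified sketch of the "one SWAP_CLUSTER_MAX-sized nudge" idea, not the upstream diff:

    #include <linux/mmzone.h>
    #include <linux/swap.h>     /* SWAP_CLUSTER_MAX */

    /*
     * Simplified sketch: once the reclaim goal is met, the remaining scan
     * targets get at most one SWAP_CLUSTER_MAX-sized batch instead of
     * being finished in full.
     */
    static void cap_scan_targets_after_goal(unsigned long nr[NR_LRU_LISTS])
    {
        enum lru_list lru;

        for_each_evictable_lru(lru)
            nr[lru] = min_t(unsigned long, nr[lru], SWAP_CLUSTER_MAX);
    }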
-
- 20 July 2023, 2 commits
-
-
Submitted by Alexander Potapenko
stable inclusion from stable-v5.10.156 commit 294ef12dccc6de01de3322b21a0c235474952b63 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7MCG1 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=294ef12dccc6de01de3322b21a0c235474952b63 -------------------------------- commit 1468c6f4 upstream. Functions implementing the a_ops->write_end() interface accept the `void *fsdata` parameter that is supposed to be initialized by the corresponding a_ops->write_begin() (which accepts `void **fsdata`). However not all a_ops->write_begin() implementations initialize `fsdata` unconditionally, so it may get passed uninitialized to a_ops->write_end(), resulting in undefined behavior. Fix this by initializing fsdata with NULL before the call to write_begin(), rather than doing so in all possible a_ops implementations. This patch covers only the following cases found by running x86 KMSAN under syzkaller: - generic_perform_write() - cont_expand_zero() and generic_cont_expand_simple() - page_symlink() Other cases of passing uninitialized fsdata may persist in the codebase. Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.comSigned-off-by: NAlexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Nsanglipeng <sanglipeng1@jd.com>
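A minimal sketch of the pattern the fix applies in the listed callers, assuming the 5.10-era address_space_operations prototypes (a fragment, not a complete function):

    /* Initialize fsdata before ->write_begin() so that ->write_end() never
     * sees an uninitialized pointer, even if a ->write_begin()
     * implementation does not set it. */
    struct page *page;
    void *fsdata = NULL;    /* previously left uninitialized on the stack */
    int err;

    err = aops->write_begin(file, mapping, pos, len, flags, &page, &fsdata);
    if (err < 0)
        return err;

    /* ... copy `copied` bytes into the page ... */

    err = aops->write_end(file, mapping, pos, len, copied, page, fsdata);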
-
Submitted by Alban Crequy
stable inclusion
from stable-v5.10.156
commit db744288af730abb66312f40b087d1dbf794c5f4
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7MCG1
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=db744288af730abb66312f40b087d1dbf794c5f4

--------------------------------

commit 8678ea06 upstream.

If a page fault occurs while copying the first byte, this function resets one byte before dst. As a consequence, an address could be modified, leading to kernel crashes in case the modified address was accessed later.

Fixes: b58294ea ("maccess: allow architectures to provide kernel probing directly")
Signed-off-by: Alban Crequy <albancrequy@linux.microsoft.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Francis Laniel <flaniel@linux.microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org> [5.8]
Link: https://lore.kernel.org/bpf/20221110085614.111213-2-albancrequy@linux.microsoft.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: sanglipeng <sanglipeng1@jd.com>
-
- 18 July 2023, 1 commit
-
-
Submitted by Pankaj Gupta
stable inclusion
from stable-v5.10.155
commit 0b692d41ee5c88097ecf5dbb37c59083044c996a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7M5F4
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0b692d41ee5c88097ecf5dbb37c59083044c996a

--------------------------------

commit 867400af upstream.

virtio_pmem uses devm_memremap_pages() to map the device memory. By default this memory is mapped as encrypted with SEV. A guest reboot changes the current encryption key, so the guest no longer properly decrypts the FSDAX device metadata.

Mark the corresponding device memory region for FSDAX devices (mapped with memremap_pages) as decrypted to retain the persistent memory property.

Link: https://lkml.kernel.org/r/20221102160728.3184016-1-pankaj.gupta@amd.com
Fixes: b7b3c01b ("mm/memremap_pages: support multiple ranges per invocation")
Signed-off-by: Pankaj Gupta <pankaj.gupta@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: sanglipeng <sanglipeng1@jd.com>
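A hedged sketch of the direction of the fix; the helper name is hypothetical, and the exact upstream hunk adjusts the protection inside memremap_pages() itself:

    #include <linux/memory_hotplug.h>
    #include <linux/memremap.h>
    #include <linux/pgtable.h>

    /* Sketch: FS_DAX device ranges get a decrypted protection so an SEV
     * guest can still read the pmem metadata after a reboot changes the
     * encryption key. */
    static void mark_fsdax_range_decrypted(struct dev_pagemap *pgmap,
                                           struct mhp_params *params)
    {
        if (pgmap->type == MEMORY_DEVICE_FS_DAX)
            params->pgprot = pgprot_decrypted(params->pgprot);
    }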
-
- 10 July 2023, 1 commit
-
-
Submitted by liubo
euleros inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7JI6K
CVE: NA

----------------------------------------------------

In the swapcache recycling process, the number of pages to be reclaimed on each node is obtained as follows:

    nr_to_reclaim[nid_num] = (swapcache_to_reclaim /
                              (swapcache_total_reclaimable / nr[nid_num]));

However, nr[nid_num] is obtained by traversing the number of swapcache pages on each node. If there are multiple nodes in the environment and no swapping has occurred on a node, that node has no swapcache pages and nr[nid_num] may be 0. Therefore, a division-by-zero error may occur (see the sketch below).

Signed-off-by: liubo <liubo254@huawei.com>
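A minimal sketch of such a guard, reusing the variable names from the commit text; the surrounding loop and data structures are assumptions, not the actual openEuler source:

    /* Skip nodes with no reclaimable swapcache instead of dividing by 0. */
    for (nid = 0; nid < nid_num; nid++) {
        if (!nr[nid] || !swapcache_total_reclaimable) {
            nr_to_reclaim[nid] = 0;     /* no swapcache on this node */
            continue;
        }
        nr_to_reclaim[nid] = swapcache_to_reclaim /
                             (swapcache_total_reclaimable / nr[nid]);
    }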
-
- 26 June 2023, 3 commits
-
-
Submitted by Joao Martins
mainline inclusion
from mainline-v6.2-rc1
commit 11aad263
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6SROX
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11aad2631bf74b3c811dee76154702aab855a323

--------------------------------

Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed back to the page allocator is as follows: for a 2M hugetlb page it will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially, that means that it breaks the first 4K of a potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For this reason the memory that is freed back to the page allocator cannot be used for hugetlb to allocate huge pages of the same size, but rather only of a smaller huge page size:

Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node having 64G):

  * Before allocation:

    Free pages count per migrate type at order   0   1   2   3   4   5   6   7   8   9  10
    ...
    Node 0, zone Normal, type Movable          340 100  32  15   1   2   0   0   0   1 15558

    $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    31987

  * After:

    Node 0, zone Normal, type Movable        30893 32006 31515   7   0   0   0   0   0   0   0

Notice how the memory freed back is put back into 4K / 8K / 16K page pools. And it allocates a total of 31987 pages (63974M).

To fix this behaviour, rather than remapping the second vmemmap page (thus breaking the contiguous block of memory backing the struct pages), repopulate the first vmemmap page with a new one. We allocate and copy from the currently mapped vmemmap page, and then remap it later on. The same algorithm works if there's a pre initialized walk::reuse_page and the head page doesn't need to be skipped and instead we remap it when the @addr being changed is the @reuse_addr. The new head page is allocated in vmemmap_remap_free() given that on restore there's no need for functional change.

Note that, because right now one hugepage is remapped at a time, only one free 4K page at a time is needed to remap the head page. Should it fail to allocate said new page, it reuses the one that's already mapped just like before. As a result, for every 64G of contiguous hugepages it can give back 1G more of contiguous memory per 64G, while needing in total 128M new 4K pages (for 2M hugetlb) or 256k (for 1G hugetlb).

After the changes, try to assign a 64G node to hugetlb (on a 128G 2node guest, each node with 64G):

  * Before allocation:

    Free pages count per migrate type at order   0   1   2   3   4   5   6   7   8   9  10
    ...
    Node 0, zone Normal, type Movable            1   1   1   0   0   1   0   0   1   1 15564

    $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    32394

  * After:

    Node 0, zone Normal, type Movable            0  50  97 108  96  81  70  46  18   0   0

In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M out of the 32394 (64788M) allocated. So the memory freed back is indeed being used back in hugetlb and there's no massive amount of order-0..order-2 pages accumulated unused.

[joao.m.martins@oracle.com: v3]
  Link: https://lkml.kernel.org/r/20221109200623.96867-1-joao.m.martins@oracle.com
[joao.m.martins@oracle.com: add smp_wmb() to ensure page contents are visible prior to PTE write]
  Link: https://lkml.kernel.org/r/20221110121214.6297-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20221107153922.77094-1-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
  mm/hugetlb_vmemmap.c
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Peng Liu
mainline inclusion
from mainline-v5.19-rc1
commit f87442f4
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OWV4
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f87442f407af80dac4dc81c8a7772b71b36b2e09

--------------------------------

Hugepages can be specified per node since "hugetlbfs: extend the definition of hugepages parameter to support node allocation", but the following problem is observed.

Confusing behavior is observed when both 1G and 2M hugepages are set after "numa=off".

cmdline hugepage settings:

    hugepagesz=1G hugepages=0:3,1:3
    hugepagesz=2M hugepages=0:1024,1:1024

results:

    HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
    HugeTLB registered 2.00 MiB page size, pre-allocated 1024 pages

Furthermore, confusing behavior can also be observed when an invalid node follows a valid node.

To fix this, never allocate any typical hugepage when an invalid parameter is received.

Link: https://lkml.kernel.org/r/20220413032915.251254-3-liupeng256@huawei.com
Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Yuntao <liuyuntao10@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Peng Liu
mainline inclusion from mainline-v5.19-rc1 commit 0a7a0f6f category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6OWV4 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a7a0f6f7f3679c906fc55e3805c1d5e2c566f55 -------------------------------- Patch series "hugetlb: Fix some incorrect behavior", v3. This series fix three bugs of hugetlb: 1) Invalid use of nr_online_nodes; 2) Inconsistency between 1G hugepage and 2M hugepage; 3) Useless information in dmesg. This patch (of 4): Certain systems are designed to have sparse/discontiguous nodes. In this case, nr_online_nodes can not be used to walk through numa node. Also, a valid node may be greater than nr_online_nodes. However, in hugetlb, it is assumed that nodes are contiguous. For sparse/discontiguous nodes, the current code may treat a valid node as invalid, and will fail to allocate all hugepages on a valid node that "nid >= nr_online_nodes". As David suggested: if (tmp >= nr_online_nodes) goto invalid; Just imagine node 0 and node 2 are online, and node 1 is offline. Assuming that "node < 2" is valid is wrong. Recheck all the places that use nr_online_nodes, and repair them one by one. [liupeng256@huawei.com: v4] Link: https://lkml.kernel.org/r/20220416103526.3287348-1-liupeng256@huawei.com Link: https://lkml.kernel.org/r/20220413032915.251254-1-liupeng256@huawei.com Link: https://lkml.kernel.org/r/20220413032915.251254-2-liupeng256@huawei.com Fixes: 4178158e ("hugetlbfs: fix issue of preallocation of gigantic pages can't work") Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation") Fixes: e79ce983 ("hugetlbfs: fix a truncation issue in hugepages parameter") Fixes: f9317f77 ("hugetlb: clean up potential spectre issue warnings") Signed-off-by: NPeng Liu <liupeng256@huawei.com> Suggested-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NDavidlohr Bueso <dave@stgolabs.net> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Liu Yuntao <liuyuntao10@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: mm/hugetlb.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
- 25 June 2023, 8 commits
-
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

Since we support per-memcg swapfile control, we need a per-type slot cache to optimize performance. To reduce memory waste, allocate the per-type slot cache only when the feature is enabled or when the corresponding swap device is onlined.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

With the memory.swapfile interface, the swap devices available to a memcg can be limited. The acceptable parameters are 'all', 'none' and a valid swap device.

Usage:
    echo /dev/zram0 > memory.swapfile

If the swap device goes offline, the setting falls back to 'none'.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

memsw can't limit the usage of swap space on its own. Add the memory.swap.max interface to limit the difference between memsw.usage and memory.usage. Since a page may occupy both a swap entry and a swap cache page, this value is not exactly equal to swap.usage.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

Add a new per-memcg swapin interface to load data into memory in advance to improve access efficiency.

Usage:
    # echo 0 > memory.force_swapin

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

Introduce memcg swap QoS, which covers the subsequent sub-features. Add CONFIG_MEMCG_SWAP_QOS and the static key memcg_swap_qos_key.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT
CVE: NA

--------------------------------

Add an anon/file selector to the memory.reclaim interface so that reclaim can be limited to one type of pages. The LRU algorithm reclaims cold pages and balances between file and anon, but it doesn't consider the speed of the backing device. For example, with a zram device, reclaiming anon pages may have less impact on performance. So extend the memory.reclaim interface to reclaim only one type of pages.

Usage:
    "echo <size> type=anon > memory.reclaim"
    "echo <size> type=file > memory.reclaim"

The previous format remains supported.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
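A sketch of how the extended "<size> [type=anon|file]" argument could be parsed; the flag and helper names are illustrative, not the actual openEuler identifiers:

    #include <linux/bits.h>
    #include <linux/errno.h>
    #include <linux/page_counter.h>
    #include <linux/string.h>

    #define RECLAIM_TYPE_ANON   BIT(0)  /* illustrative flag names */
    #define RECLAIM_TYPE_FILE   BIT(1)

    static int parse_reclaim_arg(char *buf, unsigned long *nr_pages,
                                 unsigned int *type)
    {
        char *opt = strchr(buf, ' ');

        *type = RECLAIM_TYPE_ANON | RECLAIM_TYPE_FILE;  /* default: both */
        if (opt) {
            *opt++ = '\0';
            if (!strcmp(strstrip(opt), "type=anon"))
                *type = RECLAIM_TYPE_ANON;
            else if (!strcmp(strstrip(opt), "type=file"))
                *type = RECLAIM_TYPE_FILE;
            else
                return -EINVAL;
        }

        /* Size part is parsed exactly as before, in pages. */
        return page_counter_memparse(strstrip(buf), "", nr_pages);
    }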
-
Submitted by Yosry Ahmed
mainline inclusion from mainline-v6.0-rc1 commit 73b73bac category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=73b73bac90d97400e29e585c678c4d0ebfd2680d -------------------------------- memory.reclaim is a cgroup v2 interface that allows users to proactively reclaim memory from a memcg, without real memory pressure. Reclaim operations invoke vmpressure, which is used: (a) To notify userspace of reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being under memory pressure for networking (see mem_cgroup_under_socket_pressure()). For (a), vmpressure notifications in v1 are not affected by this change since memory.reclaim is a v2 feature. For (b), the effects of the vmpressure signal (according to Shakeel [1]) are as follows: 1. Reducing send and receive buffers of the current socket. 2. May drop packets on the rx path. 3. May throttle current thread on the tx path. Since proactive reclaim is invoked directly by userspace, not by memory pressure, it makes sense not to throttle networking. Hence, this change makes sure that proactive reclaim caused by memory.reclaim does not trigger vmpressure. [1] https://lore.kernel.org/lkml/CALvZod68WdrXEmBpOkadhB5GPYmCXaDZzXH=yyGOCAjFRn4NDQ@mail.gmail.com/ [yosryahmed@google.com: update documentation] Link: https://lkml.kernel.org/r/20220721173015.2643248-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220714064918.2576464-1-yosryahmed@google.comSigned-off-by: NYosry Ahmed <yosryahmed@google.com> Acked-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
Submitted by David Hildenbrand
mainline inclusion from mainline-v5.11-rc1 commit 8dc4bb58 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7F3HQ CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8dc4bb58a146655eb057247d7c9d19e73928715b -------------------------------- virtio-mem soon wants to use offline_and_remove_memory() memory that exceeds a single Linux memory block (memory_block_size_bytes()). Let's remove that restriction. Let's remember the old state and try to restore that if anything goes wrong. While re-onlining can, in general, fail, it's highly unlikely to happen (usually only when a notifier fails to allocate memory, and these are rather rare). This will be used by virtio-mem to offline+remove memory ranges that are bigger than a single memory block - for example, with a device block size of 1 GiB (e.g., gigantic pages in the hypervisor) and a Linux memory block size of 128MB. While we could compress the state into 2 bit, using 8 bit is much easier. This handling is similar, but different to acpi_scan_try_to_offline(): a) We don't try to offline twice. I am not sure if this CONFIG_MEMCG optimization is still relevant - it should only apply to ZONE_NORMAL (where we have no guarantees). If relevant, we can always add it. b) acpi_scan_try_to_offline() simply onlines all memory in case something goes wrong. It doesn't restore previous online type. Let's do that, so we won't overwrite what e.g., user space configured. Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NDavid Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20201112133815.13332-28-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
-
- 21 June 2023, 1 commit
-
-
Submitted by Ma Wupeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I77BDW
CVE: NA

--------------------------------

During copy_present_pte(), the rss counter is increased but the corresponding reliable page counter is not updated. This leads to a reliable page counter mismatch. Fix this by updating the reliable page counter there as well.

Fixes: d81e9624 ("proc: Count reliable memory usage of reliable tasks")
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Nanyong Sun <sunnanyong@huawei.com>
-
- 16 June 2023, 1 commit
-
-
Submitted by Kang Chen
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NYW4
CVE: NA

--------------------------------

Raw call flow:

    oom_kill_process
      -> mem_cgroup_scan_tasks(.., .., message)
        -> memcg_print_bad_task(message, ..)

message is of type "const char *" and is incorrectly cast to "struct oom_control *" in memcg_print_bad_task. Fix it by moving memcg_print_bad_task out of mem_cgroup_scan_tasks and calling it in select_bad_process and dump_tasks. Furthermore, use struct oom_control * directly and remove the useless parameter `ret`.

Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Kang Chen <void0red@hust.edu.cn>
(cherry picked from commit 789038c7)
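For illustration, the hazard looks roughly like this; the callback signature matches the mem_cgroup_scan_tasks() callback type, while the body is a stand-in:

    #include <linux/oom.h>
    #include <linux/printk.h>
    #include <linux/sched.h>

    static int dump_task_cb(struct task_struct *task, void *arg)
    {
        /*
         * Bug pattern: 'arg' is whatever the caller passed in. When the
         * caller hands over the "const char *message", interpreting it
         * as an oom_control reads string bytes as struct fields.
         */
        struct oom_control *oc = arg;

        pr_info("pid %d, oom order %d\n", task->pid, oc->order);
        return 0;
    }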
-
- 13 June 2023, 1 commit
-
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

If the swapped-out memory is large, such as tens of gigabytes, we allocate a large management structure, which may be tens or hundreds of megabytes. So allocating the management structures with kmalloc may fail. Fix this by changing kmalloc to kvzalloc and kfree to kvfree.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
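The pattern applied by the fix, as a sketch (`table` and `nr_entries` are illustrative names):

    #include <linux/mm.h>
    #include <linux/overflow.h>

    /* kvzalloc() falls back to vmalloc() when the request is too large for
     * kmalloc(); kvfree() releases either kind of allocation. */
    table = kvzalloc(array_size(nr_entries, sizeof(*table)), GFP_KERNEL);
    if (!table)
        return -ENOMEM;

    /* ... use the management structure ... */

    kvfree(table);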
-
- 07 June 2023, 6 commits
-
-
Submitted by Liu Shixin
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE
CVE: NA

--------------------------------

The type of pfn is int, which can result in truncation. Change its type to unsigned long to fix the problem.

Fixes: eef7b4fd ("mm/dynamic_hugetlb: use pfn to traverse subpages")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
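For illustration (the loop and helper below are hypothetical): with 4 KiB pages, PFNs at or above 2^31 describe physical addresses at or above 8 TiB, which no longer fit in a signed 32-bit int.

    unsigned long pfn;      /* was: int pfn */

    for (pfn = start_pfn; pfn < start_pfn + nr_subpages; pfn++)
        clear_page_pool_flag(pfn_to_page(pfn));     /* hypothetical helper */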
-
Submitted by Liu Shixin
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE
CVE: NA

--------------------------------

Before discarding the bad page, set the PagePool flag to distinguish it from a free page, and increase used_pages to guarantee used + freed = total.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6MH03
CVE: NA

--------------------------------

When memory is fragmented, update_reserve_pages() may call migrate_pages() to collect contiguous memory. This function can sleep, so we should use a mutex instead of a spinlock. Use KABI_EXTEND to fix the KABI breakage.

Fixes: 0c06a1c0 ("mm/dynamic_hugetlb: add interface to configure the count of hugepages")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE
CVE: NA

--------------------------------

Memory hotplug and memory failure dissolve freed hugepages back to the buddy system, which is not the expected behavior for dynamic hugetlb. Skip the dissolve operation for hugepages belonging to dynamic hugetlb. For memory hotplug, the hotplug operation is not allowed if a dhugetlb pool exists. For memory failure, the hugepage is discarded directly.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE
CVE: NA

--------------------------------

To support dynamic hugetlb on arm64, two more things are needed. The first is to fix the KABI breakage in mem_cgroup; kabi_reserve_5 was used for that in the previous patch. The second is to reject cont-bit hugetlb, since this feature only supports PMD-size and PUD-size hugepages. The feature also only supports a 4KB page size, not 16KB or 64KB.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Shakeel Butt
mainline inclusion from mainline-v6.1-rc1 commit cfdab60b category: perf bugzilla: https://gitee.com/openeuler/kernel/issues/I7BHGR CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cfdab60bfa66b2dc0391c9e405b8af6039924cd4 -------------------------------- Patch series "memcg: optimize charge codepath", v2. Recently Linux networking stack has moved from a very old per socket pre-charge caching to per-cpu caching to avoid pre-charge fragmentation and unwarranted OOMs. One impact of this change is that for network traffic workloads, memcg charging codepath can become a bottleneck. The kernel test robot has also reported this regression[1]. This patch series tries to improve the memcg charging for such workloads. This patch series implement three optimizations: (A) Reduce atomic ops in page counter update path. (B) Change layout of struct page_counter to eliminate false sharing between usage and high. (C) Increase the memcg charge batch to 64. To evaluate the impact of these optimizations, on a 72 CPUs machine, we ran the following workload in root memcg and then compared with scenario where the workload is run in a three level of cgroup hierarchy with top level having min and low setup appropriately. $ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): 1. root memcg 21694.8 Mbps 2. 6.0-rc1 10482.7 Mbps (-51.6%) 3. 6.0-rc1 + (A) 14542.5 Mbps (-32.9%) 4. 6.0-rc1 + (B) 12413.7 Mbps (-42.7%) 5. 6.0-rc1 + (C) 17063.7 Mbps (-21.3%) 6. 6.0-rc1 + (A+B+C) 20120.3 Mbps (-7.2%) With all three optimizations, the memcg overhead of this workload has been reduced from 51.6% to just 7.2%. [1] https://lore.kernel.org/linux-mm/20220619150456.GB34471@xsang-OptiPlex-9020/ This patch (of 3): For cgroups using low or min protections, the function propagate_protected_usage() was doing an atomic xchg() operation irrespectively. We can optimize out this atomic operation for one specific scenario where the workload is using the protection (i.e. min > 0) and the usage is above the protection (i.e. usage > min). This scenario is actually very common where the users want a part of their workload to be protected against the external reclaim. Though this optimization does introduce a race when the usage is around the protection and concurrent charges and uncharged trip it over or under the protection. In such cases, we might see lower effective protection but the subsequent charge/uncharge will correct it. To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy with top level having min and low setup appropriately to see if this optimization is effective for the mentioned case. 
$ netserver -6 # 36 instances of netperf with following params $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Results (average throughput of netperf): Without (6.0-rc1) 10482.7 Mbps With patch 14542.5 Mbps (38.7% improvement) With the patch, the throughput improved by 38.7% Link: https://lkml.kernel.org/r/20220825000506.239406-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20220825000506.239406-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Reported-by: Nkernel test robot <oliver.sang@intel.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Reviewed-by: NFeng Tang <feng.tang@intel.com> Acked-by: NRoman Gushchin <roman.gushchin@linux.dev> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oliver Sang <oliver.sang@intel.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
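A sketch of the first optimization for the "min" counter (the "low" counter is handled the same way); this is close to, but not guaranteed to be verbatim, the upstream hunk:

    #include <linux/page_counter.h>

    static void propagate_protected_usage(struct page_counter *c,
                                          unsigned long usage)
    {
        unsigned long protected, old_protected;
        long delta;

        if (!c->parent)
            return;

        protected = min(usage, READ_ONCE(c->min));
        old_protected = atomic_long_read(&c->min_usage);
        /* Skip the atomic xchg entirely when nothing would change. */
        if (protected != old_protected) {
            old_protected = atomic_long_xchg(&c->min_usage, protected);
            delta = protected - old_protected;
            if (delta)
                atomic_long_add(delta, &c->parent->children_min_usage);
        }
    }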
-
- 30 May 2023, 7 commits
-
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

When the pte is none and old_pte is not NULL, _old_pte is uninitialized and will be passed out of uswap_unmap_anon_page(). To fix this, add a return value to uswap_unmap_anon_page to indicate whether the pte is none before unmapping.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Add a VM_WRITE check for the swap-out buffer. If the swap-out buffer VMA contains VM_WRITE, the PTE should be marked as writable.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Add a VMA check for uswap registration to make sure that the swap-in VA is of the same type as the swap-out VA.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

We call follow_page() with FOLL_DUMP to handle ZERO_PAGE. Although FOLL_DUMP is intended for get_dump_page(), it just so happens that its special treatment of the ZERO_PAGE (returning an error instead of doing get_page) suits uswap very well. If somehow an abnormal page has sneaked into the range, we won't oops here.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
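A minimal sketch of the check described above; the error handling is illustrative:

    /* With FOLL_DUMP, follow_page() returns an error pointer for the zero
     * page instead of taking a reference on it. */
    page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
    if (IS_ERR_OR_NULL(page))
        return -EFAULT;     /* zero page or unexpected page state: bail out */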
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Add a page_count() check for the swap-out VA to make sure that no other kernel mechanism is using the physical page. Call lru_add_drain_all() before swap-out to correct the page_count() of swap-out pages.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Add a VMA check for the swap-in and swap-out buffers to make sure that they are of the same type as the swap-out VA.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

When the swap-in buffer contains no physical pages, the errno in mfill_atomic_pte_nocopy() will be ENOENT. A BUG_ON will trigger because the userswap feature does not use struct page *page and page is set to NULL. To fix this issue, change the errno from ENOENT to EINVAL.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
- 19 May 2023, 4 commits
-
-
Submitted by Nanyong Sun
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
CVE: NA

----------------------------------------------------------------------

Add the control file "memory.ksm" to enable KSM per cgroup. Writing 1 sets all tasks currently in the cgroup to KSM merge-any mode, which means KSM gets enabled for all VMAs of a process. Writing 0 disables KSM for them and unmerges the merged pages. Reading the file shows the above state and the KSM-related profits of this cgroup.

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
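A hedged sketch of the write-side semantics; the css task iterator is the real cgroup API, while the KSM enable/disable helpers stand in for whatever this tree provides:

    #include <linux/cgroup.h>
    #include <linux/ksm.h>
    #include <linux/sched/mm.h>

    static int memory_ksm_apply(struct cgroup_subsys_state *css, bool enable)
    {
        struct css_task_iter it;
        struct task_struct *task;
        int ret = 0;

        css_task_iter_start(css, 0, &it);
        while (!ret && (task = css_task_iter_next(&it))) {
            struct mm_struct *mm = get_task_mm(task);

            if (!mm)
                continue;
            /* Stand-in helpers for per-mm merge-any enable/disable. */
            ret = enable ? ksm_enable_merge_any(mm)
                         : ksm_disable_merge_any(mm);
            mmput(mm);
        }
        css_task_iter_end(&it);

        return ret;
    }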
-
Submitted by David Hildenbrand
mainline inclusion from mainline-v6.4-rc1 commit 24139c07 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24139c07f413ef4b555482c758343d71392a19bc ---------------------------------------------------------------------- Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup disabling KSM", v2. (1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE does, (2) add a selftest for it and (3) factor out disabling of KSM from s390/gmap code. This patch (of 3): Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear the VM_MERGEABLE flag from all VMAs -- just like KSM would. Of course, only do that if we previously set PR_SET_MEMORY_MERGE=1. Link: https://lkml.kernel.org/r/20230422205420.30372-1-david@redhat.com Link: https://lkml.kernel.org/r/20230422205420.30372-2-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NStefan Roesch <shr@devkernel.io> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: mm/ksm.c Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by Stefan Roesch
mainline inclusion from mainline-v6.4-rc1 commit d21077fb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21077fbc2fc987c2e593c34dc3b4d84e546dc9f ---------------------------------------------------------------------- This adds the general_profit KSM sysfs knob and the process profit metric knobs to ksm_stat. 1) expose general_profit metric The documentation mentions a general profit metric, however this metric is not calculated. In addition the formula depends on the size of internal structures, which makes it more difficult for an administrator to make the calculation. Adding the metric for a better user experience. 2) document general_profit sysfs knob 3) calculate ksm process profit metric The ksm documentation mentions the process profit metric and how to calculate it. This adds the calculation of the metric. 4) mm: expose ksm process profit metric in ksm_stat This exposes the ksm process profit metric in /proc/<pid>/ksm_stat. The documentation mentions the formula for the ksm process profit metric, however it does not calculate it. In addition the formula depends on the size of internal structures. So it makes sense to expose it. 5) document new procfs ksm knobs Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.ioSigned-off-by: NStefan Roesch <shr@devkernel.io> Reviewed-by: NBagas Sanjaya <bagasdotme@gmail.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by Stefan Roesch
mainline inclusion from mainline-v6.4-rc1 commit d7597f59 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7597f59d1d33e9efbffa7060deb9ee5bd119e62 ---------------------------------------------------------------------- Patch series "mm: process/cgroup ksm support", v9. So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads. Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis. Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls. Security concerns: In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable. Performance: Experiments with using UKSM have shown a capacity increase of around 20%. Here are the metrics from an instagram workload (taken from a machine with 64GB main memory): full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0 After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM. Detailed changes: 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 3. 
Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm. 4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat. 5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes. This patch (of 3): So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 1) Introduce new MMF_VM_MERGE_ANY flag This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process. 2) Setting VM_MERGEABLE on VMA creation When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA. 3) support disabling of ksm for a process This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl. 4) add new prctl option to get and set ksm for a process This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process. 3. Disabling MMF_VM_MERGE_ANY for storage keys in s390 In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled. Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.ioSigned-off-by: NStefan Roesch <shr@devkernel.io> Acked-by: NDavid Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: kernel/sys.c mm/ksm.c mm/mmap.c Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
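A small userspace example of the new prctl options described above; the constant values are the ones merged upstream and should be checked against this tree's uapi headers:

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_MEMORY_MERGE
    #define PR_SET_MEMORY_MERGE 67  /* upstream value; verify locally */
    #endif
    #ifndef PR_GET_MEMORY_MERGE
    #define PR_GET_MEMORY_MERGE 68
    #endif

    int main(void)
    {
        /* Opt the whole process (and future children) into KSM merging. */
        if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
            perror("PR_SET_MEMORY_MERGE");

        printf("ksm merge-any enabled: %d\n",
               prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
        return 0;
    }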
-