- 26 June 2023 (3 commits)
-
-
Submitted by Joao Martins
mainline inclusion
from mainline-v6.2-rc1
commit 11aad263
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6SROX
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11aad2631bf74b3c811dee76154702aab855a323

--------------------------------

Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed back to page allocator is as following: for a 2M hugetlb page it will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially, that means that it breaks the first 4K of a potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For this reason the memory that it's free back to page allocator cannot be used for hugetlb to allocate huge pages of the same size, but rather only of a smaller huge page size:

Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node having 64G):

* Before allocation:

  Free pages count per migrate type at order     0      1      2      3      4      5      6      7      8      9     10
  ...
  Node 0, zone Normal, type Movable            340    100     32     15      1      2      0      0      0      1  15558

  $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  31987

* After:

  Node 0, zone Normal, type Movable          30893  32006  31515      7      0      0      0      0      0      0      0

Notice how the memory freed back are put back into 4K / 8K / 16K page pools. And it allocates a total of 31987 pages (63974M).

To fix this behaviour rather than remapping second vmemmap page (thus breaking the contiguous block of memory backing the struct pages) repopulate the first vmemmap page with a new one. We allocate and copy from the currently mapped vmemmap page, and then remap it later on. The same algorithm works if there's a pre initialized walk::reuse_page and the head page doesn't need to be skipped and instead we remap it when the @addr being changed is the @reuse_addr.

The new head page is allocated in vmemmap_remap_free() given that on restore there's no need for functional change. Note that, because right now one hugepage is remapped at a time, thus only one free 4K page at a time is needed to remap the head page. Should it fail to allocate said new page, it reuses the one that's already mapped just like before. As a result, for every 64G of contiguous hugepages it can give back 1G more of contiguous memory per 64G, while needing in total 128M new 4K pages (for 2M hugetlb) or 256k (for 1G hugetlb).

After the changes, try to assign a 64G node to hugetlb (on a 128G 2node guest, each node with 64G):

* Before allocation:

  Free pages count per migrate type at order     0      1      2      3      4      5      6      7      8      9     10
  ...
  Node 0, zone Normal, type Movable              1      1      1      0      0      1      0      0      1      1  15564

  $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  32394

* After:

  Node 0, zone Normal, type Movable              0     50     97    108     96     81     70     46     18      0      0

In the example above, 407 more hugetlb 2M pages are allocated i.e. 814M out of the 32394 (64788M) allocated. So the memory freed back is indeed being used back in hugetlb and there's no massive order-0..order-2 pages accumulated unused.

[joao.m.martins@oracle.com: v3]
  Link: https://lkml.kernel.org/r/20221109200623.96867-1-joao.m.martins@oracle.com
[joao.m.martins@oracle.com: add smp_wmb() to ensure page contents are visible prior to PTE write]
  Link: https://lkml.kernel.org/r/20221110121214.6297-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20221107153922.77094-1-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
    mm/hugetlb_vmemmap.c

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
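For reference, the core of the idea (copy the currently mapped head vmemmap page into a freshly allocated page, make the copy visible, then install the new PTE) can be pictured with a short kernel-style sketch. This is an illustration only, not the actual mm/hugetlb_vmemmap.c code; the function name is hypothetical and TLB maintenance is omitted. Only alloc_pages(), page_to_virt(), copy_page(), smp_wmb(), mk_pte() and set_pte_at() are real kernel interfaces.

    /*
     * Illustrative sketch (not the real implementation): remap the head
     * vmemmap page to a newly allocated copy instead of reusing it in
     * place, so the original contiguous block can be freed as a whole.
     */
    static int remap_head_vmemmap_page(unsigned long addr, pte_t *ptep)
    {
            struct page *old = pte_page(*ptep);
            struct page *new = alloc_pages(GFP_KERNEL | __GFP_NORETRY, 0);

            if (!new)
                    return -ENOMEM; /* fall back to reusing the mapped page */

            copy_page(page_to_virt(new), page_to_virt(old));
            /* Ensure the copied contents are visible before the new PTE is. */
            smp_wmb();
            set_pte_at(&init_mm, addr, ptep, mk_pte(new, PAGE_KERNEL));
            return 0;
    }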
-
Submitted by Peng Liu
mainline inclusion
from mainline-v5.19-rc1
commit f87442f4
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OWV4
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f87442f407af80dac4dc81c8a7772b71b36b2e09

--------------------------------

Hugepages can be specified to pernode since "hugetlbfs: extend the definition of hugepages parameter to support node allocation", but the following problem is observed.

Confusing behavior is observed when both 1G and 2M hugepage is set after "numa=off".

cmdline hugepage settings:
  hugepagesz=1G hugepages=0:3,1:3
  hugepagesz=2M hugepages=0:1024,1:1024

results:
  HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
  HugeTLB registered 2.00 MiB page size, pre-allocated 1024 pages

Furthermore, confusing behavior can also be observed when an invalid node follows a valid node.

To fix this, never allocate any typical hugepage when an invalid parameter is received.

Link: https://lkml.kernel.org/r/20220413032915.251254-3-liupeng256@huawei.com
Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Yuntao <liuyuntao10@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
Submitted by Peng Liu
mainline inclusion
from mainline-v5.19-rc1
commit 0a7a0f6f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OWV4
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a7a0f6f7f3679c906fc55e3805c1d5e2c566f55

--------------------------------

Patch series "hugetlb: Fix some incorrect behavior", v3.

This series fix three bugs of hugetlb:
1) Invalid use of nr_online_nodes;
2) Inconsistency between 1G hugepage and 2M hugepage;
3) Useless information in dmesg.

This patch (of 4):

Certain systems are designed to have sparse/discontiguous nodes. In this case, nr_online_nodes can not be used to walk through numa node. Also, a valid node may be greater than nr_online_nodes.

However, in hugetlb, it is assumed that nodes are contiguous. For sparse/discontiguous nodes, the current code may treat a valid node as invalid, and will fail to allocate all hugepages on a valid node that "nid >= nr_online_nodes".

As David suggested:

  if (tmp >= nr_online_nodes)
          goto invalid;

Just imagine node 0 and node 2 are online, and node 1 is offline. Assuming that "node < 2" is valid is wrong.

Recheck all the places that use nr_online_nodes, and repair them one by one.

[liupeng256@huawei.com: v4]
  Link: https://lkml.kernel.org/r/20220416103526.3287348-1-liupeng256@huawei.com
Link: https://lkml.kernel.org/r/20220413032915.251254-1-liupeng256@huawei.com
Link: https://lkml.kernel.org/r/20220413032915.251254-2-liupeng256@huawei.com
Fixes: 4178158e ("hugetlbfs: fix issue of preallocation of gigantic pages can't work")
Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
Fixes: e79ce983 ("hugetlbfs: fix a truncation issue in hugepages parameter")
Fixes: f9317f77 ("hugetlb: clean up potential spectre issue warnings")
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Liu Yuntao <liuyuntao10@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
    mm/hugetlb.c

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
-
- 25 June 2023 (8 commits)
-
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- Since we support per-memcg swapfile control, we need per-type slot cache to optimize performance. To reduce memory waste, allocate per-type slot cache when enable feature or online the corresponding swap device. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- With memory.swapfile interface, the avail swap device can be limit for memcg. The acceptable parameters are 'all', 'none' and valid swap device. Usage: echo /dev/zram0 > memory.swapfile If the swap device is offline, the swapfile will be fallback to 'none'. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- The memsw can't limit the usage of swap space. Add memory.swap.max interface to limit the difference value of memsw.usage and memory.usage. Since a page may occupy both swap entry and a swap cache page, this value is not exactly equal to swap.usage. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
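A minimal user-space sketch of driving the new file is below. The cgroup path ("/sys/fs/cgroup/memory/test") and the accepted value format ("1G") are assumptions made for illustration; they depend on where the memcg controller is mounted, the cgroup name, and how this interface parses its input.

    /* Sketch: write a swap limit into the per-memcg memory.swap.max file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /* Path and value format are assumptions for this example. */
            const char *path = "/sys/fs/cgroup/memory/test/memory.swap.max";
            const char *limit = "1G";
            int fd = open(path, O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (write(fd, limit, strlen(limit)) < 0)
                    perror("write");
            close(fd);
            return 0;
    }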
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- Add a new per-memcg swapin interface to load data into memory in advance to improve access efficiency. Usage: # echo 0 > memory.force_swapin Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- Introduce memcg swap qos including subsequent sub-features. Add CONFIG_MEMCG_SWAP_QOS and static key memcg_swap_qos_key. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA -------------------------------- Add anon/file to memory.reclaim interface to limit only reclaim one type pages. The lru algorithm can reclaim cold pages and balance between file and anon. But it didn't consider the speed of backend device. For example, if there is zram device, reclaim anon pages might has less impact on performance. So extend memory.reclaim interface to reclaim one type pages. Usage: "echo <size> type=anon > memory.reclaim" "echo <size> type=file > memory.reclaim" Also compatible with the previous format. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Yosry Ahmed
mainline inclusion from mainline-v6.0-rc1 commit 73b73bac category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGGT CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=73b73bac90d97400e29e585c678c4d0ebfd2680d -------------------------------- memory.reclaim is a cgroup v2 interface that allows users to proactively reclaim memory from a memcg, without real memory pressure. Reclaim operations invoke vmpressure, which is used: (a) To notify userspace of reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being under memory pressure for networking (see mem_cgroup_under_socket_pressure()). For (a), vmpressure notifications in v1 are not affected by this change since memory.reclaim is a v2 feature. For (b), the effects of the vmpressure signal (according to Shakeel [1]) are as follows: 1. Reducing send and receive buffers of the current socket. 2. May drop packets on the rx path. 3. May throttle current thread on the tx path. Since proactive reclaim is invoked directly by userspace, not by memory pressure, it makes sense not to throttle networking. Hence, this change makes sure that proactive reclaim caused by memory.reclaim does not trigger vmpressure. [1] https://lore.kernel.org/lkml/CALvZod68WdrXEmBpOkadhB5GPYmCXaDZzXH=yyGOCAjFRn4NDQ@mail.gmail.com/ [yosryahmed@google.com: update documentation] Link: https://lkml.kernel.org/r/20220721173015.2643248-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220714064918.2576464-1-yosryahmed@google.comSigned-off-by: NYosry Ahmed <yosryahmed@google.com> Acked-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
Submitted by David Hildenbrand
mainline inclusion from mainline-v5.11-rc1 commit 8dc4bb58 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I7F3HQ CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8dc4bb58a146655eb057247d7c9d19e73928715b -------------------------------- virtio-mem soon wants to use offline_and_remove_memory() memory that exceeds a single Linux memory block (memory_block_size_bytes()). Let's remove that restriction. Let's remember the old state and try to restore that if anything goes wrong. While re-onlining can, in general, fail, it's highly unlikely to happen (usually only when a notifier fails to allocate memory, and these are rather rare). This will be used by virtio-mem to offline+remove memory ranges that are bigger than a single memory block - for example, with a device block size of 1 GiB (e.g., gigantic pages in the hypervisor) and a Linux memory block size of 128MB. While we could compress the state into 2 bit, using 8 bit is much easier. This handling is similar, but different to acpi_scan_try_to_offline(): a) We don't try to offline twice. I am not sure if this CONFIG_MEMCG optimization is still relevant - it should only apply to ZONE_NORMAL (where we have no guarantees). If relevant, we can always add it. b) acpi_scan_try_to_offline() simply onlines all memory in case something goes wrong. It doesn't restore previous online type. Let's do that, so we won't overwrite what e.g., user space configured. Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NDavid Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20201112133815.13332-28-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
-
- 21 June 2023 (1 commit)
-
-
Submitted by Ma Wupeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I77BDW CVE: NA -------------------------------- During copy_present_pte, rss counter is increased but the corresponding reliable page counter is not updated. This will lead to reliable page counter mismatch. Fix this by adding reliable page counter. Fixes: d81e9624 ("proc: Count reliable memory usage of reliable tasks") Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: NNanyong Sun <sunnanyong@huawei.com>
-
- 16 June 2023 (1 commit)
-
-
Submitted by Kang Chen
hulk inclusion category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NYW4 CVE: NA -------------------------------- raw call flow: oom_kill_process -> mem_cgroup_scan_tasks(.., .., message) -> memcg_print_bad_task(message, ..) message is "const char*" type, and incorrectly cast to "oom_control*" type in memcg_print_bad_task. Fix it by moving memcg_print_bad_task out of mem_cgroup_scan_tasks and call it in select_bad_process and dump_tasks. Furthermore, use struct oom_control* directly and remove the useless parm `ret`. Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NKang Chen <void0red@hust.edu.cn> (cherry picked from commit 789038c7)
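A reduced, self-contained sketch of the bug pattern and the shape of the fix is below. The struct and helper names are simplified stand-ins for illustration, not the real memcontrol/oom-killer code.

    /*
     * Illustration: a callback that receives an opaque "void *arg" must not
     * assume its type.  The buggy path handed a "const char *" message
     * through the same argument that the callback cast to
     * "struct oom_control *"; the fix is to pass the correctly typed
     * pointer directly.
     */
    #include <stdio.h>

    struct oom_control {
            unsigned long totalpages;       /* simplified stand-in */
    };

    /* Buggy shape: arg may actually point at a message string. */
    static void print_bad_task_buggy(void *arg)
    {
            struct oom_control *oc = arg;   /* wrong: arg can be a string */
            printf("totalpages=%lu\n", oc->totalpages);
    }

    /* Fixed shape: the caller passes a real struct oom_control *. */
    static void print_bad_task_fixed(struct oom_control *oc)
    {
            printf("totalpages=%lu\n", oc->totalpages);
    }

    int main(void)
    {
            struct oom_control oc = { .totalpages = 1024 };

            print_bad_task_fixed(&oc);
            /* print_bad_task_buggy("out of memory") would misread the string. */
            return 0;
    }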
-
- 13 June 2023 (1 commit)
-
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- If the swapped-out memory is large, such as tens of gigabytes, we will allocate a large management structure, which may be tens of megabytes or hundreds of megabytes. So if we use kmalloc to allocate management structures it may fail. Fix this by changing kmalloc to kvzalloc and kfree to kvfree. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
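The change follows the usual kernel pattern for allocations that may be too large for kmalloc(). A sketch with placeholder names (not the actual userswap data structures) is below.

    /*
     * Kernel-style sketch: for a management table whose size scales with the
     * amount of swapped-out memory, kmalloc() can fail on large physically
     * contiguous requests, while kvzalloc() falls back to vmalloc.
     * The element type and function names are placeholders.
     */
    struct swapped_page_info {
            unsigned long pfn;      /* placeholder fields */
            unsigned int flags;
    };

    static struct swapped_page_info *alloc_page_info(unsigned long nr_pages)
    {
            /* Before the fix: kmalloc(nr_pages * sizeof(...), GFP_KERNEL). */
            return kvzalloc(array_size(nr_pages, sizeof(struct swapped_page_info)),
                            GFP_KERNEL);
    }

    static void free_page_info(struct swapped_page_info *info)
    {
            kvfree(info);   /* handles both kmalloc'ed and vmalloc'ed memory */
    }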
-
- 07 June 2023 (6 commits)
-
-
Submitted by Liu Shixin
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE CVE: NA -------------------------------- The type of pfn is int, which can result in truncation. Change its type to unsigned long to fix the problem. Fixes: eef7b4fd ("mm/dynamic_hugetlb: use pfn to traverse subpages") Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
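A generic illustration of the truncation (not the dynamic_hugetlb code itself): with 4K pages, any physical address at or above 8TB yields a pfn larger than INT_MAX, so an int pfn overflows.

    #include <stdio.h>

    #define EXAMPLE_PAGE_SHIFT 12   /* 4K pages; illustrative constant */

    int main(void)
    {
            unsigned long phys = 0x80000000000UL;            /* an 8TB physical address */
            int bad_pfn = phys >> EXAMPLE_PAGE_SHIFT;        /* 0x80000000 > INT_MAX: overflows */
            unsigned long pfn = phys >> EXAMPLE_PAGE_SHIFT;  /* correct value */

            printf("int pfn = %d, unsigned long pfn = %lu\n", bad_pfn, pfn);
            return 0;
    }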
-
Submitted by Liu Shixin
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE CVE: NA -------------------------------- Before discard the bad page, set PagePool flag to distinguish from free page. And increase used_pages to guarantee used + freed = total. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6MH03 CVE: NA -------------------------------- When memory is fragmented, update_reserve_pages() may call migrate_pages() to collect continuous memory. This function can sleep, so we should use mutex lock instead of spin lock. Use KABI_EXTEND to fix kabi broken. Fixes: 0c06a1c0 ("mm/dynamic_hugetlb: add interface to configure the count of hugepages") Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
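A reduced sketch of the locking change, with placeholder names rather than the real dynamic_hugetlb structures; only the general "may sleep under lock" rule is illustrated.

    /*
     * Kernel-style sketch: migrate_pages() can sleep, so the section that
     * may call it must be protected by a mutex rather than a spinlock.
     * Sleeping with a spinlock held is a bug ("scheduling while atomic").
     */
    static DEFINE_MUTEX(example_pool_lock);

    static void example_update_reserve_pages(void)
    {
            mutex_lock(&example_pool_lock);   /* was: spin_lock(&pool->lock) */
            /*
             * ... gather fragmented memory; this path may call
             * migrate_pages(), which can sleep while isolating and
             * copying pages ...
             */
            mutex_unlock(&example_pool_lock);
    }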
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE CVE: NA -------------------------------- The memory hotplug and memory failure will dissolve freed hugepages to buddy system, this is not the expected behavior for dynamic hugetlb. Skip the dissolve operation for hugepages belonging to dynamic hugetlb. For memory hotplug, the hotplug operation is not allowed, if dhugetlb pool existed. For memory failure, the hugepage will be discard directly. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6XOIE CVE: NA -------------------------------- To support dynamic hugetlb on arm64, we need to do two more things. The first one is to fix kabi broken in mem_cgroup, we use kabi_reserve_5 to fix it in previous patch. The second one is to check cont-bit hugetlb since this feature only support for PMD-size and PUD-size hugepage. This feature only support for 4KB pagesize, not support for 16KB and 64KB. Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
-
Submitted by Shakeel Butt
mainline inclusion
from mainline-v6.1-rc1
commit cfdab60b
category: perf
bugzilla: https://gitee.com/openeuler/kernel/issues/I7BHGR
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cfdab60bfa66b2dc0391c9e405b8af6039924cd4

--------------------------------

Patch series "memcg: optimize charge codepath", v2.

Recently Linux networking stack has moved from a very old per socket pre-charge caching to per-cpu caching to avoid pre-charge fragmentation and unwarranted OOMs. One impact of this change is that for network traffic workloads, memcg charging codepath can become a bottleneck. The kernel test robot has also reported this regression[1]. This patch series tries to improve the memcg charging for such workloads.

This patch series implement three optimizations:
(A) Reduce atomic ops in page counter update path.
(B) Change layout of struct page_counter to eliminate false sharing between usage and high.
(C) Increase the memcg charge batch to 64.

To evaluate the impact of these optimizations, on a 72 CPUs machine, we ran the following workload in root memcg and then compared with scenario where the workload is run in a three level of cgroup hierarchy with top level having min and low setup appropriately.

  $ netserver -6
  # 36 instances of netperf with following params
  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
  1. root memcg          21694.8 Mbps
  2. 6.0-rc1             10482.7 Mbps (-51.6%)
  3. 6.0-rc1 + (A)       14542.5 Mbps (-32.9%)
  4. 6.0-rc1 + (B)       12413.7 Mbps (-42.7%)
  5. 6.0-rc1 + (C)       17063.7 Mbps (-21.3%)
  6. 6.0-rc1 + (A+B+C)   20120.3 Mbps (-7.2%)

With all three optimizations, the memcg overhead of this workload has been reduced from 51.6% to just 7.2%.

[1] https://lore.kernel.org/linux-mm/20220619150456.GB34471@xsang-OptiPlex-9020/

This patch (of 3):

For cgroups using low or min protections, the function propagate_protected_usage() was doing an atomic xchg() operation irrespectively. We can optimize out this atomic operation for one specific scenario where the workload is using the protection (i.e. min > 0) and the usage is above the protection (i.e. usage > min).

This scenario is actually very common where the users want a part of their workload to be protected against the external reclaim. Though this optimization does introduce a race when the usage is around the protection and concurrent charges and uncharged trip it over or under the protection. In such cases, we might see lower effective protection but the subsequent charge/uncharge will correct it.

To evaluate the impact of this optimization, on a 72 CPUs machine, we ran the following workload in a three level of cgroup hierarchy with top level having min and low setup appropriately to see if this optimization is effective for the mentioned case.

  $ netserver -6
  # 36 instances of netperf with following params
  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
  Without (6.0-rc1)   10482.7 Mbps
  With patch          14542.5 Mbps (38.7% improvement)

With the patch, the throughput improved by 38.7%.

Link: https://lkml.kernel.org/r/20220825000506.239406-1-shakeelb@google.com
Link: https://lkml.kernel.org/r/20220825000506.239406-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oliver Sang <oliver.sang@intel.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
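Optimization (B) above relies on cache-line placement. A generic kernel-style sketch of the technique (not the actual struct page_counter layout) is:

    /*
     * Sketch: keep the hot, frequently written counter on its own cache
     * line so that readers of the rarely written limits do not suffer
     * false sharing.  Field and struct names are illustrative only.
     */
    struct example_counter {
            /* hot: written on every charge/uncharge */
            atomic_long_t usage;

            /* read-mostly limits, pushed onto a separate cache line */
            unsigned long high ____cacheline_aligned_in_smp;
            unsigned long max;
    };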
-
- 30 May 2023 (7 commits)
-
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- When the pte is none and old_pte is not NULL, _old_pte is uninitialized and will be passed out of uswap_unmap_anon_page(). To fix this, add a return value to uswap_unmap_anon_page to indicate whether the pte is none before unmapping. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add a VM_WRITE check for the swap-out buffer. If the swap-out buffer VMA contains VM_WRITE, the PTE should be marked as writable. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add VMA check for uswap registration to make sure that swap-in VA is of the same type as swap-out VA. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- We call follow_page() with FOLL_DUMP to handle ZERO_PAGE. Although FOLL_DUMP is intended for get_dump_page(), it just so happens that its special treatment of the ZERO_PAGE (returning an error instead of doing get_page) suits uswap very well. If somehow an abnormal page has sneaked into the range, we won't oops here. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
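A kernel-style sketch of the described check follows. It is not the actual userswap code; in particular, combining FOLL_GET with FOLL_DUMP here is an assumption of this sketch.

    /*
     * Sketch: follow_page() with FOLL_DUMP returns an error pointer for the
     * zero page instead of taking a reference, so NULL and error pointers
     * can both be rejected in one place.
     */
    static struct page *uswap_get_page_sketch(struct vm_area_struct *vma,
                                              unsigned long addr)
    {
            struct page *page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);

            if (IS_ERR_OR_NULL(page))
                    return NULL;    /* zero page or otherwise abnormal mapping */
            return page;
    }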
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add page_count() check for swap-out VA to make sure that no other kernel mechanism is using this physical page. Add lru_add_drain_all() before swap-out VA to correct page_count() of swap-out pages. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add VMA check for swap-in and swap-out buffer to make sure that swap-in and swap-out buffer is of the same type as swap-out VA. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- When the swap-in buffer contains no physical pages, the errno in mfill_atomic_pte_nocopy() will be ENOENT. A BUG_ON error will occur because the userswap feature does not use the struct page *page and page is set to NULL. To fix this issue, the errno should be changed from ENOENT to EINVAL. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
- 19 May 2023 (6 commits)
-
-
Submitted by Nanyong Sun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA ---------------------------------------------------------------------- Add control file "memory.ksm" to enable ksm per cgroup. Echo to 1 will set all tasks currently in the cgroup to ksm merge any mode, which means ksm gets enabled for all vma's of a process. Meanwhile echo to 0 will disable ksm for them and unmerge the merged pages. Cat the file will show the above state and ksm related profits of this cgroup. Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from mainline-v6.4-rc1 commit 24139c07 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24139c07f413ef4b555482c758343d71392a19bc ---------------------------------------------------------------------- Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup disabling KSM", v2. (1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE does, (2) add a selftest for it and (3) factor out disabling of KSM from s390/gmap code. This patch (of 3): Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear the VM_MERGEABLE flag from all VMAs -- just like KSM would. Of course, only do that if we previously set PR_SET_MEMORY_MERGE=1. Link: https://lkml.kernel.org/r/20230422205420.30372-1-david@redhat.com Link: https://lkml.kernel.org/r/20230422205420.30372-2-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NStefan Roesch <shr@devkernel.io> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: mm/ksm.c Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by Stefan Roesch
mainline inclusion from mainline-v6.4-rc1 commit d21077fb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21077fbc2fc987c2e593c34dc3b4d84e546dc9f ---------------------------------------------------------------------- This adds the general_profit KSM sysfs knob and the process profit metric knobs to ksm_stat. 1) expose general_profit metric The documentation mentions a general profit metric, however this metric is not calculated. In addition the formula depends on the size of internal structures, which makes it more difficult for an administrator to make the calculation. Adding the metric for a better user experience. 2) document general_profit sysfs knob 3) calculate ksm process profit metric The ksm documentation mentions the process profit metric and how to calculate it. This adds the calculation of the metric. 4) mm: expose ksm process profit metric in ksm_stat This exposes the ksm process profit metric in /proc/<pid>/ksm_stat. The documentation mentions the formula for the ksm process profit metric, however it does not calculate it. In addition the formula depends on the size of internal structures. So it makes sense to expose it. 5) document new procfs ksm knobs Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.ioSigned-off-by: NStefan Roesch <shr@devkernel.io> Reviewed-by: NBagas Sanjaya <bagasdotme@gmail.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by Stefan Roesch
mainline inclusion
from mainline-v6.4-rc1
commit d7597f59
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7597f59d1d33e9efbffa7060deb9ee5bd119e62

----------------------------------------------------------------------

Patch series "mm: process/cgroup ksm support", v9.

So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.

Use case 1:
  The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these types of workloads.

Use case 2:
  The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis.

Use case 3:
  With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higher level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls.

Security concerns:
  In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable.

Performance:
  Experiments with using UKSM have shown a capacity increase of around 20%.

  Here are the metrics from an instagram workload (taken from a machine with 64GB main memory):

    full_scans: 445
    general_profit: 20158298048
    max_page_sharing: 256
    merge_across_nodes: 1
    pages_shared: 129547
    pages_sharing: 5119146
    pages_to_scan: 4000
    pages_unshared: 1760924
    pages_volatile: 10761341
    run: 1
    sleep_millisecs: 20
    stable_node_chains: 167
    stable_node_chains_prune_millisecs: 2000
    stable_node_dups: 2751
    use_zero_pages: 0
    zero_pages_sharing: 0

  After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM.

Detailed changes:

1. New options for prctl system command
   This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes.
   With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
   When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's.
   When forking a process that has KSM enabled, the setting will be inherited by the new child process.

3. Add general_profit metric
   The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm.

4. Add more metrics to ksm_stat
   This adds the process profit metric to /proc/<pid>/ksm_stat.

5. Add more tests to ksm_tests and ksm_functional_tests
   This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes.

This patch (of 3):

So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.

1. New options for prctl system command
   This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes.
   With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
   When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's.
   When forking a process that has KSM enabled, the setting will be inherited by the new child process.

   1) Introduce new MMF_VM_MERGE_ANY flag
      This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process.

   2) Setting VM_MERGEABLE on VMA creation
      When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA.

   3) support disabling of ksm for a process
      This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl.

   4) add new prctl option to get and set ksm for a process
      This adds two new options to the prctl system call
      - enable ksm for all vmas of a process (if the vmas support it).
      - query if ksm has been enabled for a process.

3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
   In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled.

Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
    kernel/sys.c
    mm/ksm.c
    mm/mmap.c

Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
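From user space the new interface is exercised via prctl(); a minimal example follows. The numeric fallbacks (67/68) match the upstream uapi definitions; a kernel carrying this series is required, and the set operation may also require CAP_SYS_RESOURCE.

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_MEMORY_MERGE
    #define PR_SET_MEMORY_MERGE 67
    #define PR_GET_MEMORY_MERGE 68
    #endif

    int main(void)
    {
            /* Opt the whole process (and future children) into KSM merging. */
            if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0) != 0) {
                    perror("PR_SET_MEMORY_MERGE");
                    return 1;
            }
            printf("KSM merge-any enabled: %d\n",
                   (int)prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
            return 0;
    }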
-
Submitted by xu xin
mainline inclusion from mainline-v6.1-rc1 commit cb4df4ca category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cb4df4cae4f2bd8cf7a32eff81178fce31600f7c ---------------------------------------------------------------------- Patch series "ksm: count allocated rmap_items and update documentation", v5. KSM can save memory by merging identical pages, but also can consume additional memory, because it needs to generate rmap_items to save each scanned page's brief rmap information. To determine how beneficial the ksm-policy (like madvise), they are using brings, so we add a new interface /proc/<pid>/ksm_stat for each process The value "ksm_rmap_items" in it indicates the total allocated ksm rmap_items of this process. The detailed description can be seen in the following patches' commit message. This patch (of 2): KSM can save memory by merging identical pages, but also can consume additional memory, because it needs to generate rmap_items to save each scanned page's brief rmap information. Some of these pages may be merged, but some may not be abled to be merged after being checked several times, which are unprofitable memory consumed. The information about whether KSM save memory or consume memory in system-wide range can be determined by the comprehensive calculation of pages_sharing, pages_shared, pages_unshared and pages_volatile. A simple approximate calculation: profit =~ pages_sharing * sizeof(page) - (all_rmap_items) * sizeof(rmap_item); where all_rmap_items equals to the sum of pages_sharing, pages_shared, pages_unshared and pages_volatile. But we cannot calculate this kind of ksm profit inner single-process wide because the information of ksm rmap_item's number of a process is lacked. For user applications, if this kind of information could be obtained, it helps upper users know how beneficial the ksm-policy (like madvise) they are using brings, and then optimize their app code. For example, one application madvise 1000 pages as MERGEABLE, while only a few pages are really merged, then it's not cost-efficient. So we add a new interface /proc/<pid>/ksm_stat for each process in which the value of ksm_rmap_itmes is only shown now and so more values can be added in future. So similarly, we can calculate the ksm profit approximately for a single process by: profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items * sizeof(rmap_item); where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and ksm_rmap_items is shown in /proc/<pid>/ksm_stat. Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cnSigned-off-by: Nxu xin <xu.xin16@zte.com.cn> Reviewed-by: NXiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: NYang Yang <yang.yang29@zte.com.cn> Signed-off-by: NCGEL ZTE <cgel.zte@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: include/linux/mm_types.h Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
Submitted by xu xin
mainline inclusion from mainline-v5.19-rc1 commit 76093853 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7609385337a4feb6236e42dcd0df2185683ce839 ---------------------------------------------------------------------- Some applications or containers want to use KSM by calling madvise() to advise areas of address space to be MERGEABLE. But they may not know which applications are more likely to cause real merges in the deployment. If this patch is applied, it helps them know their corresponding number of merged pages, and then optimize their app code. As current KSM only counts the number of KSM merging pages(e.g. ksm_pages_sharing and ksm_pages_shared) of the whole system, we cannot see the more fine-grained KSM merging, for the upper application optimization, the merging area cannot be set easily according to the KSM page merging probability of each process. Therefore, it is necessary to add extra statistical means so that the upper level users can know the detailed KSM merging information of each process. We add a new proc file named as ksm_merging_pages under /proc/<pid>/ to indicate the involved ksm merging pages of this process. [akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s] Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cnSigned-off-by: Nxu xin <xu.xin16@zte.com.cn> Reported-by: Nkernel test robot <lkp@intel.com> Reviewed-by: NYang Yang <yang.yang29@zte.com.cn> Reviewed-by: NRan Xiaokai <ran.xiaokai@zte.com.cn> Reported-by: NZeal Robot <zealci@zte.com.cn> Cc: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Ohhoon Kwon <ohoono.kwon@samsung.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Feng Tang <feng.tang@intel.com> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: include/linux/mm_types.h Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
-
- 18 May 2023 (7 commits)
-
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add user mode check for swap-out VA to make sure that swap-out VA is user mode address. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Check the VM_READ and VM_WRITE flags of vma->vm_flags to determine whether the read and write permissions of the swap-out page VA are consistent with those of the swap-out buffer VA. If they are inconsistent, the swap operation will fail. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add VMA checks for register address to make sure that register address has the corresponding VMA. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Add checks for new_addr in uswap_mremap() and src_addr in uswap_check_copy_mode(), including user mode checks, overlapping checks, etc. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Fix some type and logical bugs, as follows:
1) The type of the index variable is changed from int to unsigned long to support large memory registration.
2) Fix the bug that USWAP_PAGES_DIRTY does not take effect.
3) Take the mmap_read_lock() when using the VMA in uswap_adjust_uffd_range().
4) Do some code refactoring and cleanup.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- Split uswap_register() into uswap_register() and uswap_adjust_uffd_range(). Before validate_range(), use uswap_register() to handle uswap mode. After validate_range(), use uswap_adjust_uffd_range() to change address range to VMA range, which could reduce fragmentation caused by VMA splitting. By splitting uswap_register(), we could prevent the userswap registration of invalid input address ranges. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
-
Submitted by ZhangPeng
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM -------------------------------- If old_pte is NULL, *old_pte will result in a null pointer dereference. Fix this by adding a NULL check for old_pte. Signed-off-by: NZhangPeng <zhangpeng362@huawei.com>
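The shape of the fix, reduced to a generic kernel-style pattern (illustrative only, not the actual uswap code; the helper name is hypothetical):

    /*
     * Callers may legitimately pass old_pte == NULL when they do not care
     * about the previous PTE value, so check before writing through it.
     */
    static void report_old_pte(pte_t val, pte_t *old_pte)
    {
            if (old_pte)            /* before the fix: unconditional *old_pte = val; */
                    *old_pte = val;
    }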
-