- 14 July 2021, 40 commits
-
Submitted by Zheng Zucheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZX4D CVE: NA -------------------------------- Enable CONFIG_QOS_SCHED to support the QoS scheduler. Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: Chen Hui <judy.chenhui@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zheng Zucheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZX4D CVE: NA -------------------------------- In a co-location scenario, we usually deploy online and offline task groups on the same server. The online tasks are more important than the offline tasks, so to prevent offline tasks from affecting online tasks, we throttle the offline task group when online task groups are running on the same CPU, and unthrottle the offline tasks when the CPU is about to enter the idle state. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: Chen Hui <judy.chenhui@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zheng Zucheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZX4D CVE: NA -------------------------------- In cloud-native hybrid deployment scenarios, online tasks must preempt offline tasks in a timely manner, and offline tasks must not affect the QoS of online tasks, so we introduce the idea of a QoS level to the scheduler, which is now supported with different scheduler policies. The QoS scheduler changes the policy of the affected tasks when the QoS level of a task group is modified through the cpu.qos_level cpu cgroup file. In this way we are able to satisfy the different needs of tasks at different QoS levels. The value of qos_level can be 0 or -1; the default value is 0. If qos_level is 0, the group is an online group; otherwise it is an offline group. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: Chen Hui <judy.chenhui@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
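A minimal userspace sketch of how the cpu.qos_level interface described above might be exercised, assuming a cpu cgroup mounted at /sys/fs/cgroup/cpu and a group named "offline"; both paths are illustrative and not taken from the patch:

    /* Sketch: mark a cpu cgroup as an offline group by writing -1 to cpu.qos_level.
     * The cgroup mount point and group name below are assumptions for illustration. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/sys/fs/cgroup/cpu/offline/cpu.qos_level"; /* hypothetical group */
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
            perror("open cpu.qos_level");
            return 1;
        }
        if (write(fd, "-1", strlen("-1")) < 0)   /* -1 = offline, 0 = online (default) */
            perror("write cpu.qos_level");
        close(fd);
        return 0;
    }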
-
Submitted by Nadav Amit
mainline inclusion from mainline-v5.13-rc1 commit a5aa5ce3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C CVE: NA ------------------------------------------------- Simplify the code and avoid having an additional function on the stack by inlining on_each_cpu_cond() and on_each_cpu(). Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NNadav Amit <namit@vmware.com> [ Minor edits. ] Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210220231712.2475218-10-namit@vmware.com conflict: include/linux/smp.h Signed-off-by: NTong Tiangen <tongtiangen@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
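For context, a minimal kernel-style sketch of how the on_each_cpu() API that this commit inlines is typically used; the callback and wrapper names are illustrative, not from the patch:

    /* Sketch: run a callback on every online CPU and wait for completion. */
    #include <linux/smp.h>
    #include <linux/printk.h>

    static void report_cpu(void *info)
    {
        /* Runs on every online CPU, in IPI (or local) context. */
        pr_info("hello from CPU %d\n", smp_processor_id());
    }

    static void ping_all_cpus(void)
    {
        /* wait == 1: return only after the callback has finished everywhere. */
        on_each_cpu(report_cpu, NULL, 1);
    }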
-
Submitted by Nadav Amit
mainline inclusion from mainline-v5.13-rc1 commit 1608e4cf category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C CVE: NA ------------------------------------------------- The compiler is smart enough without these hints. Suggested-by: NDave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Reviewed-by: NDave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-9-namit@vmware.comSigned-off-by: NTong Tiangen <tongtiangen@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Nadav Amit
mainline inclusion from mainline-v5.13-rc1 commit 291c4011 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C CVE: NA ------------------------------------------------- cpumask_next_and() and cpumask_any_but() are pure, and marking them as such seems to generate different and presumably better code for native_flush_tlb_multi(). Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Reviewed-by: NDave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-8-namit@vmware.comSigned-off-by: NTong Tiangen <tongtiangen@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Nadav Amit
mainline inclusion from mainline-v5.13-rc1 commit 09c5272e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C CVE: NA ------------------------------------------------- Blindly writing to is_lazy for no reason, when the written value is identical to the old value, makes the cacheline dirty for no reason. Avoid making such writes to prevent cache coherency traffic for no reason. Suggested-by: NDave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NNadav Amit <namit@vmware.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Reviewed-by: NDave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-7-namit@vmware.comSigned-off-by: NTong Tiangen <tongtiangen@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Nadav Amit
mainline inclusion from mainline-v5.13-rc1 commit 2f4305b1 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C CVE: NA ------------------------------------------------- cpu_tlbstate is mostly private and only the variable is_lazy is shared. This causes some false-sharing when TLB flushes are performed. Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark each one accordingly. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-6-namit@vmware.com Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Reviewed-by: Chen Wandun <chenwandun@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Nadav Amit
mainline inclusion from mainline-v5.13-rc1 commit 4ce94eab category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C CVE: NA ------------------------------------------------- To improve TLB shootdown performance, flush the remote and local TLBs concurrently. Introduce flush_tlb_multi() that does so. Introduce paravirtual versions of flush_tlb_multi() for KVM, Xen and Hyper-V (Xen and Hyper-V are only compile-tested). While the updated smp infrastructure is capable of running a function on a single local core, it is not optimized for this case. The multiple function calls and the indirect branch introduce some overhead, and might make local TLB flushes slower than they were before the recent changes. Before calling the SMP infrastructure, check if only a local TLB flush is needed to restore the lost performance in this common case. This requires checking mm_cpumask() one more time, but unless this mask is updated very frequently, it should not impact performance negatively. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> # Hyper-v parts Reviewed-by: Juergen Gross <jgross@suse.com> # Xen and paravirt parts Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-5-namit@vmware.com Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Reviewed-by: Chen Wandun <chenwandun@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Nadav Amit
mainline inclusion from mainline-v5.13-rc1 commit 6035152d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C CVE: NA ------------------------------------------------- Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to optimize the code. Open-coding eliminates the need for the indirect branch that is used to call is_lazy(), and on CPUs that are vulnerable to Spectre v2, it eliminates the retpoline. In addition, it allows us to use a preallocated cpumask to compute the CPUs that should be flushed. This would later allow us not to adapt on_each_cpu_cond_mask() to support local and remote functions. Note that calling tlb_is_not_lazy() for every CPU that needs to be flushed, as done in native_flush_tlb_multi(), might look ugly, but it is equivalent to what is currently done in on_each_cpu_cond_mask(). Actually, native_flush_tlb_multi() does it more efficiently since it avoids using an indirect branch for the matter. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-4-namit@vmware.com Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Reviewed-by: Chen Wandun <chenwandun@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Nadav Amit
mainline inclusion from mainline-v5.13-rc1 commit 4c1ba392 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C CVE: NA ------------------------------------------------- The unification of these two functions allows using them in the updated SMP infrastructure. To do so, remove the reason argument from flush_tlb_func_local(), add a member to struct flush_tlb_info that says which CPU initiated the flush, and act accordingly. Optimize the size of flush_tlb_info while we are at it. Unfortunately, this prevents us from using a constant flush_tlb_info for arch_tlbbatch_flush(), but in a later stage we may be able to inline flush_tlb_info into the IPI data, so it should not have an impact eventually. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-3-namit@vmware.com Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Reviewed-by: Chen Wandun <chenwandun@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Nadav Amit
mainline inclusion from mainline-v5.13-rc1 commit a32a4d8a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C CVE: NA ------------------------------------------------- Currently, on_each_cpu() and similar functions do not exploit the potential of concurrency: the function is first executed remotely and only then is it executed locally. Functions such as TLB flush can take considerable time, so this provides an opportunity for performance optimization. To do so, modify smp_call_function_many_cond() to allow the callers to provide a function that should be executed (remotely/locally), and run them concurrently. Keep the other smp_call_function_many() semantics as they are today for backward compatibility: in that case the called function is not executed locally. smp_call_function_many_cond() does not use the optimized version for a single remote target that smp_call_function_single() implements. For a synchronous function call, smp_call_function_single() keeps a call_single_data (which is used for synchronization) on the stack. Interestingly, it seems that not using this optimization provides greater performance improvements (greater speedup with a single remote target than with multiple ones). Presumably, holding data structures that are intended for synchronization on the stack can introduce overhead due to TLB misses and false sharing when the stack is used for other purposes. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-2-namit@vmware.com Signed-off-by: Tong Tiangen <tongtiangen@huawei.com> Reviewed-by: Chen Wandun <chenwandun@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
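A short kernel-style sketch of the conditional cross-CPU call API that this rework feeds into; the predicate, callback and per-CPU flag below are illustrative placeholders, not code from the patch:

    /* Sketch: run a callback only on CPUs that pass a predicate, local CPU included. */
    #include <linux/smp.h>
    #include <linux/cpumask.h>
    #include <linux/percpu.h>

    static DEFINE_PER_CPU(bool, needs_work);

    static bool cpu_needs_work(int cpu, void *info)
    {
        return per_cpu(needs_work, cpu);   /* decide per CPU whether to send the IPI */
    }

    static void do_work(void *info)
    {
        per_cpu(needs_work, smp_processor_id()) = false;
    }

    static void kick_marked_cpus(void)
    {
        /* wait == true: remote and local invocations complete before returning. */
        on_each_cpu_cond_mask(cpu_needs_work, do_work, NULL, true, cpu_online_mask);
    }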
-
Submitted by Peter Zijlstra
mainline inclusion from mainline-v5.11-rc1 commit 545b8c8d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZV2C CVE: NA ------------------------------------------------- Get rid of the __call_single_node union and cleanup the API a little to avoid external code relying on the structure layout as much. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NFrederic Weisbecker <frederic@kernel.org> conflict: kernel/debug/debug_core.c kernel/sched/core.c kernel/smp.c: fix csd_lock_wait_getcpu() csd->node.dst Signed-off-by: NTong Tiangen <tongtiangen@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit 6acfb5ba category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6acfb5ba150cf75005ce85e0e25d79ef2fec287c ------------------------------------------------- Since commit d6995da3 ("hugetlb: use page.private for hugetlb specific page flags") converted page.private to hold hugetlb-specific page flags, we should use hugetlb_page_subpool() to get the subpool pointer instead of page_private(). Not doing so 'could' prevent the migration of hugetlb pages. page_private(hpage) is now used for hugetlb page specific flags. At migration time, the only flag which could be set is HPageVmemmapOptimized. This flag will only be set if the new vmemmap reduction feature is enabled. In addition, !page_mapping() implies an anonymous mapping. So, this will prevent migration of hugetlb pages in anonymous mappings if the vmemmap reduction feature is enabled. In addition, that if statement checked for the rare race condition of a page being migrated while in the process of being freed. Since that check is now wrong, we could leak hugetlb subpool usage counts. The commit forgot to update it in the page migration routine, so fix it. [songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy] Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com Fixes: d6995da3 ("hugetlb: use page.private for hugetlb specific page flags") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reported-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Anshuman Khandual <anshuman.khandual@arm.com> [arm64] Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Conflicts: include/linux/hugetlb.h Signed-off-by: Chen Huang <chenhuang5@huawei.com> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Reviewed-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
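A minimal sketch of the fix's idea in the migration path; the surrounding helper is simplified for illustration and is not a faithful copy of the patched code:

    /* Sketch: after d6995da3, page_private() on a hugetlb head page holds flags,
     * so the subpool must be read through the dedicated accessor instead. */
    #include <linux/hugetlb.h>

    static void hugetlb_migrate_subpool(struct page *old_hpage, struct page *new_hpage)
    {
        /* Old (buggy) idea: page_private(old_hpage) no longer stores the subpool. */
        /* struct hugepage_subpool *spool = (void *)page_private(old_hpage); */

        /* Fixed idea: use the accessor to fetch the subpool pointer. */
        struct hugepage_subpool *spool = hugetlb_page_subpool(old_hpage);

        hugetlb_set_page_subpool(new_hpage, spool);
        hugetlb_set_page_subpool(old_hpage, NULL);
    }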
-
Submitted by Muchun Song
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA ------------------------------------------------- The preparation for freeing the vmemmap pages associated with each HugeTLB page is complete, so we can now support this feature for arm64. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Chen Huang <chenhuang5@huawei.com> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Reviewed-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit e6d41f12 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e6d41f12df0efcaa6e30b575d40f2529024cfce9 ------------------------------------------------- When using HUGETLB_PAGE_FREE_VMEMMAP, the freeing of unused vmemmap pages associated with each HugeTLB page is off by default. Now the vmemmap is PMD mapped, so there is no side effect when this feature is enabled with no HugeTLB pages in the system. Someone may want to enable this feature at compile time instead of via the boot command line, so add a config option to make it default on for those who do not want to enable it via the command line. Link: https://lkml.kernel.org/r/20210616094915.34432-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Conflicts: Documentation/admin-guide/kernel-parameters.txt Signed-off-by: Chen Huang <chenhuang5@huawei.com> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Reviewed-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit 2d7a2171 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d7a21715f25122779e2bed17db8c57aa01e922f ------------------------------------------------- The preparation of splitting huge PMD mapping of vmemmap pages is ready, so switch the mapping from PTE to PMD. Link: https://lkml.kernel.org/r/20210616094915.34432-3-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: arch/x86/mm/init_64.c mm/memory_hotplug.c Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit 3bc2b6a7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3bc2b6a725963bb1b441356873da890e397c1a3f ------------------------------------------------- Patch series "Split huge PMD mapping of vmemmap pages", v4. In order to reduce the difficulty of code review of series [1], we disabled huge PMD mapping of vmemmap pages when that feature was enabled. In this series, we no longer disable huge PMD mapping of vmemmap pages; we split the huge PMD mapping when needed. When HugeTLB pages are freed from the pool we do not attempt to coalesce and move back to a PMD mapping because it is much more complex. [1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/ This patch (of 3): In [1], PMD mappings of vmemmap pages were disabled if the feature hugetlb_free_vmemmap was enabled. This was done to simplify the initial implementation of vmemmap freeing for hugetlb pages. Now, remove this simplification by allowing PMD mapping and switching to PTE mappings as needed for allocated hugetlb pages. When a hugetlb page is allocated, the vmemmap page tables are walked to free vmemmap pages. During this walk, split huge PMD mappings to PTE mappings as required. In the unlikely case PTE pages cannot be allocated, return an error (ENOMEM) and do not optimize the vmemmap of the hugetlb page. When HugeTLB pages are freed from the pool, we do not attempt to coalesce and move back to a PMD mapping because it is much more complex. [1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chen Huang <chenhuang5@huawei.com> Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Reviewed-by: Tong Tiangen <tongtiangen@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit 77490587 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=774905878fc9b0b9a5ee4a889b97f773a077aeee ------------------------------------------------- All the infrastructure is ready, so we introduce nr_free_vmemmap_pages field in the hstate to indicate how many vmemmap pages associated with a HugeTLB page that can be freed to buddy allocator. And initialize it in the hugetlb_vmemmap_init(). This patch is actual enablement of the feature. There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP, so add a BUILD_BUG_ON to catch invalid usage of the tail struct page. Link: https://lkml.kernel.org/r/20210510030027.56044-10-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Tested-by: NChen Huang <chenhuang5@huawei.com> Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit e9fdff87 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e9fdff87e893ec5b7c32836675db80cf691b2a8b ------------------------------------------------- Add a kernel parameter hugetlb_free_vmemmap to enable the feature of freeing unused vmemmap pages associated with each hugetlb page on boot. We disable PMD mapping of vmemmap pages for x86-64 arch when this feature is enabled. Because vmemmap_remap_free() depends on vmemmap being base page mapped. Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NBarry Song <song.bao.hua@hisilicon.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Tested-by: NChen Huang <chenhuang5@huawei.com> Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: arch/x86/mm/init_64.c Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit ad2fa371 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad2fa3717b74994a22519dbe045757135db00dbb ------------------------------------------------- When we free a HugeTLB page to the buddy allocator, we need to allocate the vmemmap pages associated with it. However, we may not be able to allocate the vmemmap pages when the system is under memory pressure. In this case, we just refuse to free the HugeTLB page. This changes behavior in some corner cases as listed below: 1) Failing to free a huge page triggered by the user (decrease nr_pages). User needs to try again later. 2) Failing to free a surplus huge page when freed by the application. Try again later when freeing a huge page next time. 3) Failing to dissolve a free huge page on ZONE_MOVABLE via offline_pages(). This can happen when we have plenty of ZONE_MOVABLE memory, but not enough kernel memory to allocate vmemmmap pages. We may even be able to migrate huge page contents, but will not be able to dissolve the source huge page. This will prevent an offline operation and is unfortunate as memory offlining is expected to succeed on movable zones. Users that depend on memory hotplug to succeed for movable zones should carefully consider whether the memory savings gained from this feature are worth the risk of possibly not being able to offline memory in certain situations. 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via alloc_contig_range() - once we have that handling in place. Mainly affects CMA and virtio-mem. Similar to 3). virito-mem will handle migration errors gracefully. CMA might be able to fallback on other free areas within the CMA region. Vmemmap pages are allocated from the page freeing context. In order for those allocations to be not disruptive (e.g. trigger oom killer) __GFP_NORETRY is used. hugetlb_lock is dropped for the allocation because a non sleeping allocation would be too fragile and it could fail too easily under memory pressure. GFP_ATOMIC or other modes to access memory reserves is not used because we want to prevent consuming reserves under heavy hugetlb freeing. [mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page] Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com [willy@infradead.org: fix alloc_vmemmap_page_list documentation warning] Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com> Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Chen Huang <chenhuang5@huawei.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. 
Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: Documentation/admin-guide/mm/memory-hotplug.rst include/linux/hugetlb.h mm/hugetlb.c Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit b65d4adb category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b65d4adbc0f0d4619f61ee9d8126bc5005b78802 ------------------------------------------------- In the subsequent patch, we should allocate the vmemmap pages when freeing a HugeTLB page. But update_and_free_page() can be called under any context, so we cannot use GFP_KERNEL to allocate vmemmap pages. However, we can defer the actual freeing in a kworker to prevent from using GFP_ATOMIC to allocate the vmemmap pages. The __update_and_free_page() is where the call to allocate vmemmmap pages will be inserted. Link: https://lkml.kernel.org/r/20210510030027.56044-6-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Chen Huang <chenhuang5@huawei.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/hugetlb.c Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
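A generic sketch of the deferral pattern described above (queueing the expensive free onto a workqueue so it can use sleeping allocations); the list, work item and helper names are illustrative stand-ins, not the actual mm/hugetlb.c code:

    /* Sketch: defer work that needs GFP_KERNEL allocations out of atomic context. */
    #include <linux/workqueue.h>
    #include <linux/llist.h>

    static LLIST_HEAD(deferred_pages);            /* entries queued from any context */
    static void deferred_free_fn(struct work_struct *work);
    static DECLARE_WORK(deferred_free_work, deferred_free_fn);

    static void deferred_free_fn(struct work_struct *work)
    {
        struct llist_node *node = llist_del_all(&deferred_pages);

        /* Runs in process context: sleeping allocations (GFP_KERNEL) are allowed. */
        while (node) {
            struct llist_node *next = node->next;
            /* ... allocate vmemmap pages and free the HugeTLB page here ... */
            node = next;
        }
    }

    static void queue_page_for_free(struct llist_node *entry)
    {
        /* Safe from any context: publish the entry and kick the worker. */
        if (llist_add(entry, &deferred_pages))
            schedule_work(&deferred_free_work);
    }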
-
Submitted by Muchun Song
mainline inclusion rom mainline-v5.14 commit f41f2ed4 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f41f2ed43ca5258d70d53290d1951a21621f95c8 ------------------------------------------------- Every HugeTLB has more than one struct page structure. We __know__ that we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to store metadata associated with each HugeTLB. There are a lot of struct page structures associated with each HugeTLB page. For tail pages, the value of compound_head is the same. So we can reuse first page of tail page structures. We map the virtual addresses of the remaining pages of tail page structures to the first tail page struct, and then free these page frames. Therefore, we need to reserve two pages as vmemmap areas. When we allocate a HugeTLB page from the buddy, we can free some vmemmap pages associated with each HugeTLB page. It is more appropriate to do it in the prep_new_huge_page(). The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages associated with a HugeTLB page can be freed, returns zero for now, which means the feature is disabled. We will enable it once all the infrastructure is there. [willy@infradead.org: fix documentation warning] Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NOscar Salvador <osalvador@suse.de> Tested-by: NChen Huang <chenhuang5@huawei.com> Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/hugetlb.c Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit cd39d4e9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cd39d4e9e71c5437b67c819c3d53032145bf2879 ------------------------------------------------- For HugeTLB page, there are more metadata to save in the struct page. But the head struct page cannot meet our needs, so we have to abuse other tail struct page to store the metadata. In order to avoid conflicts caused by subsequent use of more tail struct pages, we can gather these discrete indexes of tail struct page. In this case, it will be easier to add a new tail page index later. Link: https://lkml.kernel.org/r/20210510030027.56044-4-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Tested-by: NChen Huang <chenhuang5@huawei.com> Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit 6be24bed category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6be24bed9da367c29b04e6fba8c9f27db39aa665 ------------------------------------------------- The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of some vmemmap pages associated with pre-allocated HugeTLB pages. For example, on X86_64 6 vmemmap pages of size 4KB each can be saved for each 2MB HugeTLB page. 4094 vmemmap pages of size 4KB each can be saved for each 1GB HugeTLB page. When a HugeTLB page is allocated or freed, the vmemmap array representing the range associated with the page will need to be remapped. When a page is allocated, vmemmap pages are freed after remapping. When a page is freed, previously discarded vmemmap pages must be allocated before remapping. The config option is introduced early so that supporting code can be written to depend on the option. The initial version of the code only provides support for x86-64. If config HAVE_BOOTMEM_INFO_NODE is enabled, the freeing vmemmap page code denpend on it to free vmemmap pages. Otherwise, just use free_reserved_page() to free vmemmmap pages. The routine register_page_bootmem_info() is used to register bootmem info. Therefore, make sure register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP is defined. Link: https://lkml.kernel.org/r/20210510030027.56044-3-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Acked-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Tested-by: NChen Huang <chenhuang5@huawei.com> Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com> Reviewed-by: NBalbir Singh <bsingharora@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.14 commit 426e5c42 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=426e5c429d16e4cd5ded46e21ff8e939bf8abd0f ------------------------------------------------- Patch series "Free some vmemmap pages of HugeTLB page", v23. This patch series will free some vmemmap pages(struct page structures) associated with each HugeTLB page when preallocated to save memory. In order to reduce the difficulty of the first version of code review. In this version, we disable PMD/huge page mapping of vmemmap if this feature was enabled. This acutely eliminates a bunch of the complex code doing page table manipulation. When this patch series is solid, we cam add the code of vmemmap page table manipulation in the future. The struct page structures (page structs) are used to describe a physical page frame. By default, there is an one-to-one mapping from a page frame to it's corresponding page struct. The HugeTLB pages consist of multiple base page size pages and is supported by many architectures. See hugetlbpage.rst in the Documentation directory for more details. On the x86 architecture, HugeTLB pages of size 2MB and 1GB are currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of 4096 base pages. For each base page, there is a corresponding page struct. Within the HugeTLB subsystem, only the first 4 page structs are used to contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER provides this upper limit. The only 'useful' information in the remaining page structs is the compound_head field, and this field is the same for all tail pages. By removing redundant page structs for HugeTLB pages, memory can returned to the buddy allocator for other uses. When the system boot up, every 2M HugeTLB has 512 struct page structs which size is 8 pages(sizeof(struct page) * 512 / PAGE_SIZE). HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | -------------> | 2 | | | +-----------+ +-----------+ | | | 3 | -------------> | 3 | | | +-----------+ +-----------+ | | | 4 | -------------> | 4 | | 2MB | +-----------+ +-----------+ | | | 5 | -------------> | 5 | | | +-----------+ +-----------+ | | | 6 | -------------> | 6 | | | +-----------+ +-----------+ | | | 7 | -------------> | 7 | | | +-----------+ +-----------+ | | | | | | +-----------+ The value of page->compound_head is the same for all tail pages. The first page of page structs (page 0) associated with the HugeTLB page contains the 4 page structs necessary to describe the HugeTLB. The only use of the remaining pages of page structs (page 1 to page 7) is to point to page->compound_head. Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs will be used for each HugeTLB page. This will allow us to free the remaining 6 pages to the buddy allocator. Here is how things look after remapping. 
HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | ----------------^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | 3 | ------------------+ | | | | | | +-----------+ | | | | | | | 4 | --------------------+ | | | | 2MB | +-----------+ | | | | | | 5 | ----------------------+ | | | | +-----------+ | | | | | 6 | ------------------------+ | | | +-----------+ | | | | 7 | --------------------------+ | | +-----------+ | | | | | | +-----------+ When a HugeTLB is freed to the buddy system, we should allocate 6 pages for vmemmap pages and restore the previous mapping relationship. Apart from 2MB HugeTLB page, we also have 1GB HugeTLB page. It is similar to the 2MB HugeTLB page. We also can use this approach to free the vmemmap pages. In this case, for the 1GB HugeTLB page, we can save 4094 pages. This is a very substantial gain. On our server, run some SPDK/QEMU applications which will use 1024GB HugeTLB page. With this feature enabled, we can save ~16GB (1G hugepage)/~12GB (2MB hugepage) memory. Because there are vmemmap page tables reconstruction on the freeing/allocating path, it increases some overhead. Here are some overhead analysis. 1) Allocating 10240 2MB HugeTLB pages. a) With this patch series applied: # time echo 10240 > /proc/sys/vm/nr_hugepages real 0m0.166s user 0m0.000s sys 0m0.166s # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }' Attaching 2 probes... @latency: [8K, 16K) 5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [16K, 32K) 4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32K, 64K) 4 | | b) Without this patch series: # time echo 10240 > /proc/sys/vm/nr_hugepages real 0m0.067s user 0m0.000s sys 0m0.067s # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }' Attaching 2 probes... @latency: [4K, 8K) 10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [8K, 16K) 93 | | Summarize: this feature is about ~2x slower than before. 2) Freeing 10240 2MB HugeTLB pages. a) With this patch series applied: # time echo 0 > /proc/sys/vm/nr_hugepages real 0m0.213s user 0m0.000s sys 0m0.213s # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; } kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }' Attaching 2 probes... @latency: [8K, 16K) 6 | | [16K, 32K) 10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [32K, 64K) 7 | | b) Without this patch series: # time echo 0 > /proc/sys/vm/nr_hugepages real 0m0.081s user 0m0.000s sys 0m0.081s # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; } kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }' Attaching 2 probes... @latency: [4K, 8K) 6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [8K, 16K) 3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@ | [16K, 32K) 8 | | Summary: The overhead of __free_hugepage is about ~2-3x slower than before. Although the overhead has increased, the overhead is not significant. Like Mike said, "However, remember that the majority of use cases create HugeTLB pages at or shortly after boot time and add them to the pool. 
So, additional overhead is at pool creation time. There is no change to 'normal run time' operations of getting a page from or returning a page to the pool (think page fault/unmap)". Despite the overhead and in addition to the memory gains from this series. The following data is obtained by Joao Martins. Very thanks to his effort. There's an additional benefit which is page (un)pinners will see an improvement and Joao presumes because there are fewer memmap pages and thus the tail/head pages are staying in cache more often. Out of the box Joao saw (when comparing linux-next against linux-next + this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages): get_user_pages(): ~32k -> ~9k unpin_user_pages(): ~75k -> ~70k Usually any tight loop fetching compound_head(), or reading tail pages data (e.g. compound_head) benefit a lot. There's some unpinning inefficiencies Joao was fixing[2], but with that in added it shows even more: unpin_user_pages(): ~27k -> ~3.8k [1] https://lore.kernel.org/linux-mm/20210409205254.242291-1-mike.kravetz@oracle.com/ [2] https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.martins@oracle.com/ This patch (of 9): Move bootmem info registration common API to individual bootmem_info.c. And we will use {get,put}_page_bootmem() to initialize the page for the vmemmap pages or free the vmemmap pages to buddy in the later patch. So move them out of CONFIG_MEMORY_HOTPLUG_SPARSE. This is just code movement without any functional change. Link: https://lkml.kernel.org/r/20210510030027.56044-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210510030027.56044-2-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Tested-by: NChen Huang <chenhuang5@huawei.com> Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Oliver Neukum <oneukum@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Mina Almasry <almasrymina@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/Makefile Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Mike Kravetz
mainline inclusion from mainline-v5.13-rc1 commit 9487ca60 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9487ca60fd7fa2c259f0daba8e2e01e51a64da05 ------------------------------------------------- After making hugetlb lock irq safe and separating some functionality done under the lock, add some lockdep_assert_held to help verify locking. Link: https://lkml.kernel.org/r/20210409205254.242291-9-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
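A tiny sketch of what such an annotation looks like; hugetlb_lock is the real lock, but the helper around it is a stand-in, not a copy of the patched hugetlb routines:

    /* Sketch: document and verify a locking requirement with lockdep. */
    #include <linux/lockdep.h>
    #include <linux/spinlock.h>

    extern spinlock_t hugetlb_lock;

    static void adjust_pool_counters_locked(void)
    {
        /* Complains (with CONFIG_LOCKDEP) if the caller forgot to take the lock. */
        lockdep_assert_held(&hugetlb_lock);

        /* ... update free/surplus pool counters here ... */
    }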
-
Submitted by Mike Kravetz
mainline inclusion from mainline-v5.13-rc1 commit db71ef79 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=db71ef79b59bb2e78dc4df83d0e4bf6beaa5c82d ------------------------------------------------- Commit c77c0a8a ("mm/hugetlb: defer freeing of huge pages if in non-task context") was added to address the issue of free_huge_page being called from irq context. That commit hands off free_huge_page processing to a workqueue if !in_task. However, this doesn't cover all the cases as pointed out by 0day bot lockdep report [1]. : Possible interrupt unsafe locking scenario: : : CPU0 CPU1 : ---- ---- : lock(hugetlb_lock); : local_irq_disable(); : lock(slock-AF_INET); : lock(hugetlb_lock); : <Interrupt> : lock(slock-AF_INET); Shakeel has later explained that this is very likely TCP TX zerocopy from hugetlb pages scenario when the networking code drops a last reference to hugetlb page while having IRQ disabled. Hugetlb freeing path doesn't disable IRQ while holding hugetlb_lock so a lock dependency chain can lead to a deadlock. This commit addresses the issue by doing the following: - Make hugetlb_lock irq safe. This is mostly a simple process of changing spin_*lock calls to spin_*lock_irq* calls. - Make subpool lock irq safe in a similar manner. - Revert the !in_task check and workqueue handoff. [1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/ Link: https://lkml.kernel.org/r/20210409205254.242291-8-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/hugetlb_cgroup.c mm/hugetlb.c Signed-off-by: NChen Huang <chenhuang5@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
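A minimal sketch of the kind of conversion this commit performs (a plain spinlock section becomes an IRQ-disabling one); the lock and counter below are placeholders, not real hugetlb state:

    /* Sketch: make a lock safe to take in paths reachable with IRQs involved
     * by disabling interrupts while it is held. */
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(pool_lock);
    static unsigned long free_count;

    static void pool_count_inc(void)
    {
        /* Before: spin_lock(&pool_lock); ... spin_unlock(&pool_lock);
         * That is deadlock-prone if the same lock can also be taken with IRQs off. */
        spin_lock_irq(&pool_lock);
        free_count++;
        spin_unlock_irq(&pool_lock);
    }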
-
Submitted by Mike Kravetz
mainline inclusion
from mainline-v5.13-rc1
commit 10c6ec49
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10c6ec49802b1779c01fc029cfd92ea20ae33c06

-------------------------------------------------

free_pool_huge_page was called with hugetlb_lock held. It would remove a hugetlb page, and then free the corresponding pages to the lower level allocators such as buddy. free_pool_huge_page was called in a loop to remove hugetlb pages and these loops could hold the hugetlb_lock for a considerable time.

Create new routine remove_pool_huge_page to replace free_pool_huge_page. remove_pool_huge_page will remove the hugetlb page, and it must be called with the hugetlb_lock held. It will return the removed page and it is the responsibility of the caller to free the page to the lower level allocators. The hugetlb_lock is dropped before freeing to these allocators which results in shorter lock hold times.

Add new helper routine to call update_and_free_page for a list of pages.

Note: Some changes to the routine return_unused_surplus_pages are in need of explanation. Commit e5bbc8a6 ("mm/hugetlb.c: fix reservation race when freeing surplus pages") modified this routine to address a race which could occur when dropping the hugetlb_lock in the loop that removes pool pages. Accounting changes introduced in that commit were subtle and took some thought to understand. This commit removes the cond_resched_lock() and the potential race. Therefore, remove the subtle code and restore the more straightforward accounting, effectively reverting the commit.

Link: https://lkml.kernel.org/r/20210409205254.242291-7-mike.kravetz@oracle.com
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
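A minimal sketch of the restructured loop, assuming hypothetical stand-in routines (example_remove_pool_huge_page, example_update_and_free_page) rather than the real hugetlb functions: detach pages while holding the lock, then free them to the lower-level allocators with the lock dropped.

```c
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);

/* Hypothetical stand-ins for the routines discussed above. */
struct page *example_remove_pool_huge_page(void);      /* needs example_lock */
void example_update_and_free_page(struct page *page);  /* slow, unlocked     */

static void example_shrink_pool(unsigned long count)
{
	LIST_HEAD(pages);
	struct page *page, *next;

	spin_lock_irq(&example_lock);
	while (count--) {
		page = example_remove_pool_huge_page();
		if (!page)
			break;
		list_add(&page->lru, &pages);
	}
	spin_unlock_irq(&example_lock);

	/* Freeing to buddy is comparatively slow; do it without the lock. */
	list_for_each_entry_safe(page, next, &pages, lru) {
		list_del(&page->lru);
		example_update_and_free_page(page);
	}
}
```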
-
Submitted by Mike Kravetz
mainline inclusion
from mainline-v5.13-rc1
commit 1121828a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1121828a0c213caa55ddd5ee23ee78e99cbdd33e

-------------------------------------------------

With the introduction of remove_hugetlb_page(), there is no need for update_and_free_page to hold the hugetlb lock. Change all callers to drop the lock before calling.

With additional code modifications, this will allow loops which decrease the huge page pool to drop the hugetlb_lock with each page to reduce long hold times.

The ugly unlock/lock cycle in free_pool_huge_page will be removed in a subsequent patch which restructures free_pool_huge_page.

Link: https://lkml.kernel.org/r/20210409205254.242291-6-mike.kravetz@oracle.com
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Mike Kravetz
mainline inclusion
from mainline-v5.13-rc1
commit 6eb4e88a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6eb4e88a6d27022ea8aff424d47a0a5dfc9fcb34

-------------------------------------------------

The new remove_hugetlb_page() routine is designed to remove a hugetlb page from hugetlbfs processing. It will remove the page from the active or free list, update global counters and set the compound page destructor to NULL so that PageHuge() will return false for the 'page'. After this call, the 'page' can be treated as a normal compound page or a collection of base size pages.

update_and_free_page no longer decrements h->nr_huge_pages{_node} as this is performed in remove_hugetlb_page. The only functionality performed by update_and_free_page is to free the base pages to the lower level allocators.

update_and_free_page is typically called after remove_hugetlb_page.

remove_hugetlb_page is to be called with the hugetlb_lock held.

Creating this routine and separating functionality is in preparation for restructuring code to reduce lock hold times. This commit should not introduce any changes to functionality.

Link: https://lkml.kernel.org/r/20210409205254.242291-5-mike.kravetz@oracle.com
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
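The following is a simplified, illustrative sketch of what such a remove step does; the struct and function names are stand-ins, not the actual remove_hugetlb_page() implementation.

```c
#include <linux/list.h>
#include <linux/lockdep.h>
#include <linux/mm.h>
#include <linux/numa.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);

struct example_hstate {
	unsigned long nr_pages;
	unsigned long nr_pages_node[MAX_NUMNODES];
};

static void example_remove_huge_page(struct example_hstate *h,
				     struct page *page)
{
	lockdep_assert_held(&example_lock);

	list_del(&page->lru);			/* off the free/active list */
	h->nr_pages--;				/* accounting moves here ... */
	h->nr_pages_node[page_to_nid(page)]--;

	/*
	 * ... and clearing the compound destructor means PageHuge() stops
	 * recognising the page, so the later free step only has to hand
	 * base pages back to the buddy allocator.
	 */
	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
}
```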
-
Submitted by Mike Kravetz
mainline inclusion
from mainline-v5.13-rc1
commit 29383967
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2938396771c8fd0870b5284319f9e78b4b552a79

-------------------------------------------------

The helper routine hstate_next_node_to_alloc accesses and modifies the hstate variable next_nid_to_alloc. The helper is used by the routines alloc_pool_huge_page and adjust_pool_surplus. adjust_pool_surplus is called with hugetlb_lock held. However, alloc_pool_huge_page can not be called with the hugetlb lock held as it will call the page allocator. Two instances of alloc_pool_huge_page could be run in parallel or alloc_pool_huge_page could run in parallel with adjust_pool_surplus which may result in the variable next_nid_to_alloc becoming invalid for the caller and pages being allocated on the wrong node.

Both alloc_pool_huge_page and adjust_pool_surplus are only called from the routine set_max_huge_pages after boot. set_max_huge_pages is only called as the result of a user writing to the proc/sysfs nr_hugepages, or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.

It makes little sense to allow multiple adjustments to the number of hugetlb pages in parallel. Add a mutex to the hstate and use it to only allow one hugetlb page adjustment at a time. This will synchronize modifications to the next_nid_to_alloc variable.

Link: https://lkml.kernel.org/r/20210409205254.242291-4-mike.kravetz@oracle.com
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
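A short sketch of the serialisation idea, with an illustrative struct standing in for struct hstate (the field and function names are assumptions, not the real layout):

```c
#include <linux/mutex.h>

struct example_hstate {
	struct mutex resize_lock;	/* assumed initialised at hstate setup */
	int next_nid_to_alloc;
	unsigned long nr_huge_pages;
};

/*
 * Every pool-size adjustment (a write to nr_hugepages or
 * nr_hugepages_mempolicy) takes the per-hstate mutex, so only one
 * adjustment at a time can walk next_nid_to_alloc, even though parts of
 * the adjustment run without hugetlb_lock while calling the page allocator.
 */
static int example_set_max_huge_pages(struct example_hstate *h,
				      unsigned long count)
{
	mutex_lock(&h->resize_lock);
	/* ... grow or shrink the pool, round-robin over NUMA nodes ... */
	mutex_unlock(&h->resize_lock);
	return 0;
}
```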
-
Submitted by Mike Kravetz
mainline inclusion
from mainline-v5.13-rc1
commit 262443c0
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=262443c0421e832e5312d2b14e0a2640a9f064d7

-------------------------------------------------

Now that cma_release is non-blocking and irq safe, there is no need to drop hugetlb_lock before calling.

Link: https://lkml.kernel.org/r/20210409205254.242291-3-mike.kravetz@oracle.com
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
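A hedged sketch of the simplification, using a hypothetical CMA area and helper name rather than the real free_gigantic_page() path: the old unlock -> cma_release() -> lock dance is no longer needed once cma_release() stops sleeping.

```c
#include <linux/cma.h>
#include <linux/lockdep.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);
struct cma *example_cma;	/* hypothetical CMA area backing gigantic pages */

static void example_free_gigantic(struct page *page, unsigned int order)
{
	/* Safe to call with the (irq-safe) pool lock still held. */
	lockdep_assert_held(&example_lock);
	cma_release(example_cma, page, 1U << order);
}
```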
-
Submitted by Mike Kravetz
mainline inclusion
from mainline-v5.13-rc1
commit 0ef7dcac
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0ef7dcac998fefc4767b7f10eb3b6df150c38a4e

-------------------------------------------------

Patch series "make hugetlb put_page safe for all calling contexts", v5.

This effort is the result of a recent bug report [1]. Syzbot found a potential deadlock in the hugetlb put_page/free_huge_page_path.

WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected

Since the free_huge_page_path already has code to 'hand off' page free requests to a workqueue, a suggestion was proposed to make the in_irq() detection accurate by always enabling PREEMPT_COUNT [2]. The outcome of that discussion was that the hugetlb put_page path (free_huge_page) should be properly fixed and safe for all calling contexts.

[1] https://lore.kernel.org/linux-mm/000000000000f1c03b05bc43aadc@google.com/
[2] http://lkml.kernel.org/r/20210311021321.127500-1-mike.kravetz@oracle.com

This patch (of 8):

cma_release is currently a sleepable operation because the bitmap manipulation is protected by cma->lock mutex. Hugetlb code which relies on cma_release for CMA backed (giga) hugetlb pages, however, needs to be irq safe.

The lock doesn't protect any sleepable operation so it can be changed to a (irq aware) spin lock. The bitmap processing should be quite fast in typical case but if cma sizes grow to TB then we will likely need to replace the lock by a more optimized bitmap implementation.

Link: https://lkml.kernel.org/r/20210409205254.242291-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20210409205254.242291-2-mike.kravetz@oracle.com
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
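The mutex-to-spinlock conversion described here can be sketched as follows; the structure is a simplified stand-in for struct cma, not the actual mm/cma.c definition. Because the critical section only flips bitmap bits, it never sleeps and can safely run with IRQs disabled.

```c
#include <linux/bitmap.h>
#include <linux/spinlock.h>

struct example_cma {
	spinlock_t lock;		/* was: struct mutex lock */
	unsigned long *bitmap;
	unsigned long count;
};

static void example_cma_clear_bitmap(struct example_cma *cma,
				     unsigned long start, unsigned long nr)
{
	unsigned long flags;

	spin_lock_irqsave(&cma->lock, flags);
	bitmap_clear(cma->bitmap, start, nr);
	spin_unlock_irqrestore(&cma->lock, flags);
}
```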
-
Submitted by Mike Kravetz
mainline inclusion
from mainline-v5.12-rc1
commit 6c037149
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6c037149014027d50175da5be4ae4531374dcbe0

-------------------------------------------------

Use new hugetlb specific HPageFreed flag to replace the PageHugeFreed interfaces.

Link: https://lkml.kernel.org/r/20210122195231.324857-6-mike.kravetz@oracle.com
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Mike Kravetz
mainline inclusion
from mainline-v5.12-rc1
commit 9157c311
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9157c31186c358c5750dea50ac5705d61d7fc917

-------------------------------------------------

Use new hugetlb specific HPageTemporary flag to replace the PageHugeTemporary() interfaces. PageHugeTemporary does contain a PageHuge() check. However, this interface is only used within hugetlb code where we know we are dealing with a hugetlb page. Therefore, the check can be eliminated.

Link: https://lkml.kernel.org/r/20210122195231.324857-5-mike.kravetz@oracle.com
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Mike Kravetz
mainline inclusion
from mainline-v5.12-rc1
commit 8f251a3d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8f251a3d5ce3bdea73bd045ed35db64f32e0d0d9

-------------------------------------------------

Use the new hugetlb page specific flag HPageMigratable to replace the page_huge_active interfaces. By its name, page_huge_active implied that a huge page was on the active list. However, that is not really what code checking the flag wanted to know. It really wanted to determine if the huge page could be migrated. This happens when the page is actually added to the page cache and/or task page table. This is the reasoning behind the name change.

The VM_BUG_ON_PAGE() calls in the *_huge_active() interfaces are not really necessary as we KNOW the page is a hugetlb page. Therefore, they are removed.

The routine page_huge_active checked for PageHeadHuge before testing the active bit. This is unnecessary in the case where we hold a reference or lock and know it is a hugetlb head page. page_huge_active is also called without holding a reference or lock (scan_movable_pages), and can race with code freeing the page. The extra check in page_huge_active shortened the race window, but did not prevent the race. Offline code calling scan_movable_pages already deals with these races, so removing the check is acceptable. Add comment to racy code.

[songmuchun@bytedance.com: remove set_page_huge_active() declaration from include/linux/hugetlb.h]
Link: https://lkml.kernel.org/r/CAMZfGtUda+KoAZscU0718TN61cSFwp4zy=y2oZ=+6Z2TAZZwng@mail.gmail.com
Link: https://lkml.kernel.org/r/20210122195231.324857-3-mike.kravetz@oracle.com
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
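As an illustration of the flag-based replacement (the names below are stand-ins, not the real HPageMigratable accessors): the question callers ask is "may this huge page be migrated?", answered by a bit on the head page that is set only once the page is in the page cache or a page table.

```c
#include <linux/bitops.h>
#include <linux/mm_types.h>

/* Illustrative stand-in for the hugetlb "migratable" flag bit. */
#define EXAMPLE_HPG_MIGRATABLE	1

/*
 * No PageHeadHuge() re-check here: callers holding a reference or lock
 * already know they have a hugetlb head page. Lockless callers (such as
 * the page-offline scan) must tolerate racing with the page being freed.
 */
static inline bool example_hpage_migratable(struct page *head)
{
	return test_bit(EXAMPLE_HPG_MIGRATABLE, &head->private);
}

static inline void example_set_hpage_migratable(struct page *head)
{
	set_bit(EXAMPLE_HPG_MIGRATABLE, &head->private);
}
```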
-
Submitted by Mike Kravetz
mainline inclusion
from mainline-v5.12-rc1
commit d6995da3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZCW9
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6995da311221a05c8aef3bda2629e5cb14c7302

-------------------------------------------------

Patch series "create hugetlb flags to consolidate state", v3.

While discussing a series of hugetlb fixes in [1], it became evident that the hugetlb specific page state information is stored in a somewhat haphazard manner. Code dealing with state information would be easier to read, understand and maintain if this information was stored in a consistent manner.

This series uses page.private of the hugetlb head page for storing a set of hugetlb specific page flags. Routines are provided for test, set and clear of the flags.

[1] https://lore.kernel.org/r/20210106084739.63318-1-songmuchun@bytedance.com

This patch (of 4):

As hugetlbfs evolved, state information about hugetlb pages was added. One 'convenient' way of doing this was to use available fields in tail pages. Over time, it has become difficult to know the meaning or contents of fields simply by looking at a small bit of code. Sometimes, the naming is just confusing. For example: The PagePrivate flag indicates a huge page reservation was consumed and needs to be restored if an error is encountered and the page is freed before it is instantiated. The page.private field contains the pointer to a subpool if the page is associated with one.

In an effort to make the code more readable, use page.private to contain hugetlb specific page flags. These flags will have test, set and clear functions similar to those used for 'normal' page flags. More importantly, an enum of flag values will be created with names that actually reflect their purpose.

In this patch,
- Create infrastructure for hugetlb specific page flag functions
- Move subpool pointer to page[1].private to make way for flags.
  Create routines with meaningful names to modify subpool field
- Use new HPageRestoreReserve flag instead of PagePrivate

Conversion of other state information will happen in subsequent patches.

Link: https://lkml.kernel.org/r/20210122195231.324857-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20210122195231.324857-2-mike.kravetz@oracle.com
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Huang <chenhuang5@huawei.com>
Signed-off-by: NNanyong Sun <sunnanyong@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
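A compressed sketch of the infrastructure this patch introduces, modelled on the idea rather than copied from include/linux/hugetlb.h; all names carry an Example/Ex prefix to mark them as illustrative.

```c
#include <linux/bitops.h>
#include <linux/mm_types.h>

/* Hugetlb state bits live in the head page's page.private. */
enum example_hugetlb_page_flags {
	EXHPG_restore_reserve = 0,
	EXHPG_migratable,
};

/* One macro per flag generates the test/set/clear accessors. */
#define EXAMPLE_HPAGEFLAG(uname, flname)				\
static inline bool ExHPage##uname(struct page *page)			\
	{ return test_bit(EXHPG_##flname, &page->private); }		\
static inline void SetExHPage##uname(struct page *page)		\
	{ set_bit(EXHPG_##flname, &page->private); }			\
static inline void ClearExHPage##uname(struct page *page)		\
	{ clear_bit(EXHPG_##flname, &page->private); }

EXAMPLE_HPAGEFLAG(RestoreReserve, restore_reserve)
EXAMPLE_HPAGEFLAG(Migratable, migratable)

/*
 * With flags occupying page.private of the head page, the subpool pointer
 * moves to the first tail page's ->private, behind named accessors.
 */
struct example_subpool;

static inline struct example_subpool *example_page_subpool(struct page *head)
{
	return (struct example_subpool *)head[1].private;
}

static inline void example_set_page_subpool(struct page *head,
					    struct example_subpool *spool)
{
	head[1].private = (unsigned long)spool;
}
```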
-
Submitted by Nicholas Piggin
mainline inclusion
from mainline-5.13-rc5
commit 5362a4b6
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZGKZ
CVE: NA

-------------------------------------------------

real_vmalloc_addr() does not currently work for huge vmalloc, which is what the reverse map can be allocated with for radix host, hash guest.

Extract the hugepage aware equivalent from eeh code into a helper, and convert existing sites including this one to use it.

Fixes: 8abddd96 ("powerpc/64s/radix: Enable huge vmalloc mappings")
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210526120005.3432222-1-npiggin@gmail.com
Signed-off-by: NChen Wandun <chenwandun@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Daniel Axtens
mainline inclusion
from mainline-5.13
commit 7ca3027b
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZGKZ
CVE: NA

-------------------------------------------------

In commit 121e6f32 ("mm/vmalloc: hugepage vmalloc mappings"), __vmalloc_node_range was changed such that __get_vm_area_node was no longer called with the requested/real size of the vmalloc allocation, but rather with a rounded-up size.

This means that __get_vm_area_node called kasan_unpoison_vmalloc() with a rounded up size rather than the real size. This led to it allowing access to too much memory and so missing vmalloc OOBs and failing the kasan kunit tests.

Pass the real size and the desired shift into __get_vm_area_node. This allows it to round up the size for the underlying allocators while still unpoisoning the correct quantity of shadow memory.

Adjust the other call-sites to pass in PAGE_SHIFT for the shift value.

Link: https://lkml.kernel.org/r/20210617081330.98629-1-dja@axtens.net
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213335
Fixes: 121e6f32 ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: NDaniel Axtens <dja@axtens.net>
Tested-by: NDavid Gow <davidgow@google.com>
Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
Reviewed-by: NUladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: NAndrey Konovalov <andreyknvl@gmail.com>
Acked-by: NAndrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NChen Wandun <chenwandun@huawei.com>
Reviewed-by: NTong Tiangen <tongtiangen@huawei.com>
Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
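A conceptual sketch of the fix, with a hypothetical address-space reservation helper standing in for the real vmalloc internals: the allocator-facing size may be rounded up to the huge-page granule, but KASAN must unpoison only the size the caller actually asked for.

```c
#include <linux/kasan.h>
#include <linux/kernel.h>

/* Hypothetical stand-in for reserving a chunk of vmalloc address space. */
void *example_reserve_vmap_range(unsigned long size);

static void *example_get_vm_area(unsigned long real_size, unsigned int shift)
{
	unsigned long padded_size = ALIGN(real_size, 1UL << shift);
	void *addr = example_reserve_vmap_range(padded_size);

	if (!addr)
		return NULL;

	/* Do not open up the rounded-up tail to the caller. */
	kasan_unpoison_vmalloc(addr, real_size);
	return addr;
}
```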
-