- 14 3月, 2022 1 次提交
-
-
由 Chen Wandun 提交于
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ---------------------------------------------- It is inaccurate when accumulate percpu variable without lock, it will result in pagecache_reliable_pages to be negative in sometime, and will prevent pagecache using reliable memory. For more accurate statistic, replace percpu variable by percpu_counter. The additional percpu_counter will be access in alloc_pages, the init of these two percpu_conter is too late in late_initcall, allocations that use alloc_pages should have to check if the two counter has been inited, that will introduce latency, in order to sovle this, init the two percpu_counter in advance. Signed-off-by: NChen Wandun <chenwandun@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NLaibin Qiu <qiulaibin@huawei.com> Signed-off-by: NYongqiang Liu <liuyongqiang13@huawei.com>
-
- 04 3月, 2022 1 次提交
-
-
由 Ma Wupeng 提交于
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------ Memory allocation from buddy system will fallback to unmirrored memory if mirrored memory is not enough and reliable memory fallback is closed. ___GFP_NOFAIL flag will be checked in prepare_before_alloc(). Fixes: 45bd608e ("mm: Introduce watermark check for memory reliable") Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 01 3月, 2022 3 次提交
-
-
由 Ma Wupeng 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ---------------------------------------------- Introuce min watermark to ensure kernel can use this reserved memory. This watermark will prevent special user tasks, pagecache, tmpfs user from using too many mirrored memory which will lead to kernel oom. Memory allocation with ___GFP_RELIABILITY or special user tasks will be blocked after this watermark is reached. This memory allocation will fallback to non-mirrored region or trigger oom depends on memory reliable fallback's status. This limit can be set or access via /proc/sys/vm/reliable_reserve_size. The default value of this limit is 256M. This limit can be set from 256M to the total size of mirrored memory. Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NXie XiuQi <xiexiuqi@huawei.com> Reviewed-by: Kefeng Wang<wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Ma Wupeng 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Add a counter to count mirrored pages in buddy system. Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang<wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Ma Wupeng 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- With this patch, kernel will check mirrored_kernelcore before calling efi_find_mirror() which will enable basic mirrored feature. If system have some mirrored memory and mirrored feature is not spcified in boot parameter, the basic mirrored feature will be enabled and this will lead to the following situations: - memblock memory allocation perfers mirrored region. This may have some unexpected influence on numa affinity. - contiguous memory will be splited into server parts if parts of them is mirrored memroy via memblock_mark_mirror(). Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NXie XiuQi <xiexiuqi@huawei.com> Reviewed-by: Kefeng Wang<wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 09 2月, 2022 8 次提交
-
-
由 Ding Hui 提交于
mainline inclusion from linux-v5.13-rc5 commit bac9c6fa category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA -------------------------------- Recently we found that there is a lot MemFree left in /proc/meminfo after do a lot of pages soft offline, it's not quite correct. Before Oscar's rework of soft offline for free pages [1], if we soft offline free pages, these pages are left in buddy with HWPoison flag, and NR_FREE_PAGES is not updated immediately. So the difference between NR_FREE_PAGES and real number of available free pages is also even big at the beginning. However, with the workload running, when we catch HWPoison page in any alloc functions subsequently, we will remove it from buddy, meanwhile update the NR_FREE_PAGES and try again, so the NR_FREE_PAGES will get more and more closer to the real number of available free pages. (regardless of unpoison_memory()) Now, for offline free pages, after a successful call take_page_off_buddy(), the page is no longer belong to buddy allocator, and will not be used any more, but we missed accounting NR_FREE_PAGES in this situation, and there is no chance to be updated later. Do update in take_page_off_buddy() like rmqueue() does, but avoid double counting if some one already set_migratetype_isolate() on the page. [1]: commit 06be6ff3 ("mm,hwpoison: rework soft offline for free pages") Link: https://lkml.kernel.org/r/20210526075247.11130-1-dinghui@sangfor.com.cn Fixes: 06be6ff3 ("mm,hwpoison: rework soft offline for free pages") Signed-off-by: NDing Hui <dinghui@sangfor.com.cn> Suggested-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Acked-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Oscar Salvador 提交于
mainline inclusion from linux-v5.10-rc1 commit 79f5f8fa category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA -------------------------------- keep set_hwpoison_free_buddy_page exported to avoid kapi change. This patch changes the way we set and handle in-use poisoned pages. Until now, poisoned pages were released to the buddy allocator, trusting that the checks that take place at allocation time would act as a safe net and would skip that page. This has proved to be wrong, as we got some pfn walkers out there, like compaction, that all they care is the page to be in a buddy freelist. Although this might not be the only user, having poisoned pages in the buddy allocator seems a bad idea as we should only have free pages that are ready and meant to be used as such. Before explaining the taken approach, let us break down the kind of pages we can soft offline. - Anonymous THP (after the split, they end up being 4K pages) - Hugetlb - Order-0 pages (that can be either migrated or invalited) * Normal pages (order-0 and anon-THP) - If they are clean and unmapped page cache pages, we invalidate then by means of invalidate_inode_page(). - If they are mapped/dirty, we do the isolate-and-migrate dance. Either way, do not call put_page directly from those paths. Instead, we keep the page and send it to page_handle_poison to perform the right handling. page_handle_poison sets the HWPoison flag and does the last put_page. Down the chain, we placed a check for HWPoison page in free_pages_prepare, that just skips any poisoned page, so those pages do not end up in any pcplist/freelist. After that, we set the refcount on the page to 1 and we increment the poisoned pages counter. If we see that the check in free_pages_prepare creates trouble, we can always do what we do for free pages: - wait until the page hits buddy's freelists - take it off, and flag it The downside of the above approach is that we could race with an allocation, so by the time we want to take the page off the buddy, the page has been already allocated so we cannot soft offline it. But the user could always retry it. * Hugetlb pages - We isolate-and-migrate them After the migration has been successful, we call dissolve_free_huge_page, and we set HWPoison on the page if we succeed. Hugetlb has a slightly different handling though. While for non-hugetlb pages we cared about closing the race with an allocation, doing so for hugetlb pages requires quite some additional and intrusive code (we would need to hook in free_huge_page and some other places). So I decided to not make the code overly complicated and just fail normally if the page we allocated in the meantime. We can always build on top of this. As a bonus, because of the way we handle now in-use pages, we no longer need the put-as-isolation-migratetype dance, that was guarding for poisoned pages to end up in pcplists. Signed-off-by: NOscar Salvador <osalvador@suse.de> Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Conflicts: mm/page_alloc.c Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Oscar Salvador 提交于
mainline inclusion from linux-v5.10-rc1 commit 06be6ff3 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22 CVE: NA -------------------------------- When trying to soft-offline a free page, we need to first take it off the buddy allocator. Once we know is out of reach, we can safely flag it as poisoned. take_page_off_buddy will be used to take a page meant to be poisoned off the buddy allocator. take_page_off_buddy calls break_down_buddy_pages, which splits a higher-order page in case our page belongs to one. Once the page is under our control, we call page_handle_poison to set it as poisoned and grab a refcount on it. Signed-off-by: NOscar Salvador <osalvador@suse.de> Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Conflicts: mm/page_alloc.c Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Chen Wandun 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------------------- In function cache_limit_ratio_sysctl_handler and cache_limit_mbytes_sysctl_handler, it will shrink page cache even if vm_cache_reclaim_enable is false, it is unexpected. Signed-off-by: NChen Wandun <chenwandun@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Ma Wupeng 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Introduce fallback mechanism for memory reliable. The following process will fallback to non-mirrored region if their allocation from mirrored region failed - User tasks with reliable flag - thp collapse pages - init tasks - pagecache - tmpfs In order to achieve this goals. Buddy system will fallback to non-mirrored in the following situations. - if __GFP_THISNODE is set in gfp_mask and dest nodes do not have any zones available - high_zoneidx will be set to ZONE_MOVABLE to alloc memory before oom This mechanism is enabled by defalut and can be disabled by adding "reliable_debug=F" to the kernel parameters. This mechanism rely on CONFIG_MEMORY_RELIABLE and need "kernelcore=reliable" in the kernel parameters. Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Peng Wu 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ---------------------------------------------- there is a upper limit for special user tasks's memory allocation. special user task means user task with reliable flag. Init tasks will alloc memory from non-mirrored region if their allocation trigger limit. The limit can be set or access via /proc/sys/vm/task_reliable_limit This limit's default value is ULONG_MAX. Signed-off-by: NPeng Wu <wupeng58@huawei.com> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Peng Wu 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------ Adding reliable flag for user task. User task with reliable flag can only alloc memory from mirrored region. PF_RELIABLE is added to represent the task's reliable flag. - For init task, which is regarded as as special task which alloc memory from mirrored region. - For normal user tasks, The reliable flag can be set via procfs interface shown as below and can be inherited via fork(). User can change a user task's reliable flag by $ echo [0/1] > /proc/<pid>/reliable and check a user task's reliable flag by $ cat /proc/<pid>/reliable Note, global init task's reliable file can not be accessed. Signed-off-by: NPeng Wu <wupeng58@huawei.com> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Ma Wupeng 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Introduction Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> ============ Memory reliable feature is a memory tiering mechanism. It is based on kernel mirror feature, which splits memory into two sperate regions, mirrored(reliable) region and non-mirrored (non-reliable) region. for kernel mirror feature: - allocate kernel memory from mirrored region by default - allocate user memory from non-mirrored region by default non-mirrored region will be arranged into ZONE_MOVABLE. for kernel reliable feature, it has additional features below: - normal user tasks never alloc memory from mirrored region with userspace apis(malloc, mmap, etc.) - special user tasks will allocate memory from mirrored region by default - tmpfs/pagecache allocate memory from mirrored region by default - upper limit of mirrored region allcated for user tasks, tmpfs and pagecache Support Reliable fallback mechanism which allows special user tasks, tmpfs and pagecache can fallback to alloc non-mirrored region, it's the default setting. In order to fulfil the goal - ___GFP_RELIABILITY flag added for alloc memory from mirrored region. - the high_zoneidx for special user tasks/tmpfs/pagecache is set to ZONE_NORMAL. - normal user tasks could only alloc from ZONE_MOVABLE. This patch is just the main framework, memory reliable support for special user tasks, pagecache and tmpfs has own patches. To enable this function, mirrored(reliable) memory is needed and "kernelcore=reliable" should be added to kernel parameters. Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 30 12月, 2021 1 次提交
-
-
由 Peng Liu 提交于
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4EG0R CVE: NA ----------------------------------------------- Add cmdline to disable "place pages to tail" when online memory. Signed-off-by: NPeng Liu <liupeng256@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 20 10月, 2021 6 次提交
-
-
由 David Hildenbrand 提交于
mainline inclusion from mainline-v5.10-rc1 commit 7fef431b category: feature bugzilla: 182882 CVE: NA __free_pages_core() is used when exposing fresh memory to the buddy during system boot and when onlining memory in generic_online_page(). generic_online_page() is used in two cases: 1. Direct memory onlining in online_pages(). 2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV balloon and virtio-mem), when parts of a section are kept fake-offline to be fake-onlined later on. In 1, we already place pages to the tail of the freelist. Pages will be freed to MIGRATE_ISOLATE lists first and moved to the tail of the freelists via undo_isolate_page_range(). In 2, we currently don't implement a proper rule. In case of virtio-mem, where we currently always online MAX_ORDER - 1 pages, the pages will be placed to the HEAD of the freelist - undesireable. While the hyper-v balloon calls generic_online_page() with single pages, usually it will call it on successive single pages in a larger block. The pages are fresh, so place them to the tail of the freelist and avoid the PCP. In __free_pages_core(), remove the now superflouos call to set_page_refcounted() and add a comment regarding page initialization and the refcount. Note: In 2. we currently don't shuffle. If ever relevant (page shuffling is usually of limited use in virtualized environments), we might want to shuffle after a sequence of generic_online_page() calls in the relevant callers. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Link: https://lkml.kernel.org/r/20201005121534.15649-5-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/page_alloc.c [Peng Liu: adjust context] Signed-off-by: NPeng Liu <liupeng256@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 David Hildenbrand 提交于
mainline inclusion from mainline-v5.10-rc1 commit 293ffa5e category: feature bugzilla: 182882 CVE: NA ----------------------------------------------- Whenever we move pages between freelists via move_to_free_list()/ move_freepages_block(), we don't actually touch the pages: 1. Page isolation doesn't actually touch the pages, it simply isolates pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist. When undoing isolation, we move the pages back to the target list. 2. Page stealing (steal_suitable_fallback()) moves free pages directly between lists without touching them. 3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves free pages directly between freelists without touching them. We already place pages to the tail of the freelists when undoing isolation via __putback_isolated_page(), let's do it in any case (e.g., if order <= pageblock_order) and document the behavior. To simplify, let's move the pages to the tail for all move_to_free_list()/move_freepages_block() users. In 2., the target list is empty, so there should be no change. In 3., we might observe a change, however, highatomic is more concerned about allocations succeeding than cache hotness - if we ever realize this change degrades a workload, we can special-case this instance and add a proper comment. This change results in all pages getting onlined via online_pages() to be placed to the tail of the freelist. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Link: https://lkml.kernel.org/r/20201005121534.15649-4-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/page_alloc.c [Peng Liu: cherry-pick from 293ffa5e] Signed-off-by: NPeng Liu <liupeng256@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 David Hildenbrand 提交于
mainline inclusion from mainline-v5.10-rc1 commit 47b6a24a category: feature bugzilla: 182882 CVE: NA ----------------------------------------------- __putback_isolated_page() already documents that pages will be placed to the tail of the freelist - this is, however, not the case for "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be the case for all existing users. This change affects two users: - free page reporting - page isolation, when undoing the isolation (including memory onlining). This behavior is desirable for pages that haven't really been touched lately, so exactly the two users that don't actually read/write page content, but rather move untouched pages. The new behavior is especially desirable for memory onlining, where we allow allocation of newly onlined pages via undo_isolate_page_range() in online_pages(). Right now, we always place them to the head of the freelist, resulting in undesireable behavior: Assume we add individual memory chunks via add_memory() and online them right away to the NORMAL zone. We create a dependency chain of unmovable allocations e.g., via the memmap. The memmap of the next chunk will be placed onto previous chunks - if the last block cannot get offlined+removed, all dependent ones cannot get offlined+removed. While this can already be observed with individual DIMMs, it's more of an issue for virtio-mem (and I suspect also ppc DLPAR). Document that this should only be used for optimizations, and no code should rely on this behavior for correction (if the order of the freelists ever changes). We won't care about page shuffling: memory onlining already properly shuffles after onlining. free page reporting doesn't care about physically contiguous ranges, and there are already cases where page isolation will simply move (physically close) free pages to (currently) the head of the freelists via move_freepages_block() instead of shuffling. If this becomes ever relevant, we should shuffle the whole zone when undoing isolation of larger ranges, and after free_contig_range(). Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com> Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Link: https://lkml.kernel.org/r/20201005121534.15649-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/page_alloc.c [Peng Liu: cherry-pick from 47b6a24a] Signed-off-by: NPeng Liu <liupeng256@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 David Hildenbrand 提交于
mainline inclusion from mainline-v5.10-rc1 commit f04a5d5d category: feature bugzilla: 182882 CVE: NA ----------------------------------------------- Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2. When adding separate memory blocks via add_memory*() and onlining them immediately, the metadata (especially the memmap) of the next block will be placed onto one of the just added+onlined block. This creates a chain of unmovable allocations: If the last memory block cannot get offlined+removed() so will all dependent ones. We directly have unmovable allocations all over the place. This can be observed quite easily using virtio-mem, however, it can also be observed when using DIMMs. The freshly onlined pages will usually be placed to the head of the freelists, meaning they will be allocated next, turning the just-added memory usually immediately un-removable. The fresh pages are cold, prefering to allocate others (that might be hot) also feels to be the natural thing to do. It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when adding separate, successive memory blocks, each memory block will have unmovable allocations on them - for example gigantic pages will fail to allocate. While the ZONE_NORMAL doesn't provide any guarantees that memory can get offlined+removed again (any kind of fragmentation with unmovable allocations is possible), there are many scenarios (hotplugging a lot of memory, running workload, hotunplug some memory/as much as possible) where we can offline+remove quite a lot with this patchset. a) To visualize the problem, a very simple example: Start a VM with 4GB and 8GB of virtio-mem memory: [root@localhost ~]# lsmem RANGE SIZE STATE REMOVABLE BLOCK 0x0000000000000000-0x00000000bfffffff 3G online yes 0-23 0x0000000100000000-0x000000033fffffff 9G online yes 32-103 Memory block size: 128M Total online memory: 12G Total offline memory: 0B Then try to unplug as much as possible using virtio-mem. Observe which memory blocks are still around. Without this patch set: [root@localhost ~]# lsmem RANGE SIZE STATE REMOVABLE BLOCK 0x0000000000000000-0x00000000bfffffff 3G online yes 0-23 0x0000000100000000-0x000000013fffffff 1G online yes 32-39 0x0000000148000000-0x000000014fffffff 128M online yes 41 0x0000000158000000-0x000000015fffffff 128M online yes 43 0x0000000168000000-0x000000016fffffff 128M online yes 45 0x0000000178000000-0x000000017fffffff 128M online yes 47 0x0000000188000000-0x0000000197ffffff 256M online yes 49-50 0x00000001a0000000-0x00000001a7ffffff 128M online yes 52 0x00000001b0000000-0x00000001b7ffffff 128M online yes 54 0x00000001c0000000-0x00000001c7ffffff 128M online yes 56 0x00000001d0000000-0x00000001d7ffffff 128M online yes 58 0x00000001e0000000-0x00000001e7ffffff 128M online yes 60 0x00000001f0000000-0x00000001f7ffffff 128M online yes 62 0x0000000200000000-0x0000000207ffffff 128M online yes 64 0x0000000210000000-0x0000000217ffffff 128M online yes 66 0x0000000220000000-0x0000000227ffffff 128M online yes 68 0x0000000230000000-0x0000000237ffffff 128M online yes 70 0x0000000240000000-0x0000000247ffffff 128M online yes 72 0x0000000250000000-0x0000000257ffffff 128M online yes 74 0x0000000260000000-0x0000000267ffffff 128M online yes 76 0x0000000270000000-0x0000000277ffffff 128M online yes 78 0x0000000280000000-0x0000000287ffffff 128M online yes 80 0x0000000290000000-0x0000000297ffffff 128M online yes 82 0x00000002a0000000-0x00000002a7ffffff 128M online yes 84 0x00000002b0000000-0x00000002b7ffffff 128M online yes 86 0x00000002c0000000-0x00000002c7ffffff 128M online yes 88 0x00000002d0000000-0x00000002d7ffffff 128M online yes 90 0x00000002e0000000-0x00000002e7ffffff 128M online yes 92 0x00000002f0000000-0x00000002f7ffffff 128M online yes 94 0x0000000300000000-0x0000000307ffffff 128M online yes 96 0x0000000310000000-0x0000000317ffffff 128M online yes 98 0x0000000320000000-0x0000000327ffffff 128M online yes 100 0x0000000330000000-0x000000033fffffff 256M online yes 102-103 Memory block size: 128M Total online memory: 8.1G Total offline memory: 0B With this patch set: [root@localhost ~]# lsmem RANGE SIZE STATE REMOVABLE BLOCK 0x0000000000000000-0x00000000bfffffff 3G online yes 0-23 0x0000000100000000-0x000000013fffffff 1G online yes 32-39 Memory block size: 128M Total online memory: 4G Total offline memory: 0B All memory can get unplugged, all memory block can get removed. Of course, no workload ran and the system was basically idle, but it highlights the issue - the fairly deterministic chain of unmovable allocations. When a huge page for the 2MB memmap is needed, a just-onlined 4MB page will be split. The remaining 2MB page will be used for the memmap of the next memory block. So one memory block will hold the memmap of the two following memory blocks. Finally the pages of the last-onlined memory block will get used for the next bigger allocations - if any allocation is unmovable, all dependent memory blocks cannot get unplugged and removed until that allocation is gone. Note that with bigger memory blocks (e.g., 256MB), *all* memory blocks are dependent and none can get unplugged again! b) Experiment with memory intensive workload I performed an experiment with an older version of this patch set (before we used undo_isolate_page_range() in online_pages(): Hotplug 56GB to a VM with an initial 4GB, onlining all memory to ZONE_NORMAL right from the kernel when adding it. I then run various memory intensive workloads that consume most system memory for a total of 45 minutes. Once finished, I try to unplug as much memory as possible. With this change, I am able to remove via virtio-mem (adding individual 128MB memory blocks) 413 out of 448 added memory blocks. Via individual (256MB) DIMMs 380 out of 448 added memory blocks. (I don't have any numbers without this patchset, but looking at the above example, it's at most half of the 448 memory blocks for virtio-mem, and most probably none for DIMMs). Again, there are workloads that might behave very differently due to the nature of ZONE_NORMAL. This change also affects (besides memory onlining): - Other users of undo_isolate_page_range(): Pages are always placed to the tail. -- When memory offlining fails -- When memory isolation fails after having isolated some pageblocks -- When alloc_contig_range() either succeeds or fails - Other users of __putback_isolated_page(): Pages are always placed to the tail. -- Free page reporting - Other users of __free_pages_core() -- AFAIKs, any memory that is getting exposed to the buddy during boot. IIUC we will now usually allocate memory from lower addresses within a zone first (especially during boot). - Other users of generic_online_page() -- Hyper-V balloon This patch (of 5): Let's prepare for additional flags and avoid long parameter lists of bools. Follow-up patches will also make use of the flags in __free_pages_ok(). Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NWei Yang <richard.weiyang@linux.alibaba.com> Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/page_alloc.c [Peng Liu: cherry-pick from f04a5d5d] Signed-off-by: NPeng Liu <liupeng256@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Alexander Duyck 提交于
mainline inclusion from mainline-v5.7-rc1 commit 624f58d8 category: feature bugzilla: 182882 CVE: NA ----------------------------------------------- There are cases where we would benefit from avoiding having to go through the allocation and free cycle to return an isolated page. Examples for this might include page poisoning in which we isolate a page and then put it back in the free list without ever having actually allocated it. This will enable us to also avoid notifiers for the future free page reporting which will need to avoid retriggering page reporting when returning pages that have been reported on. Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomainSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/internal.h [Peng Liu: cherry-pick from 624f58d8] Signed-off-by: NPeng Liu <liupeng256@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Arun KS 提交于
mainline inclusion from mainline-v5.1-rc1 commit a9cd410a category: feature bugzilla: 182882 CVE: NA ----------------------------------------------- When freeing pages are done with higher order, time spent on coalescing pages by buddy allocator can be reduced. With section size of 256MB, hot add latency of a single section shows improvement from 50-60 ms to less than 1 ms, hence improving the hot add latency by 60 times. Modify external providers of online callback to align with the change. [arunks@codeaurora.org: v11] Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org [akpm@linux-foundation.org: remove unused local, per Arun] [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar] [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch] [arunks@codeaurora.org: v8] Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org [arunks@codeaurora.org: v9] Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.orgSigned-off-by: NArun KS <arunks@codeaurora.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/page_alloc.c mm/memory_hotplug.c [Peng Liu: adjust context] Signed-off-by: NPeng Liu <liupeng256@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 03 9月, 2021 1 次提交
-
-
由 Jaewon Kim 提交于
mainline inclusion from mainline-v5.9-rc1 commit f27ce0e1 category: bugfix bugzilla: 41086 CVE: NA ----------------------------------------------- zone_watermark_fast was introduced by commit 48ee5f36 ("mm, page_alloc: shortcut watermark checks for order-0 pages"). The commit simply checks if free pages is bigger than watermark without additional calculation such like reducing watermark. It considered free cma pages but it did not consider highatomic reserved. This may incur exhaustion of free pages except high order atomic free pages. Assume that reserved_highatomic pageblock is bigger than watermark min, and there are only few free pages except high order atomic free. Because zone_watermark_fast passes the allocation without considering high order atomic free, normal reclaimable allocation like GFP_HIGHUSER will consume all the free pages. Then finally order-0 atomic allocation may fail on allocation. This means watermark min is not protected against non-atomic allocation. The order-0 atomic allocation with ALLOC_HARDER unwantedly can be failed. Additionally the __GFP_MEMALLOC allocation with ALLOC_NO_WATERMARKS also can be failed. To avoid the problem, zone_watermark_fast should consider highatomic reserve. If the actual size of high atomic free is counted accurately like cma free, we may use it. On this patch just use nr_reserved_highatomic. Additionally introduce __zone_watermark_unusable_free to factor out common parts between zone_watermark_fast and __zone_watermark_ok. This is an example of ALLOC_HARDER allocation failure using v4.19 based kernel. Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null) Call trace: [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0 [<ffffff8008223320>] warn_alloc+0xd8/0x12c [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250 [<ffffff800827f6e8>] new_slab+0x128/0x604 [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670 [<ffffff800827ba00>] __kmalloc+0x2f8/0x310 [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144 [<ffffff80084ad880>] security_sid_to_context+0x10/0x18 [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28 [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70 [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c Mem-Info: active_anon:102061 inactive_anon:81551 isolated_anon:0 active_file:59102 inactive_file:68924 isolated_file:64 unevictable:611 dirty:63 writeback:0 unstable:0 slab_reclaimable:13324 slab_unreclaimable:44354 mapped:83015 shmem:4858 pagetables:26316 bounce:0 free:2727 free_pcp:1035 free_cma:178 Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB lowmem_reserve[]: 0 0 Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB 138826 total pagecache pages 5460 pages in swap cache Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142 This is an example of ALLOC_NO_WATERMARKS allocation failure using v4.14 based kernel. kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null) kswapd0 cpuset=/ mems_allowed=0 CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1 Call trace: [<0000000000000000>] dump_backtrace+0x0/0x248 [<0000000000000000>] show_stack+0x18/0x20 [<0000000000000000>] __dump_stack+0x20/0x28 [<0000000000000000>] dump_stack+0x68/0x90 [<0000000000000000>] warn_alloc+0x104/0x198 [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0 [<0000000000000000>] zs_malloc+0x148/0x3d0 [<0000000000000000>] zram_bvec_rw+0x410/0x798 [<0000000000000000>] zram_rw_page+0x88/0xdc [<0000000000000000>] bdev_write_page+0x70/0xbc [<0000000000000000>] __swap_writepage+0x58/0x37c [<0000000000000000>] swap_writepage+0x40/0x4c [<0000000000000000>] shrink_page_list+0xc30/0xf48 [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c [<0000000000000000>] shrink_node_memcg+0x23c/0x618 [<0000000000000000>] shrink_node+0x1c8/0x304 [<0000000000000000>] kswapd+0x680/0x7c4 [<0000000000000000>] kthread+0x110/0x120 [<0000000000000000>] ret_from_fork+0x10/0x18 Mem-Info: active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0 Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB [ 4738.329607] lowmem_reserve[]: 0 0 Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB This is trace log which shows GFP_HIGHUSER consumes free pages right before ALLOC_NO_WATERMARKS. <...>-22275 [006] .... 889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO kswapd0-1207 [005] ...1 889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE [jaewon31.kim@samsung.com: remove redundant code for high-order] Link: http://lkml.kernel.org/r/20200623035242.27232-1-jaewon31.kim@samsung.comReported-by: NYong-Taek Lee <ytk.lee@samsung.com> Suggested-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NJaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200619235958.11283-1-jaewon31.kim@samsung.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/page_alloc.c [Peng Liu: cherry-pick from f27ce0e1] Signed-off-by: NPeng Liu <liupeng256@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Ntong tiangen <tongtiangen@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 29 7月, 2021 1 次提交
-
-
由 Muchun Song 提交于
mainline inclusion from mainline-v5.11-rc1 commit 7ad69832 category: bugfix bugzilla: NA CVE: NA ----------------------------------------------- When we free a page whose order is very close to MAX_ORDER and greater than pageblock_order, it wastes some CPU cycles to increase max_order to MAX_ORDER one by one and check the pageblock migratetype of that page repeatedly especially when MAX_ORDER is much larger than pageblock_order. We also should not be checking migratetype of buddy when "order == MAX_ORDER - 1" as the buddy pfn may be invalid, so adjust the condition. With the new check, we don't need the max_order check anymore, so we replace it. Also adjust max_order initialization so that it's lower by one than previously, which makes the code hopefully more clear. Link: https://lkml.kernel.org/r/20201204155109.55451-1-songmuchun@bytedance.com Fixes: d9dddbf5 ("mm/page_alloc: prevent merging between isolated and other pageblocks") Signed-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> conflict: mm/page_alloc.c Signed-off-by: NTong Tiangen <tongtiangen@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 30 6月, 2021 1 次提交
-
-
由 yuzhoujian 提交于
mainline inclusion from mainline-5.0-rc1 commit ef8444ea category: bugfix bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I CVE: NA ------------------------------------------------- OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.comSigned-off-by: Nyuzhoujian <yuzhoujian@didichuxing.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit ef8444ea) Signed-off-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> (cherry picked from commit 985eab72d54b5ac73189d609486526b5e30125ac) Signed-off-by: NLu Jialin <lujialin4@huawei.com> Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 17 5月, 2021 1 次提交
-
-
由 Guo Hui 提交于
uniontech inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I3RFV8 CVE: NA ---------------------------------------------------------------- Commit eb761d65 ("mm: parallelize deferred struct page initialization within each node") the code "++zone" in follow code: /* Sanity check that the next zone really is unpopulated */ WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone)); VM_BUG_ON(nr_init != nr_free); zone->managed_pages += nr_free; makes the managed_pages statistics of the current zone incorrect and the zone may have out-of-bounds memory when CONFIG_DEFERRED_STRUCT_PAGE_INIT=y, causing the Virtual machine system startup to fail when the Virtual machine system current allocated memory is set to half of the Virtual machine maximum memory using virt-manager tool Fix it by putting the code “zone->managed_pages += nr_free;” before “++zone” code Fixes: eb761d65 ("mm: parallelize deferred struct page initialization within each node") Reported-by: NPeng Yuanbo <pengyuanbo@uniontech.com> Signed-off-by: NGuo Hui <guohui@uniontech.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 22 2月, 2021 5 次提交
-
-
由 Liu Shixin 提交于
hulk inclusion category: bugfix bugzilla: 47240 CVE: NA ------------------------------------------------- Fix kabi broken for struct pglist_data and struct mem_cgroup. Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Yang Shi 提交于
mainline inclusion from mainline-v5.4-rc1 commit 7ae88534 category: bugfix bugzilla: 47240 CVE: NA ------------------------------------------------- A later patch makes THP deferred split shrinker memcg aware, but it needs page->mem_cgroup information in THP destructor, which is called after mem_cgroup_uncharge() now. So move mem_cgroup_uncharge() from __page_cache_release() to compound page destructor, which is called by both THP and other compound pages except HugeTLB. And call it in __put_single_page() for single order page. Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com> Suggested-by: N"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Yang Shi 提交于
mainline inclusion from mainline-v5.4-rc1 commit 364c1eeb category: bugfix bugzilla: 47240 CVE: NA ------------------------------------------------- Patch series "Make deferred split shrinker memcg aware", v6. Currently THP deferred split shrinker is not memcg aware, this may cause premature OOM with some configuration. For example the below test would run into premature OOM easily: $ cgcreate -g memory:thp $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes $ cgexec -g memory:thp transhuge-stress 4000 transhuge-stress comes from kernel selftest. It is easy to hit OOM, but there are still a lot THP on the deferred split queue, memcg direct reclaim can't touch them since the deferred split shrinker is not memcg aware. Convert deferred split shrinker memcg aware by introducing per memcg deferred split queue. The THP should be on either per node or per memcg deferred split queue if it belongs to a memcg. When the page is immigrated to the other memcg, it will be immigrated to the target memcg's deferred split queue too. Reuse the second tail page's deferred_list for per memcg list since the same THP can't be on multiple deferred split queues. Make deferred split shrinker not depend on memcg kmem since it is not slab. It doesn't make sense to not shrink THP even though memcg kmem is disabled. With the above change the test demonstrated above doesn't trigger OOM even though with cgroup.memory=nokmem. This patch (of 4): Put split_queue, split_queue_lock and split_queue_len into a struct in order to reduce code duplication when we convert deferred_split to memcg aware in the later patches. Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com> Suggested-by: N"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Dongli Zhang 提交于
stable inclusion from linux-4.19.160 commit 082b21d03479b566681f43b74677c249f6aad337 -------------------------------- [ Upstream commit d8c19014 ] The ethernet driver may allocate skb (and skb->data) via napi_alloc_skb(). This ends up to page_frag_alloc() to allocate skb->data from page_frag_cache->va. During the memory pressure, page_frag_cache->va may be allocated as pfmemalloc page. As a result, the skb->pfmemalloc is always true as skb->data is from page_frag_cache->va. The skb will be dropped if the sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour under memory pressure. However, once kernel is not under memory pressure any longer (suppose large amount of memory pages are just reclaimed), the page_frag_alloc() may still re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a result, the skb->pfmemalloc is always true unless page_frag_cache->va is re-allocated, even if the kernel is not under memory pressure any longer. Here is how kernel runs into issue. 1. The kernel is under memory pressure and allocation of PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead, the pfmemalloc page is allocated for page_frag_cache->va. 2: All skb->data from page_frag_cache->va (pfmemalloc) will have skb->pfmemalloc=true. The skb will always be dropped by sock without SOCK_MEMALLOC. This is an expected behaviour. 3. Suppose a large amount of pages are reclaimed and kernel is not under memory pressure any longer. We expect skb->pfmemalloc drop will not happen. 4. Unfortunately, page_frag_alloc() does not proactively re-allocate page_frag_alloc->va and will always re-use the prior pfmemalloc page. The skb->pfmemalloc is always true even kernel is not under memory pressure any longer. Fix this by freeing and re-allocating the page instead of recycling it. References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/ References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/Suggested-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Cc: Bert Barbe <bert.barbe@oracle.com> Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Cc: Manjunath Patil <manjunath.b.patil@oracle.com> Cc: Joe Jin <joe.jin@oracle.com> Cc: SRINIVAS <srinivas.eeda@oracle.com> Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve") Signed-off-by: NDongli Zhang <dongli.zhang@oracle.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201115201029.11903-1-dongli.zhang@oracle.comSigned-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Matthew Wilcox (Oracle) 提交于
mainline inclusion from mainline-v5.10-rc1 commit e320d301 category: bugfix bugzilla: 43588 CVE: NA ------------------------------------------------- Here is a very rare race which leaks memory: Page P0 is allocated to the page cache. Page P1 is free. Thread A Thread B Thread C find_get_entry(): xas_load() returns P0 Removes P0 from page cache P0 finds its buddy P1 alloc_pages(GFP_KERNEL, 1) returns P0 P0 has refcount 1 page_cache_get_speculative(P0) P0 has refcount 2 __free_pages(P0) P0 has refcount 1 put_page(P0) P1 is not freed Fix this by freeing all the pages in __free_pages() that won't be freed by the call to put_page(). It's usually not a good idea to split a page, but this is a very unlikely scenario. Fixes: e286781d ("mm: speculative page references") Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMike Rapoport <rppt@linux.ibm.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200926213919.26642-1-willy@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
- 04 11月, 2020 2 次提交
-
-
由 Vijay Balakrishna 提交于
stable inclusion from linux-4.19.151 commit 94c51675811267a1ccaa7f6dc336714a02e20246 -------------------------------- commit 4aab2be0 upstream. When memory is hotplug added or removed the min_free_kbytes should be recalculated based on what is expected by khugepaged. Currently after hotplug, min_free_kbytes will be set to a lower default and higher default set when THP enabled is lost. This change restores min_free_kbytes as expected for THP consumers. [vijayb@linux.microsoft.com: v5] Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com Fixes: f000565a ("thp: set recommended min free kbytes") Signed-off-by: NVijay Balakrishna <vijayb@linux.microsoft.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Allen Pais <apais@microsoft.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Conflicts: mm/page_alloc.c [yyl: adjust context] Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Laurent Dufour 提交于
stable inclusion from linux-4.19.150 commit 25eaea1b33f2569f69a82dfddb3fb05384143bd0 -------------------------------- commit c1d0da83 upstream. Patch series "mm: fix memory to node bad links in sysfs", v3. Sometimes, firmware may expose interleaved memory layout like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] In that case, we can see memory blocks assigned to multiple nodes in sysfs: $ ls -l /sys/devices/system/memory/memory21 total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 -rw-r--r-- 1 root root 65536 Aug 24 05:27 online -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index drwxr-xr-x 2 root root 0 Aug 24 05:27 power -r--r--r-- 1 root root 65536 Aug 24 05:27 removable -rw-r--r-- 1 root root 65536 Aug 24 05:27 state lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones The same applies in the node's directory with a memory21 link in both the node1 and node2's directory. This is wrong but doesn't prevent the system to run. However when later, one of these memory blocks is hot-unplugged and then hot-plugged, the system is detecting an inconsistency in the sysfs layout and a BUG_ON() is raised: kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This has been seen on PowerPC LPAR. The root cause of this issue is that when node's memory is registered, the range used can overlap another node's range, thus the memory block is registered to multiple nodes in sysfs. There are two issues here: (a) The sysfs memory and node's layouts are broken due to these multiple links (b) The link errors in link_mem_sections() should not lead to a system panic. To address (a) register_mem_sect_under_node should not rely on the system state to detect whether the link operation is triggered by a hot plug operation or not. This is addressed by the patches 1 and 2 of this series. Issue (b) will be addressed separately. This patch (of 2): The memmap_context enum is used to detect whether a memory operation is due to a hot-add operation or happening at boot time. Make it general to the hotplug operation and rename it as meminit_context. There is no functional change introduced by this patch Suggested-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 15 10月, 2020 2 次提交
-
-
由 Charan Teja Reddy 提交于
stable inclusion from linux-4.19.142 commit c666936d8d8b0ace4f3260d71a4eedefd53011d9 -------------------------------- commit 88e8ac11 upstream. The following race is observed with the repeated online, offline and a delay between two successive online of memory blocks of movable zone. P1 P2 Online the first memory block in the movable zone. The pcp struct values are initialized to default values,i.e., pcp->high = 0 & pcp->batch = 1. Allocate the pages from the movable zone. Try to Online the second memory block in the movable zone thus it entered the online_pages() but yet to call zone_pcp_update(). This process is entered into the exit path thus it tries to release the order-0 pages to pcp lists through free_unref_page_commit(). As pcp->high = 0, pcp->count = 1 proceed to call the function free_pcppages_bulk(). Update the pcp values thus the new pcp values are like, say, pcp->high = 378, pcp->batch = 63. Read the pcp's batch value using READ_ONCE() and pass the same to free_pcppages_bulk(), pcp values passed here are, batch = 63, count = 1. Since num of pages in the pcp lists are less than ->batch, then it will stuck in while(list_empty(list)) loop with interrupts disabled thus a core hung. Avoid this by ensuring free_pcppages_bulk() is called with proper count of pcp list pages. The mentioned race is some what easily reproducible without [1] because pcp's are not updated for the first memory block online and thus there is a enough race window for P2 between alloc+free and pcp struct values update through onlining of second memory block. With [1], the race still exists but it is very narrow as we update the pcp struct values for the first memory block online itself. This is not limited to the movable zone, it could also happen in cases with the normal zone (e.g., hotplug to a node that only has DMA memory, or no other memory yet). [1]: https://patchwork.kernel.org/patch/11696389/ Fixes: 5f8dcc21 ("page-allocator: split per-cpu list into one-list-per-migrate-type") Signed-off-by: NCharan Teja Reddy <charante@codeaurora.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: <stable@vger.kernel.org> [2.6+] Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Doug Berger 提交于
stable inclusion from linux-4.19.142 commit 84b8dc232afadf3aab425a104def45a1e7346a58 -------------------------------- commit e08d3fdf upstream. The lowmem_reserve arrays provide a means of applying pressure against allocations from lower zones that were targeted at higher zones. Its values are a function of the number of pages managed by higher zones and are assigned by a call to the setup_per_zone_lowmem_reserve() function. The function is initially called at boot time by the function init_per_zone_wmark_min() and may be called later by accesses of the /proc/sys/vm/lowmem_reserve_ratio sysctl file. The function init_per_zone_wmark_min() was moved up from a module_init to a core_initcall to resolve a sequencing issue with khugepaged. Unfortunately this created a sequencing issue with CMA page accounting. The CMA pages are added to the managed page count of a zone when cma_init_reserved_areas() is called at boot also as a core_initcall. This makes it uncertain whether the CMA pages will be added to the managed page counts of their zones before or after the call to init_per_zone_wmark_min() as it becomes dependent on link order. With the current link order the pages are added to the managed count after the lowmem_reserve arrays are initialized at boot. This means the lowmem_reserve values at boot may be lower than the values used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the ratio values are unchanged. In many cases the difference is not significant, but for example an ARM platform with 1GB of memory and the following memory layout cma: Reserved 256 MiB at 0x0000000030000000 Zone ranges: DMA [mem 0x0000000000000000-0x000000002fffffff] Normal empty HighMem [mem 0x0000000030000000-0x000000003fffffff] would result in 0 lowmem_reserve for the DMA zone. This would allow userspace to deplete the DMA zone easily. Funnily enough $ cat /proc/sys/vm/lowmem_reserve_ratio would fix up the situation because as a side effect it forces setup_per_zone_lowmem_reserve. This commit breaks the link order dependency by invoking init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages have the chance to be properly accounted in their zone(s) and allowing the lowmem_reserve arrays to receive consistent values. Fixes: bc22af74 ("mm: update min_free_kbytes from khugepaged after core initialization") Signed-off-by: NDoug Berger <opendmb@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Jason Baron <jbaron@akamai.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 22 9月, 2020 6 次提交
-
-
由 Pavel Tatashin 提交于
stable inclusion from linux-4.19.129 commit 88afa532c14135528b905015f1d9a5e740a95136 -------------------------------- commit 3d060856 upstream. Initializing struct pages is a long task and keeping interrupts disabled for the duration of this operation introduces a number of problems. 1. jiffies are not updated for long period of time, and thus incorrect time is reported. See proposed solution and discussion here: lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com 2. It prevents farther improving deferred page initialization by allowing intra-node multi-threading. We are keeping interrupts disabled to solve a rather theoretical problem that was never observed in real world (See 3a2d7fa8). Let's keep interrupts enabled. In case we ever encounter a scenario where an interrupt thread wants to allocate large amount of memory this early in boot we can deal with that by growing zone (see deferred_grow_zone()) by the needed amount before starting deferred_init_memmap() threads. Before: [ 1.232459] node 0 initialised, 12058412 pages in 1ms After: [ 1.632580] node 0 initialised, 12051227 pages in 436ms Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages") Reported-by: NShile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: NPavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Sasha Levin <sashal@kernel.org> Cc: Yiqian Wei <yiwei@redhat.com> Cc: <stable@vger.kernel.org> [4.17+] Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 David Hildenbrand 提交于
stable inclusion from linux-4.19.123 commit dfe810bd92be7a50f491abd381f5a742d9844675 -------------------------------- commit e84fe99b upstream. Without CONFIG_PREEMPT, it can happen that we get soft lockups detected, e.g., while booting up. watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: __pageblock_pfn_to_page+0x134/0x1c0 Call Trace: set_zone_contiguous+0x56/0x70 page_alloc_init_late+0x166/0x176 kernel_init_freeable+0xfa/0x255 kernel_init+0xa/0x106 ret_from_fork+0x35/0x40 The issue becomes visible when having a lot of memory (e.g., 4TB) assigned to a single NUMA node - a system that can easily be created using QEMU. Inside VMs on a hypervisor with quite some memory overcommit, this is fairly easy to trigger. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com> Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Reviewed-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Alexander Duyck 提交于
stable inclusion from linux-4.19.116 commit 695986163d66a9f55daf13aba5976d4b03a23cc9 -------------------------------- commit 86447726 upstream. This patch replaces the size + 1 value introduced with the recent fix for 1 byte allocs with a constant value. The idea here is to reduce code overhead as the previous logic would have to read size into a register, then increment it, and write it back to whatever field was being used. By using a constant we can avoid those memory reads and arithmetic operations in favor of just encoding the maximum value into the operation itself. Fixes: 2c2ade81 ("mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs") Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 David Hildenbrand 提交于
stable inclusion from linux-4.19.103 commit 0a69047d8235c60d88c6ca488d8dccc7c60d4d3c -------------------------------- [ Upstream commit e822969c ] Patch series "mm: fix max_pfn not falling on section boundary", v2. Playing with different memory sizes for a x86-64 guest, I discovered that some memmaps (highest section if max_mem does not fall on the section boundary) are marked as being valid and online, but contain garbage. We have to properly initialize these memmaps. Looking at /proc/kpageflags and friends, I found some more issues, partially related to this. This patch (of 3): If max_pfn is not aligned to a section boundary, we can easily run into BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a memory size that is not a multiple of 128MB (e.g., 4097MB, but also 4160MB). I was told that on real HW, we can easily have this scenario (esp., one of the main reasons sub-section hotadd of devmem was added). The issue is, that we have a valid memmap (pfn_valid()) for the whole section, and the whole section will be marked "online". pfn_to_online_page() will succeed, but the memmap contains garbage. E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m 4160M" - (see tools/vm/page-types.c): [ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe [ 200.477500] #PF: supervisor read access in kernel mode [ 200.478334] #PF: error_code(0x0000) - not-present page [ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0 [ 200.479557] Oops: 0000 [#4] SMP NOPTI [ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93 [ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4 [ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410 [ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f [ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202 [ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000 [ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246 [ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000 [ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001 [ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08 [ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000 [ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0 [ 200.488897] Call Trace: [ 200.489115] kpageflags_read+0xe9/0x140 [ 200.489447] proc_reg_read+0x3c/0x60 [ 200.489755] vfs_read+0xc2/0x170 [ 200.490037] ksys_pread64+0x65/0xa0 [ 200.490352] do_syscall_64+0x5c/0xa0 [ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe But it can be triggered much easier via "cat /proc/kpageflags > /dev/null" after cold/hot plugging a DIMM to such a system: [root@localhost ~]# cat /proc/kpageflags > /dev/null [ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe [ 111.517907] #PF: supervisor read access in kernel mode [ 111.518333] #PF: error_code(0x0000) - not-present page [ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0 This patch fixes that by at least zero-ing out that memmap (so e.g., page_to_pfn() will not crash). Commit 907ec5fc ("mm: zero remaining unavailable struct pages") tried to fix a similar issue, but forgot to consider this special case. After this patch, there are still problems to solve. E.g., not all of these pages falling into a memory hole will actually get initialized later and set PageReserved - they are only zeroed out - but at least the immediate crashes are gone. A follow-up patch will take care of this. Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: NDavid Hildenbrand <david@redhat.com> Tested-by: NDaniel Jordan <daniel.m.jordan@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: <stable@vger.kernel.org> [4.15+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Pavel Tatashin 提交于
stable inclusion from linux-4.19.103 commit f19a50c1e3ba9f58ca5a591a82ac4852da8bc4ee -------------------------------- [ Upstream commit ec393a0f ] When checking for valid pfns in zero_resv_unavail(), it is not necessary to verify that pfns within pageblock_nr_pages ranges are valid, only the first one needs to be checked. This is because memory for pages are allocated in contiguous chunks that contain pageblock_nr_pages struct pages. Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.comSigned-off-by: NPavel Tatashin <pavel.tatashin@microsoft.com> Signed-off-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Naoya Horiguchi 提交于
stable inclusion from linux-4.19.103 commit 9ac5917a1d28220981512c4f4c391c90a997e0c6 -------------------------------- [ Upstream commit 907ec5fc ] Patch series "mm: Fix for movable_node boot option", v3. This patch series contains a fix for the movable_node boot option issue which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"). The commit breaks the option because it changed the memory gap range to reserved memblock. So, the node is marked as Normal zone even if the SRAT has Hot pluggable affinity. First and second patch fix the original issue which the commit tried to fix, then revert the commit. This patch (of 3): There is a kernel panic that is triggered when reading /proc/kpageflags on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]': BUG: unable to handle kernel paging request at fffffffffffffffe PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014 RIP: 0010:stable_page_flags+0x27/0x3c0 Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7 RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202 RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0 RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001 R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0 R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10 FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0 Call Trace: kpageflags_read+0xc7/0x120 proc_reg_read+0x3c/0x60 __vfs_read+0x36/0x170 vfs_read+0x89/0x130 ksys_pread64+0x71/0x90 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7efc42e75e23 Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24 According to kernel bisection, this problem became visible due to commit f7f99100 which changes how struct pages are initialized. Memblock layout affects the pfn ranges covered by node/zone. Consider that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the default (no memmap= given) memblock layout is like below: MEMBLOCK configuration: memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x4 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0 memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... If you give memmap=1G!4G (so it just covers memory[0x2]), the range [0x100000000-0x13fffffff] is gone: MEMBLOCK configuration: memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x3 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... This causes shrinking node 0's pfn range because it is calculated by the address range of memblock.memory. So some of struct pages in the gap range are left uninitialized. We have a function zero_resv_unavail() which does zeroing the struct pages outside memblock.memory, but currently it covers only the reserved unavailable range (i.e. memblock.memory && !memblock.reserved). This patch extends it to cover all unavailable range, which fixes the reported issue. Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com> Tested-by: NOscar Salvador <osalvador@suse.de> Tested-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
- 06 8月, 2020 1 次提交
-
-
由 Johannes Weiner 提交于
mainline inclusion from v5.4-rc6 commit 1be334e5 category: feature bugzilla: NA CVE: NA ------------------------------------------------- While investigating a bug related to higher atomic allocation failures, we noticed the failure warnings positively drowning the console, and in our case trigger lockup warnings because of a serial console too slow to handle all that output. But even if we had a faster console, it's unclear what additional information the current level of repetition provides. Allocation failures happen for three reasons: The machine is OOM, the VM is failing to handle reasonable requests, or somebody is making unreasonable requests (and didn't acknowledge their opportunism with __GFP_NOWARN). Having the memory dump, a callstack, and the ratelimit stats on skipped failure warnings should provide enough information to let users/admins/developers know whether something is wrong and point them in the right direction for debugging, bpftracing etc. Limit allocation failure warnings to one spew every ten seconds. Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/page_alloc.c Signed-off-by: NChen Jun <chenjun102@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-