- 03 4月, 2020 40 次提交
-
-
由 Qian Cai 提交于
pgdat->kswapd_classzone_idx could be accessed concurrently in wakeup_kswapd(). Plain writes and reads without any lock protection result in data races. Fix them by adding a pair of READ|WRITE_ONCE() as well as saving a branch (compilers might well optimize the original code in an unintentional way anyway). While at it, also take care of pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim(). The data races were reported by KCSAN, BUG: KCSAN: data-race in wakeup_kswapd / wakeup_kswapd write to 0xffff9f427ffff2dc of 4 bytes by task 7454 on cpu 13: wakeup_kswapd+0xf1/0x400 wakeup_kswapd at mm/vmscan.c:3967 wake_all_kswapds+0x59/0xc0 wake_all_kswapds at mm/page_alloc.c:4241 __alloc_pages_slowpath+0xdcc/0x1290 __alloc_pages_slowpath at mm/page_alloc.c:4512 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x16e/0x6f0 __handle_mm_fault+0xcd5/0xd40 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 1 lock held by mtest01/7454: #0: ffff9f425afe8808 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9 do_user_addr_fault at arch/x86/mm/fault.c:1405 (inlined by) do_page_fault at arch/x86/mm/fault.c:1539 irq event stamp: 6944085 count_memcg_event_mm+0x1a6/0x270 count_memcg_event_mm+0x119/0x270 __do_softirq+0x34c/0x57c irq_exit+0xa2/0xc0 read to 0xffff9f427ffff2dc of 4 bytes by task 7472 on cpu 38: wakeup_kswapd+0xc8/0x400 wake_all_kswapds+0x59/0xc0 __alloc_pages_slowpath+0xdcc/0x1290 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x16e/0x6f0 __handle_mm_fault+0xcd5/0xd40 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 1 lock held by mtest01/7472: #0: ffff9f425a9ac148 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9 irq event stamp: 6793561 count_memcg_event_mm+0x1a6/0x270 count_memcg_event_mm+0x119/0x270 __do_softirq+0x34c/0x57c irq_exit+0xa2/0xc0 BUG: KCSAN: data-race in kswapd / wakeup_kswapd write to 0xffff90973ffff2dc of 4 bytes by task 820 on cpu 6: kswapd+0x27c/0x8d0 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 read to 0xffff90973ffff2dc of 4 bytes by task 6299 on cpu 0: wakeup_kswapd+0xf3/0x450 wake_all_kswapds+0x59/0xc0 __alloc_pages_slowpath+0xdcc/0x1290 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x170/0x700 __handle_mm_fault+0xc9f/0xd00 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 Signed-off-by: NQian Cai <cai@lca.pw> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/1582749472-5171-1-git-send-email-cai@lca.pwSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
kswapd kernel thread starts either with a CPU affinity set to the full cpu mask of its target node or without any affinity at all if the node is CPUless. There is a cpu hotplug callback (kswapd_cpu_online) that implements an elaborate way to update this mask when a cpu is onlined. It is not really clear whether there is any actual benefit from this scheme. Completely CPU-less NUMA nodes rarely gain a new CPU during runtime. Drop the code for that reason. If there is a real usecase then we can resurrect and simplify the code. [mhocko@suse.com rewrite changelog] Suggested-by: NMichal Hocko <mhocko@suse.org> Signed-off-by: NWei Yang <richardw.yang@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20200218224422.3407-1-richardw.yang@linux.intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yang Shi 提交于
The commit 98fa15f3 ("mm: replace all open encodings for NUMA_NO_NODE") did the replacement across the kernel tree, but we got some more in vmscan.c since then. Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NDavid Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1581568298-45317-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yang Shi 提交于
Use mem_cgroup_is_root() API to check if memcg is root memcg instead of open coding. Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NDavid Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1581398649-125989-2-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yang Shi 提交于
When kstrndup fails, no memory was allocated and we can exit directly. [david@redhat.com: reword changelog] Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/1581398649-125989-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 chenqiwu 提交于
Simplify page_is_buddy() to reduce the redundant code for better code readability. Signed-off-by: Nchenqiwu <chenqiwu@xiaomi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/1583853751-5525-1-git-send-email-qiwuchen55@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mateusz Nosek 提交于
Previously if branch condition was false, the assignment was not executed. The assignment can be safely executed even when the condition is false and it is not incorrect as it assigns the value of 'nodemask' to 'ac.nodemask' which already has the same value. So as the assignment can be executed unconditionally, the branch can be removed. Signed-off-by: NMateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Link: http://lkml.kernel.org/r/20200307225335.31300-1-mateusznosek0@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 chenqiwu 提交于
Use free_area_empty() API to replace list_empty() for better code readability. Signed-off-by: Nchenqiwu <chenqiwu@xiaomi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Link: http://lkml.kernel.org/r/1583674354-7713-1-git-send-email-qiwuchen55@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mateusz Nosek 提交于
This patch makes ALLOC_KSWAPD equal to __GFP_KSWAPD_RECLAIM (cast to int). Thanks to that code like: if (gfp_mask & __GFP_KSWAPD_RECLAIM) alloc_flags |= ALLOC_KSWAPD; can be changed to: alloc_flags |= (__force int) (gfp_mask &__GFP_KSWAPD_RECLAIM); Thanks to this one branch less is generated in the assembly. In case of ALLOC_KSWAPD flag two branches are saved, first one in code that always executes in the beginning of page allocation and the second one in loop in page allocator slowpath. Signed-off-by: NMateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMel Gorman <mgorman@techsingularity.net> Link: http://lkml.kernel.org/r/20200304162118.14784-1-mateusznosek0@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joel Savitz 提交于
Currently, the vm.min_free_kbytes sysctl value is capped at a hardcoded 64M in init_per_zone_wmark_min (unless it is overridden by khugepaged initialization). This value has not been modified since 2005, and enterprise-grade systems now frequently have hundreds of GB of RAM and multiple 10, 40, or even 100 GB NICs. We have seen page allocation failures on heavily loaded systems related to NIC drivers. These issues were resolved by an increase to vm.min_free_kbytes. This patch increases the hardcoded value by a factor of 4 as a temporary solution. Further work to make the calculation of vm.min_free_kbytes more consistent throughout the kernel would be desirable. As an example of the inconsistency of the current method, this value is recalculated by init_per_zone_wmark_min() in the case of memory hotplug which will override the value set by set_recommended_min_free_kbytes() called during khugepaged initialization even if khugepaged remains enabled, however an on/off toggle of khugepaged will then recalculate and set the value via set_recommended_min_free_kbytes(). Signed-off-by: NJoel Savitz <jsavitz@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Rafael Aquini <aquini@redhat.com> Link: http://lkml.kernel.org/r/20200220150103.5183-1-jsavitz@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Walter Wu 提交于
Patch series "fix the missing underflow in memory operation function", v4. The patchset helps to produce a KASAN report when size is negative in memory operation functions. It is helpful for programmer to solve an undefined behavior issue. Patch 1 based on Dmitry's review and suggestion, patch 2 is a test in order to verify the patch 1. [1]https://bugzilla.kernel.org/show_bug.cgi?id=199341 [2]https://lore.kernel.org/linux-arm-kernel/20190927034338.15813-1-walter-zh.wu@mediatek.com/ This patch (of 2): KASAN missed detecting size is a negative number in memset(), memcpy(), and memmove(), it will cause out-of-bounds bug. So needs to be detected by KASAN. If size is a negative number, then it has a reason to be defined as out-of-bounds bug type. Casting negative numbers to size_t would indeed turn up as a large size_t and its value will be larger than ULONG_MAX/2, so that this can qualify as out-of-bounds. KASAN report is shown below: BUG: KASAN: out-of-bounds in kmalloc_memmove_invalid_size+0x70/0xa0 Read of size 18446744073709551608 at addr ffffff8069660904 by task cat/72 CPU: 2 PID: 72 Comm: cat Not tainted 5.4.0-rc1-next-20191004ajb-00001-gdb8af2f372b2-dirty #1 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x288 show_stack+0x14/0x20 dump_stack+0x10c/0x164 print_address_description.isra.9+0x68/0x378 __kasan_report+0x164/0x1a0 kasan_report+0xc/0x18 check_memory_region+0x174/0x1d0 memmove+0x34/0x88 kmalloc_memmove_invalid_size+0x70/0xa0 [1] https://bugzilla.kernel.org/show_bug.cgi?id=199341 [cai@lca.pw: fix -Wdeclaration-after-statement warn] Link: http://lkml.kernel.org/r/1583509030-27939-1-git-send-email-cai@lca.pw [peterz@infradead.org: fix objtool warning] Link: http://lkml.kernel.org/r/20200305095436.GV2596@hirez.programming.kicks-ass.netReported-by: Nkernel test robot <lkp@intel.com> Reported-by: NDmitry Vyukov <dvyukov@google.com> Suggested-by: NDmitry Vyukov <dvyukov@google.com> Signed-off-by: NWalter Wu <walter-zh.wu@mediatek.com> Signed-off-by: NQian Cai <cai@lca.pw> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDmitry Vyukov <dvyukov@google.com> Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Link: http://lkml.kernel.org/r/20191112065302.7015-1-walter-zh.wu@mediatek.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Baoquan He 提交于
When allocating memmap for hot added memory with the classic sparse, the specified 'nid' is ignored in populate_section_memmap(). While in allocating memmap for the classic sparse during boot, the node given by 'nid' is preferred. And VMEMMAP prefers the node of 'nid' in both boot stage and memory hot adding. So seems no reason to not respect the node of 'nid' for the classic sparse when hot adding memory. Use kvmalloc_node instead to use the passed in 'nid'. Signed-off-by: NBaoquan He <bhe@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200316125625.GH3486@MiWiFi-R3L-srvSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Baoquan He 提交于
This change makes populate_section_memmap()/depopulate_section_memmap much simpler. Suggested-by: NMichal Hocko <mhocko@kernel.org> Signed-off-by: NBaoquan He <bhe@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200316125450.GG3486@MiWiFi-R3L-srvSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pingfan Liu 提交于
After introducing mem sub section concept, pfn_present() loses its literal meaning, and will not be necessary a truth on partial populated mem section. Since all of the callers use it to judge an absent section, it is better to rename pfn_present() as pfn_in_present_section(). Signed-off-by: NPingfan Liu <kernelfans@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Leonardo Bras <leonardo@linux.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Link: http://lkml.kernel.org/r/1581919110-29575-1-git-send-email-kernelfans@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
memmap should be the address to page struct instead of address to pfn. As mentioned by David, if system memory and devmem sit within a section, the mismatch address would lead kdump to dump unexpected memory. Since sub-section only works for SPARSEMEM_VMEMMAP, pfn_to_page() is valid to get the page struct address at this point. Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug") Signed-off-by: NWei Yang <richardw.yang@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NDavid Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Baoquan He <bhe@redhat.com> Link: http://lkml.kernel.org/r/20200210005048.10437-1-richardw.yang@linux.intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Brian Geffon 提交于
When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is set, the source mapping will not be removed. The remap operation will be performed as it would have been normally by moving over the page tables to the new mapping. The old vma will have any locked flags cleared, have no pagetables, and any userfaultfds that were watching that range will continue watching it. For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause the mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving a VMA you MUST use the MREMAP_MAYMOVE flag, it's not possible to resize a VMA while also moving with MREMAP_DONTUNMAP so old_len must always be equal to the new_len otherwise it will return -EINVAL. We hope to use this in Chrome OS where with userfaultfd we could write an anonymous mapping to disk without having to STOP the process or worry about VMA permission changes. This feature also has a use case in Android, Lokesh Gidra has said that "As part of using userfaultfd for GC, We'll have to move the physical pages of the java heap to a separate location. For this purpose mremap will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java heap, its virtual mapping will be removed as well. Therefore, we'll require performing mmap immediately after. This is not only time consuming but also opens a time window where a native thread may call mmap and reserve the java heap's address range for its own usage. This flag solves the problem." [bgeffon@google.com: v6] Link: http://lkml.kernel.org/r/20200218173221.237674-1-bgeffon@google.com [bgeffon@google.com: v7] Link: http://lkml.kernel.org/r/20200221174248.244748-1-bgeffon@google.comSigned-off-by: NBrian Geffon <bgeffon@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Tested-by: NLokesh Gidra <lokeshgidra@google.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: "Michael S . Tsirkin" <mst@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Deacon <will@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Jesse Barnes <jsbarnes@google.com> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Florian Weimer <fweimer@redhat.com> Link: http://lkml.kernel.org/r/20200207201856.46070-1-bgeffon@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jaewon Kim 提交于
Even on 64 bit kernel, the mmap failure can happen for a 32 bit task. Virtual memory space shortage of a task on mmap is reported to userspace as -ENOMEM. It can be confused as physical memory shortage of overall system. The vm_unmapped_area can be called to by some drivers or other kernel core system like filesystem. In my platform, GPU driver calls to vm_unmapped_area and the driver returns -ENOMEM even in GPU side shortage. It can be hard to distinguish which code layer returns the -ENOMEM. Create mmap trace file and add trace point of vm_unmapped_area. i.e.) 277.156599: vm_unmapped_area: addr=77e0d03000 err=0 total_vm=0x17014b flags=0x1 len=0x400000 lo=0x8000 hi=0x7878c27000 mask=0x0 ofs=0x1 342.838740: vm_unmapped_area: addr=0 err=-12 total_vm=0xffb08 flags=0x0 len=0x100000 lo=0x40000000 hi=0xfffff000 mask=0x0 ofs=0x22 [akpm@linux-foundation.org: prefix address printk with 0x, per Matthew] Signed-off-by: NJaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200320055823.27089-3-jaewon31.kim@samsung.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jaewon Kim 提交于
Patch series "mm: mmap: add mmap trace point", v3. Create mmap trace file and add trace point of vm_unmapped_area(). This patch (of 2): In preparation for next patch remove inline of vm_unmapped_area and move code to mmap.c. There is no logical change. Also remove unmapped_area[_topdown] out of mm.h, there is no code calling to them. Signed-off-by: NJaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20200320055823.27089-2-jaewon31.kim@samsung.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wang Wenhu 提交于
The param "start" actually referes to the physical memory start, which is to be mapped into virtual area vma. And it is the field vma->vm_start which stands for the start of the area. Most of the time, we do not read through whole implementation of a function but only the definition and essential comments. Accurate comments are definitely the base stone. Signed-off-by: NWang Wenhu <wenhu.wang@vivo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200318052206.105104-1-wenhu.wang@vivo.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 WANG Wenhu 提交于
It really made me scratch my head. Replace the comment with an accurate and consistent description. The parameter pfn actually refers to the page frame number which is right-shifted by PAGE_SHIFT from the physical address. Signed-off-by: NWANG Wenhu <wenhu.wang@vivo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200310073955.43415-1-wenhu.wang@vivo.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
The existing gup code does not react to the fatal signals in many code paths. For example, in one retry path of gup we're still using down_read() rather than down_read_killable(). Also, when doing page faults we don't pass in FAULT_FLAG_KILLABLE as well, which means that within the faulting process we'll wait in non-killable way as well. These were spotted by Linus during the code review of some other patches. Let's allow the gup code to react to fatal signals to improve the responsiveness of threads when during gup and being killed. Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Tested-by: NBrian Geffon <bgeffon@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160256.9887-1-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
This is the gup counterpart of the change that allows the VM_FAULT_RETRY to happen for more than once. One thing to mention is that we must check the fatal signal here before retry because the GUP can be interrupted by that, otherwise we can loop forever. Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Tested-by: NBrian Geffon <bgeffon@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
The idea comes from a discussion between Linus and Andrea [1]. Before this patch we only allow a page fault to retry once. We achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing handle_mm_fault() the second time. This was majorly used to avoid unexpected starvation of the system by looping over forever to handle the page fault on a single page. However that should hardly happen, and after all for each code path to return a VM_FAULT_RETRY we'll first wait for a condition (during which time we should possibly yield the cpu) to happen before VM_FAULT_RETRY is really returned. This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY. It means that the page fault handler now can retry the page fault for multiple times if necessary without the need to generate another page fault event. Meanwhile we still keep the FAULT_FLAG_TRIED flag so page fault handler can still identify whether a page fault is the first attempt or not. Then we'll have these combinations of fault flags (only considering ALLOW_RETRY flag and TRIED flag): - ALLOW_RETRY and !TRIED: this means the page fault allows to retry, and this is the first try - ALLOW_RETRY and TRIED: this means the page fault allows to retry, and this is not the first try - !ALLOW_RETRY and !TRIED: this means the page fault does not allow to retry at all - !ALLOW_RETRY and TRIED: this is forbidden and should never be used In existing code we have multiple places that has taken special care of the first condition above by checking against (fault_flags & FAULT_FLAG_ALLOW_RETRY). This patch introduces a simple helper to detect the first retry of a page fault by checking against both (fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now even the 2nd try will have the ALLOW_RETRY set, then use that helper in all existing special paths. One example is in __lock_page_or_retry(), now we'll drop the mmap_sem only in the first attempt of page fault and we'll keep it in follow up retries, so old locking behavior will be retained. This will be a nice enhancement for current code [2] at the same time a supporting material for the future userfaultfd-writeprotect work, since in that work there will always be an explicit userfault writeprotect retry for protected pages, and if that cannot resolve the page fault (e.g., when userfaultfd-writeprotect is used in conjunction with swapped pages) then we'll possibly need a 3rd retry of the page fault. It might also benefit other potential users who will have similar requirement like userfault write-protection. GUP code is not touched yet and will be covered in follow up patch. Please read the thread below for more information. [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/ [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Suggested-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Tested-by: NBrian Geffon <bgeffon@google.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
When follow_hugetlb_page() returns with *locked==0, it means we've got a VM_FAULT_RETRY within the fauling process and we've released the mmap_sem. When that happens, we should stop and bail out. Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Tested-by: NBrian Geffon <bgeffon@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220155353.8676-3-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
Patch series "mm: Page fault enhancements", v6. This series contains cleanups and enhancements to current page fault logic. The whole idea comes from the discussion between Andrea and Linus on the bug reported by syzbot here: https://lkml.org/lkml/2017/11/2/833 Basically it does two things: (a) Allows the page fault logic to be more interactive on not only SIGKILL, but also the rest of userspace signals, and, (b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more than once. For (a): with the changes we should be able to react faster when page faults are working in parallel with userspace signals like SIGSTOP and SIGCONT (and more), and with that we can remove the buggy part in userfaultfd and benefit the whole page fault mechanism on faster signal processing to reach the userspace. For (b), we should be able to allow the page fault handler to loop for even more than twice. Some context: for now since we have FAULT_FLAG_ALLOW_RETRY we can allow to retry the page fault once with the same interrupt context, however never more than twice. This can be not only a potential cleanup to remove this assumption since AFAIU the code itself doesn't really have this twice-only limitation (though that should be a protective approach in the past), at the same time it'll greatly simplify future works like userfaultfd write-protect where it's possible to retry for more than twice (please have a look at [1] below for a possible user that might require the page fault to be handled for a third time; if we can remove the retry limitation we can simply drop that patch and those complexity). This patch (of 16): There's plenty of places around __get_user_pages() that has a parameter "nonblocking" which does not really mean that "it won't block" (because it can really block) but instead it shows whether the mmap_sem is released by up_read() during the page fault handling mostly when VM_FAULT_RETRY is returned. We have the correct naming in e.g. get_user_pages_locked() or get_user_pages_remote() as "locked", however there're still many places that are using the "nonblocking" as name. Renaming the places to "locked" where proper to better suite the functionality of the variable. While at it, fixing up some of the comments accordingly. Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Tested-by: NBrian Geffon <bgeffon@google.com> Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: NJerome Glisse <jglisse@redhat.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Martin Cracauer <cracauer@cons.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220155353.8676-2-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Anshuman Khandual 提交于
Currently the declaration and definition for is_vma_temporary_stack() are scattered. Lets make is_vma_temporary_stack() helper available for general use and also drop the declaration from (include/linux/huge_mm.h) which is no longer required. While at this, rename this as vma_is_temporary_stack() in line with existing helpers. This should not cause any functional change. Signed-off-by: NAnshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Anshuman Khandual 提交于
Patch series "mm/vma: some more minor changes", v2. The motivation here is to consolidate VMA flags and helpers in generic memory header and reduce code duplication when ever applicable. If there are other possible similar instances which might be missing here, please do let me me know. I will be happy to incorporate them. This patch (of 3): Move VM_NO_KHUGEPAGED into generic header (include/linux/mm.h). This just makes sure that no VMA flag is scattered in individual function files any longer. While at this, fix an old comment which is no longer valid. This should not cause any functional change. Signed-off-by: NAnshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1582782965-3274-2-git-send-email-anshuman.khandual@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Thomas Hellstrom 提交于
Following the update of pagewalk code commit a07984d48146 ("mm: pagewalk: add p4d_entry() and pgd_entry()") we can modify the mapping_dirty_helpers' huge page-table entry callbacks to avoid splitting when a huge pud or -pmd is encountered. Signed-off-by: NThomas Hellstrom <thellstrom@vmware.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NSteven Price <steven.price@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200203154305.15045-1-thomas_os@shipmail.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
If a task is getting moved out of the OOMing cgroup, it might result in unexpected OOM killings if memory.oom.group is used anywhere in the cgroup tree. Imagine the following example: A (oom.group = 1) / \ (OOM) B C Let's say B's memory.max is exceeded and it's OOMing. The OOM killer selects a task in B as a victim, but someone asynchronously moves the task into C. mem_cgroup_get_oom_group() will iterate over all ancestors of C up to the root cgroup. In theory it had to stop at the oom_domain level - the memory cgroup which is OOMing. But because B is not an ancestor of C, it's not happening. Instead it chooses A (because it's oom.group is set), and kills all tasks in A. This behavior is wrong because the OOM happened in B, so there is no reason to kill anything outside. Fix this by checking it the memory cgroup to which the task belongs is a descendant of the oom_domain. If not, memory.oom.group should be ignored, and the OOM killer should kill only the victim task. Reported-by: NDan Schatzberg <dschatzberg@fb.com> Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/20200316223510.3176148-1-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chris Down 提交于
The read side of this is all protected, but we can still tear if multiple iterations of mem_cgroup_protected are going at the same time. There's some intentional racing in mem_cgroup_protected which is ok, but load/store tearing should be avoided. Signed-off-by: NChris Down <chris@chrisdown.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/d1e9fbc0379fe8db475d82c8b6fbe048876e12ae.1584034301.git.chris@chrisdown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chris Down 提交于
The write side of this is xchg()/smp_mb(), so that's all good. Just a few sites missing a READ_ONCE. Signed-off-by: NChris Down <chris@chrisdown.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/bbec2c3d822217334855c8877a9d28b2a6d395fb.1584034301.git.chris@chrisdown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chris Down 提交于
This can be set concurrently with reads, which may cause the wrong value to be propagated. Signed-off-by: NChris Down <chris@chrisdown.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/e809b4e6b0c1626dac6945970de06409a180ee65.1584034301.git.chris@chrisdown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chris Down 提交于
This can be set concurrently with reads, which may cause the wrong value to be propagated. Signed-off-by: NChris Down <chris@chrisdown.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/448206f44b0fa7be9dad2ca2601d2bcb2c0b7844.1584034301.git.chris@chrisdown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chris Down 提交于
This one is a bit more nuanced because we have memcg_max_mutex, which is mostly just used for enforcing invariants, but we still need to READ_ONCE since (despite its name) it doesn't really protect memory.max access. On write (page_counter_set_max() and memory_max_write()) we use xchg(), which uses smp_mb(), so that's already fine. Signed-off-by: NChris Down <chris@chrisdown.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/50a31e5f39f8ae6c8fb73966ba1455f0924e8f44.1584034301.git.chris@chrisdown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chris Down 提交于
A mem_cgroup's high attribute can be concurrently set at the same time as we are trying to read it -- for example, if we are in memory_high_write at the same time as we are trying to do high reclaim. Signed-off-by: NChris Down <chris@chrisdown.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/2f66f7038ed1d4688e59de72b627ae0ea52efa83.1584034301.git.chris@chrisdown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vincenzo Frascino 提交于
mem_cgroup_id_get_many() is currently used only when MMU or MEMCG_SWAP configuration options are enabled. Having them disabled triggers the following warning at compile time: linux/mm/memcontrol.c:4797:13: warning: `mem_cgroup_id_get_many' defined but not used [-Wunused-function] static void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n) Make mem_cgroup_id_get_many() __maybe_unused to address the issue. Signed-off-by: NVincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NChris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200305164354.48147-1-vincenzo.frascino@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shakeel Butt 提交于
Currently multiple locations in memcg code, css_tryget_online() is being used. However it doesn't matter whether the cgroup is online for the callers. Online used to matter when we had reparenting on offlining and we needed a way to prevent new ones from showing up. The failure case for couple of these css_tryget_online usage is to fallback to root_mem_cgroup which kind of make bypassing the memcg limits possible for some workloads. For example creating an inotify group in a subcontainer and then deleting that container after moving the process to a different container will make all the event objects allocated for that group to the root_mem_cgroup. So, using css_tryget_online() is dangerous for such cases. Two locations still use the online version. The swapin of offlined memcg's pages and the memcg kmem cache creation. The kmem cache indeed needs the online version as the kernel does the reparenting of memcg kmem caches. For the swapin case, it has been left for later as the fallback is not really that concerning. With swap accounting enabled, if the memcg of the swapped out page is not online then the memcg extracted from the given 'mm' will be charged and if 'mm' is NULL then root memcg will be charged. However I could not find a code path where the given 'mm' will be NULL for swap-in case. Signed-off-by: NShakeel Butt <shakeelb@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/20200302203109.179417-1-shakeelb@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Right now, the effective protection of any given cgroup is capped by its own explicit memory.low setting, regardless of what the parent says. The reasons for this are mostly historical and ease of implementation: to make delegation of memory.low safe, effective protection is the min() of all memory.low up the tree. Unfortunately, this limitation makes it impossible to protect an entire subtree from another without forcing the user to make explicit protection allocations all the way to the leaf cgroups - something that is highly undesirable in real life scenarios. Consider memory in a data center host. At the cgroup top level, we have a distinction between system management software and the actual workload the system is executing. Both branches are further subdivided into individual services, job components etc. We want to protect the workload as a whole from the system management software, but that doesn't mean we want to protect and prioritize individual workload wrt each other. Their memory demand can vary over time, and we'd want the VM to simply cache the hottest data within the workload subtree. Yet, the current memory.low limitations force us to allocate a fixed amount of protection to each workload component in order to get protection from system management software in general. This results in very inefficient resource distribution. Another concern with mandating downward allocation is that, as the complexity of the cgroup tree grows, it gets harder for the lower levels to be informed about decisions made at the host-level. Consider a container inside a namespace that in turn creates its own nested tree of cgroups to run multiple workloads. It'd be extremely difficult to configure memory.low parameters in those leaf cgroups that on one hand balance pressure among siblings as the container desires, while also reflecting the host-level protection from e.g. rpm upgrades, that lie beyond one or more delegation and namespacing points in the tree. It's highly unusual from a cgroup interface POV that nested levels have to be aware of and reflect decisions made at higher levels for them to be effective. To enable such use cases and scale configurability for complex trees, this patch implements a resource inheritance model for memory that is similar to how the CPU and the IO controller implement work-conserving resource allocations: a share of a resource allocated to a subree always applies to the entire subtree recursively, while allowing, but not mandating, children to further specify distribution rules. That means that if protection is explicitly allocated among siblings, those configured shares are being followed during page reclaim just like they are now. However, if the memory.low set at a higher level is not fully claimed by the children in that subtree, the "floating" remainder is applied to each cgroup in the tree in proportion to its size. Since reclaim pressure is applied in proportion to size as well, each child in that tree gets the same boost, and the effect is neutral among siblings - with respect to each other, they behave as if no memory control was enabled at all, and the VM simply balances the memory demands optimally within the subtree. But collectively those cgroups enjoy a boost over the cgroups in neighboring trees. E.g. a leaf cgroup with a memory.low setting of 0 no longer means that it's not getting a share of the hierarchically assigned resource, just that it doesn't claim a fixed amount of it to protect from its siblings. This allows us to recursively protect one subtree (workload) from another (system management), while letting subgroups compete freely among each other - without having to assign fixed shares to each leaf, and without nested groups having to echo higher-level settings. The floating protection composes naturally with fixed protection. Consider the following example tree: A A: low = 2G / \ A1: low = 1G A1 A2 A2: low = 0G As outside pressure is applied to this tree, A1 will enjoy a fixed protection from A2 of 1G, but the remaining, unclaimed 1G from A is split evenly among A1 and A2, coming out to 1.5G and 0.5G. There is a slight risk of regressing theoretical setups where the top-level cgroups don't know about the true budgeting and set bogusly high "bypass" values that are meaningfully allocated down the tree. Such setups would rely on unclaimed protection to be discarded, and distributing it would change the intended behavior. Be safe and hide the new behavior behind a mount option, 'memory_recursiveprot'. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NChris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The effective protection of any given cgroup is a somewhat complicated construct that depends on the ancestor's configuration, siblings' configurations, as well as current memory utilization in all these groups. It's done this way to satisfy hierarchical delegation requirements while also making the configuration semantics flexible and expressive in complex real life scenarios. Unfortunately, all the rules and requirements are sparsely documented, and the code is a little too clever in merging different scenarios into a single min() expression. This makes it hard to reason about the implementation and avoid breaking semantics when making changes to it. This patch documents each semantic rule individually and splits out the handling of the overcommit case from the regular case. Michal Koutný also points out that the points of equilibrium as described in the existing example scenarios aren't actually accurate. Delete these examples for now to avoid confusion. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NChris Down <chris@chrisdown.name> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Link: http://lkml.kernel.org/r/20200227195606.46212-3-hannes@cmpxchg.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Patch series "mm: memcontrol: recursive memory.low protection", v3. The current memory.low (and memory.min) semantics require protection to be assigned to a cgroup in an untinterrupted chain from the top-level cgroup all the way to the leaf. In practice, we want to protect entire cgroup subtrees from each other (system management software vs. workload), but we would like the VM to balance memory optimally *within* each subtree, without having to make explicit weight allocations among individual components. The current semantics make that impossible. They also introduce unmanageable complexity into more advanced resource trees. For example: host root `- system.slice `- rpm upgrades `- logging `- workload.slice `- a container `- system.slice `- workload.slice `- job A `- component 1 `- component 2 `- job B At a host-level perspective, we would like to protect the outer workload.slice subtree as a whole from rpm upgrades, logging etc. But for that to be effective, right now we'd have to propagate it down through the container, the inner workload.slice, into the job cgroup and ultimately the component cgroups where memory is actually, physically allocated. This may cross several tree delegation points and namespace boundaries, which make such a setup near impossible. CPU and IO on the other hand are already distributed recursively. The user would simply configure allowances at the host level, and they would apply to the entire subtree without any downward propagation. To enable the above-mentioned usecases and bring memory in line with other resource controllers, this patch series extends memory.low/min such that settings apply recursively to the entire subtree. Users can still assign explicit shares in subgroups, but if they don't, any ancestral protection will be distributed such that children compete freely amongst each other - as if no memory control were enabled inside the subtree - but enjoy protection from neighboring trees. In the above example, the user would then be able to configure shares of CPU, IO and memory at the host level to comprehensively protect and isolate the workload.slice as a whole from system.slice activity. Patch #1 fixes an existing bug that can give a cgroup tree more protection than it should receive as per ancestor configuration. Patch #2 simplifies and documents the existing code to make it easier to reason about the changes in the next patch. Patch #3 finally implements recursive memory protection semantics. Because of a risk of regressing legacy setups, the new semantics are hidden behind a cgroup2 mount option, 'memory_recursiveprot'. More details in patch #3. This patch (of 3): When memory.low is overcommitted - i.e. the children claim more protection than their shared ancestor grants them - the allowance is distributed in proportion to how much each sibling uses their own declared protection: low_usage = min(memory.low, memory.current) elow = parent_elow * (low_usage / siblings_low_usage) However, siblings_low_usage is not the sum of all low_usages. It sums up the usages of *only those cgroups that are within their memory.low* That means that low_usage can be *bigger* than siblings_low_usage, and consequently the total protection afforded to the children can be bigger than what the ancestor grants the subtree. Consider three groups where two are in excess of their protection: A/memory.low = 10G A/A1/memory.low = 10G, memory.current = 20G A/A2/memory.low = 10G, memory.current = 20G A/A3/memory.low = 10G, memory.current = 8G siblings_low_usage = 8G (only A3 contributes) A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G (the 12.5G are capped to the explicit memory.low setting of 10G) With that, the sum of all awarded protection below A is 30G, when A only grants 10G for the entire subtree. What does this mean in practice? A1 and A2 would still be in excess of their 10G allowance and would be reclaimed, whereas A3 would not. As they eventually drop below their protection setting, they would be counted in siblings_low_usage again and the error would right itself. When reclaim was applied in a binary fashion (cgroup is reclaimed when it's above its protection, otherwise it's skipped) this would actually work out just fine. However, since 1bc63fb1 ("mm, memcg: make scan aggression always exclude protection"), reclaim pressure is scaled to how much a cgroup is above its protection. As a result this calculation error unduly skews pressure away from A1 and A2 toward the rest of the system. But why did we do it like this in the first place? The reasoning behind exempting groups in excess from siblings_low_usage was to go after them first during reclaim in an overcommitted subtree: A/memory.low = 2G, memory.current = 4G A/A1/memory.low = 3G, memory.current = 2G A/A2/memory.low = 1G, memory.current = 2G siblings_low_usage = 2G (only A1 contributes) A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G While the children combined are overcomitting A and are technically both at fault, A2 is actively declaring unprotected memory and we would like to reclaim that first. However, while this sounds like a noble goal on the face of it, it doesn't make much difference in actual memory distribution: Because A is overcommitted, reclaim will not stop once A2 gets pushed back to within its allowance; we'll have to reclaim A1 either way. The end result is still that protection is distributed proportionally, with A1 getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance. [ If A weren't overcommitted, it wouldn't make a difference since each cgroup would just get the protection it declares: A/memory.low = 2G, memory.current = 3G A/A1/memory.low = 1G, memory.current = 1G A/A2/memory.low = 1G, memory.current = 2G With the current calculation: siblings_low_usage = 1G (only A1 contributes) A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G Including excess groups in siblings_low_usage: siblings_low_usage = 2G A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G ] Simplify the calculation and fix the proportional reclaim bug by including excess cgroups in siblings_low_usage. After this patch, the effective memory.low distribution from the example above would be as follows: A/memory.low = 10G A/A1/memory.low = 10G, memory.current = 20G A/A2/memory.low = 10G, memory.current = 20G A/A3/memory.low = 10G, memory.current = 8G siblings_low_usage = 28G A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.5G A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.8G Fixes: 1bc63fb1 ("mm, memcg: make scan aggression always exclude protection") Fixes: 23067153 ("mm: memory.low hierarchical behavior") Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NChris Down <chris@chrisdown.name> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Link: http://lkml.kernel.org/r/20200227195606.46212-2-hannes@cmpxchg.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-