- 01 5月, 2021 40 次提交
-
-
由 Andrey Konovalov 提交于
This change adds an argument to kasan_poison() and kasan_unpoison() that allows initializing memory along with setting the tags for HW_TAGS. Combining setting allocation tags with memory initialization will improve HW_TAGS KASAN performance when init_on_alloc/free is enabled. This change doesn't integrate memory initialization with KASAN, this is done is subsequent patches in this series. Link: https://lkml.kernel.org/r/3054314039fa64510947e674180d675cab1b4c41.1615296150.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com> Reviewed-by: NMarco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Konovalov 提交于
Patch series "kasan: integrate with init_on_alloc/free", v3. This patch series integrates HW_TAGS KASAN with init_on_alloc/free by initializing memory via the same arm64 instruction that sets memory tags. This is expected to improve HW_TAGS KASAN performance when init_on_alloc/free is enabled. The exact perfomance numbers are unknown as MTE-enabled hardware doesn't exist yet. This patch (of 5): This change adds an argument to mte_set_mem_tag_range() that allows to enable memory initialization when settinh the allocation tags. The implementation uses stzg instruction instead of stg when this argument indicates to initialize memory. Combining setting allocation tags with memory initialization will improve HW_TAGS KASAN performance when init_on_alloc/free is enabled. This change doesn't integrate memory initialization with KASAN, this is done is subsequent patches in this series. Link: https://lkml.kernel.org/r/cover.1615296150.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/d04ae90cc36be3fe246ea8025e5085495681c3d7.1615296150.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com> Acked-by: NMarco Elver <elver@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Konovalov 提交于
During boot, all non-reserved memblock memory is exposed to page_alloc via memblock_free_pages->__free_pages_core(). This results in kasan_free_pages() being called, which poisons that memory. Poisoning all that memory lengthens boot time. The most noticeable effect is observed with the HW_TAGS mode. A boot-time impact may potentially also affect systems with large amount of RAM. This patch changes the tag-based modes to not poison the memory during the memblock->page_alloc transition. An exception is made for KASAN_GENERIC. Since it marks all new memory as accessible, not poisoning the memory released from memblock will lead to KASAN missing invalid boot-time accesses to that memory. With KASAN_SW_TAGS, as it uses the invalid 0xFE tag as the default tag for all memory, it won't miss bad boot-time accesses even if the poisoning of memblock memory is removed. With KASAN_HW_TAGS, the default memory tags values are unspecified. Therefore, if memblock poisoning is removed, this KASAN mode will miss the mentioned type of boot-time bugs with a 1/16 probability. This is taken as an acceptable trafe-off. Internally, the poisoning is removed as follows. __free_pages_core() is used when exposing fresh memory during system boot and when onlining memory during hotplug. This patch adds a new FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through free_pages_prepare() from __free_pages_core(). If FPI_SKIP_KASAN_POISON is set, kasan_free_pages() is not called. All memory allocated normally when the boot is over keeps getting poisoned as usual. Link: https://lkml.kernel.org/r/a0570dc1e3a8f39a55aa343a1fc08cd5c2d4cad6.1613692950.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com> Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Collingbourne 提交于
We can sometimes end up with kasan_byte_accessible() being called on non-slab memory. For example ksize() and krealloc() may end up calling it on KFENCE allocated memory. In this case the memory will be tagged with KASAN_SHADOW_INIT, which a subsequent patch ("kasan: initialize shadow to TAG_INVALID for SW_TAGS") will set to the same value as KASAN_TAG_INVALID, causing kasan_byte_accessible() to fail when called on non-slab memory. This highlighted the fact that the check in kasan_byte_accessible() was inconsistent with checks as implemented for loads and stores (kasan_check_range() in SW tags mode and hardware-implemented checks in HW tags mode). kasan_check_range() does not have a check for KASAN_TAG_INVALID, and instead has a comparison against KASAN_SHADOW_START. In HW tags mode, we do not have either, but we do set TCR_EL1.TCMA which corresponds with the comparison against KASAN_TAG_KERNEL. Therefore, update kasan_byte_accessible() for both SW and HW tags modes to correspond with the respective checks on loads and stores. Link: https://linux-review.googlesource.com/id/Ic6d40803c57dcc6331bd97fbb9a60b0d38a65a36 Link: https://lkml.kernel.org/r/20210405220647.1965262-1-pcc@google.comSigned-off-by: NPeter Collingbourne <pcc@google.com> Reviewed-by: NAndrey Konovalov <andreyknvl@gmail.com> Reviewed-by: NMarco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhiyuan Dai 提交于
strlcpy is marked as deprecated in Documentation/process/deprecated.rst, and there is no functional difference when the caller expects truncation (when not checking the return value). strscpy is relatively better as it also avoids scanning the whole source string. Link: https://lkml.kernel.org/r/1613970647-23272-1-git-send-email-daizhiyuan@phytium.com.cnSigned-off-by: NZhiyuan Dai <daizhiyuan@phytium.com.cn> Acked-by: NAlexander Potapenko <glider@google.com> Reviewed-by: NAndrey Konovalov <andreyknvl@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Uladzislau Rezki (Sony) 提交于
Link: https://lkml.kernel.org/r/20210402202237.20334-5-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Uladzislau Rezki (Sony) 提交于
Instead of keeping open-coded style, move the code related to preloading into a separate function. Therefore introduce the preload_this_cpu_lock() routine that prelaods a current CPU with one extra vmap_area object. There is no functional change as a result of this patch. Link: https://lkml.kernel.org/r/20210402202237.20334-4-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vijayanand Jitta 提交于
A potential use after free can occur in _vm_unmap_aliases where an already freed vmap_area could be accessed, Consider the following scenario: Process 1 Process 2 __vm_unmap_aliases __vm_unmap_aliases purge_fragmented_blocks_allcpus rcu_read_lock() rcu_read_lock() list_del_rcu(&vb->free_list) list_for_each_entry_rcu(vb .. ) __purge_vmap_area_lazy kmem_cache_free(va) va_start = vb->va->va_start Here Process 1 is in purge path and it does list_del_rcu on vmap_block and later frees the vmap_area, since Process 2 was holding the rcu lock at this time vmap_block will still be present in and Process 2 accesse it and thereby it tries to access vmap_area of that vmap_block which was already freed by Process 1 and this results in use after free. Fix this by adding a check for vb->dirty before accessing vmap_area structure since vb->dirty will be set to VMAP_BBMAP_BITS in purge path checking for this will prevent the use after free. Link: https://lkml.kernel.org/r/1616062105-23263-1-git-send-email-vjitta@codeaurora.orgSigned-off-by: NVijayanand Jitta <vjitta@codeaurora.org> Reviewed-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
There are several reasons why a vmalloc can fail, virtual space exhausted, page array allocation failure, page allocation failure, and kernel page table allocation failure. Add distinct warning messages for the main causes of failure, with some added information like page order or allocation size where applicable. [urezki@gmail.com: print correct vmalloc allocation size] Link: https://lkml.kernel.org/r/20210329193214.GA28602@pc638.lan Link: https://lkml.kernel.org/r/20210322021806.892164-6-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Cédric Le Goater <clg@kaod.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
This is a shim around vunmap_range, get rid of it. Move the main API comment from the _noflush variant to the normal variant, and make _noflush internal to mm/. [npiggin@gmail.com: fix nommu builds and a comment bug per sfr] Link: https://lkml.kernel.org/r/1617292598.m6g0knx24s.astroid@bobo.none [akpm@linux-foundation.org: move vunmap_range_noflush() stub inside !CONFIG_MMU, not !CONFIG_NUMA] [npiggin@gmail.com: fix nommu builds] Link: https://lkml.kernel.org/r/1617292497.o1uhq5ipxp.astroid@bobo.none Link: https://lkml.kernel.org/r/20210322021806.892164-5-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Cédric Le Goater <clg@kaod.org> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
Patch series "mm/vmalloc: cleanup after hugepage series", v2. Christoph pointed out some overdue cleanups required after the huge vmalloc series, and I had another failure error message improvement as well. This patch (of 5): This is a shim around vmap_pages_range, get rid of it. Move the main API comment from the _noflush variant to the normal variant, and make _noflush internal to mm/. Link: https://lkml.kernel.org/r/20210322021806.892164-1-npiggin@gmail.com Link: https://lkml.kernel.org/r/20210322021806.892164-2-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Cédric Le Goater <clg@kaod.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC enables support on architectures that define HAVE_ARCH_HUGE_VMAP and supports PMD sized vmap mappings. vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or larger, and fall back to small pages if that was unsuccessful. Architectures must ensure that any arch specific vmalloc allocations that require PAGE_SIZE mappings (e.g., module allocations vs strict module rwx) use the VM_NOHUGE flag to inhibit larger mappings. This can result in more internal fragmentation and memory overhead for a given allocation, an option nohugevmalloc is added to disable at boot. [colin.king@canonical.com: fix read of uninitialized pointer area] Link: https://lkml.kernel.org/r/20210318155955.18220-1-colin.king@canonical.com Link: https://lkml.kernel.org/r/20210317062402.533919-14-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
As a side-effect, the order of flush_cache_vmap() and arch_sync_kernel_mappings() calls are switched, but that now matches the other callers in this file. Link: https://lkml.kernel.org/r/20210317062402.533919-13-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
This is a generic kernel virtual memory mapper, not specific to ioremap. Code is unchanged other than making vmap_range non-static. Link: https://lkml.kernel.org/r/20210317062402.533919-12-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
This changes the awkward approach where architectures provide init functions to determine which levels they can provide large mappings for, to one where the arch is queried for each call. This removes code and indirection, and allows constant-folding of dead code for unsupported levels. This also adds a prot argument to the arch query. This is unused currently but could help with some architectures (e.g., some powerpc processors can't map uncacheable memory with large pages). Link: https://lkml.kernel.org/r/20210317062402.533919-7-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NDing Tianhong <dingtianhong@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Will Deacon <will@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
This will be used as a generic kernel virtual mapping function, so re-name it in preparation. Link: https://lkml.kernel.org/r/20210317062402.533919-6-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
The vmalloc mapper operates on a struct page * array rather than a linear physical address, re-name it to make this distinction clear. Link: https://lkml.kernel.org/r/20210317062402.533919-5-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
apply_to_pte_range might mistake a large pte for bad, or treat it as a page table, resulting in a crash or corruption. Add a test to warn and return error if large entries are found. Link: https://lkml.kernel.org/r/20210317062402.533919-4-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nicholas Piggin 提交于
vmalloc_to_page returns NULL for addresses mapped by larger pages[*]. Whether or not a vmap is huge depends on the architecture details, alignments, boot options, etc., which the caller can not be expected to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page. This change teaches vmalloc_to_page about larger pages, and returns the struct page that corresponds to the offset within the large page. This makes the API agnostic to mapping implementation details. [*] As explained by commit 029c54b0 ("mm/vmalloc.c: huge-vmap: fail gracefully on unexpected huge vmap mappings") [npiggin@gmail.com: sparc32: add stub pud_page define for walking huge vmalloc page tables] Link: https://lkml.kernel.org/r/20210324232825.1157363-1-npiggin@gmail.com Link: https://lkml.kernel.org/r/20210317062402.533919-3-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Serapheim Dimitropoulos 提交于
vread() has been linearly searching vmap_area_list for looking up vmalloc areas to read from. These same areas are also tracked by a rb_tree (vmap_area_root) which offers logarithmic lookup. This patch modifies vread() to use the rb_tree structure instead of the list and the speedup for heavy /proc/kcore readers can be pretty significant. Below are the wall clock measurements of a Python application that leverages the drgn debugging library to read and interpret data read from /proc/kcore. Before the patch: ----- $ time sudo sdb -e 'dbuf | head 3000 | wc' (unsigned long)3000 real 0m22.446s user 0m2.321s sys 0m20.690s ----- With the patch: ----- $ time sudo sdb -e 'dbuf | head 3000 | wc' (unsigned long)3000 real 0m2.104s user 0m2.043s sys 0m0.921s ----- Link: https://lkml.kernel.org/r/20210209190253.108763-1-serapheim@delphix.comSigned-off-by: NSerapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Hellwig 提交于
remap_vmalloc_range_partial is only used to implement remap_vmalloc_range and by procfs. Unexport it. Link: https://lkml.kernel.org/r/20210301082235.932968-3-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Cc: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wang Wensheng 提交于
sparse_buffer_init() and sparse_buffer_fini() should appear in pair, or a WARN issue would be through the next time sparse_buffer_init() runs. Add the missing sparse_buffer_fini() in error branch. Link: https://lkml.kernel.org/r/20210325113155.118574-1-wangwensheng4@huawei.com Fixes: 85c77f79 ("mm/sparse: add new sparse_init_nid() and sparse_init()") Signed-off-by: NWang Wensheng <wangwensheng4@huawei.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhiyuan Dai 提交于
strlcpy is marked as deprecated in Documentation/process/deprecated.rst, and there is no functional difference when the caller expects truncation (when not checking the return value). strscpy is relatively better as it also avoids scanning the whole source string. Link: https://lkml.kernel.org/r/1613962050-14188-1-git-send-email-daizhiyuan@phytium.com.cnSigned-off-by: NZhiyuan Dai <daizhiyuan@phytium.com.cn> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Brian Geffon 提交于
This reverts commit cd544fd1. As discussed in [1] this commit was a no-op because the mapping type was checked in vma_to_resize before move_vma is ever called. This meant that vm_ops->mremap() would never be called on such mappings. Furthermore, we've since expanded support of MREMAP_DONTUNMAP to non-anonymous mappings, and these special mappings are still protected by the existing check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EINVAL. 1. https://lkml.org/lkml/2020/12/28/2340 Link: https://lkml.kernel.org/r/20210323182520.2712101-2-bgeffon@google.comSigned-off-by: NBrian Geffon <bgeffon@google.com> Acked-by: NHugh Dickins <hughd@google.com> Reviewed-by: NDmitry Safonov <0x7f454c46@gmail.com> Cc: Alejandro Colomar <alx.manpages@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Michael S . Tsirkin" <mst@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Brian Geffon 提交于
Patch series "mm: Extend MREMAP_DONTUNMAP to non-anonymous mappings", v5. This patch (of 3): Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This restriction was placed initially for simplicity and not because there exists a technical reason to do so. This change will widen the support to include any mappings which are not VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support MREMAP_DONTUNMAP on mappings which may have been created from a memfd. This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL if VM_DONTEXPAND or VM_PFNMAP mappings are specified. Lokesh Gidra who works on the Android JVM, provided an explanation of how such a feature will improve Android JVM garbage collection: "Android is developing a new garbage collector (GC), based on userfaultfd. The garbage collector will use userfaultfd (uffd) on the java heap during compaction. On accessing any uncompacted page, the application threads will find it missing, at which point the thread will create the compacted page and then use UFFDIO_COPY ioctl to get it mapped and then resume execution. Before starting this compaction, in a stop-the-world pause the heap will be mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive UFFD_EVENT_PAGEFAULT events after resuming execution. To speedup mremap operations, pagetable movement was optimized by moving PUD entries instead of PTE entries [1]. It was necessary as mremap of even modest sized memory ranges also took several milliseconds, and stopping the application for that long isn't acceptable in response-time sensitive cases. With UFFDIO_CONTINUE feature [2], it will be even more efficient to implement this GC, particularly the 'non-moveable' portions of the heap. It will also help in reducing the need to copy (UFFDIO_COPY) the pages. However, for this to work, the java heap has to be on a 'shared' vma. Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap compaction." [1] https://lore.kernel.org/linux-mm/20201215030730.NC3CU98e4%25akpm@linux-foundation.org/ [2] https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmussen@google.com/ Link: https://lkml.kernel.org/r/20210323182520.2712101-1-bgeffon@google.comSigned-off-by: NBrian Geffon <bgeffon@google.com> Acked-by: NHugh Dickins <hughd@google.com> Tested-by: NLokesh Gidra <lokeshgidra@google.com> Reviewed-by: NDmitry Safonov <0x7f454c46@gmail.com> Cc: Alejandro Colomar <alx.manpages@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Michael S . Tsirkin" <mst@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
With NUMA balancing, in hint page fault handler, the faulting page will be migrated to the accessing node if necessary. During the migration, TLB will be shot down on all CPUs that the process has run on recently. Because in the hint page fault handler, the PTE will be made accessible before the migration is tried. The overhead of TLB shooting down can be high, so it's better to be avoided if possible. In fact, if we delay mapping the page until migration, that can be avoided. This is what this patch doing. For the multiple threads applications, it's possible that a page is accessed by multiple threads almost at the same time. In the original implementation, because the first thread will install the accessible PTE before migrating the page, the other threads may access the page directly before the page is made inaccessible again during migration. While with the patch, the second thread will go through the page fault handler too. And because of the PageLRU() checking in the following code path, migrate_misplaced_page() numamigrate_isolate_page() isolate_lru_page() the migrate_misplaced_page() will return 0, and the PTE will be made accessible in the second thread. This will introduce a little more overhead. But we think the possibility for a page to be accessed by the multiple threads at the same time is low, and the overhead difference isn't too large. If this becomes a problem in some workloads, we need to consider how to reduce the overhead. To test the patch, we run a test case as follows on a 2-socket Intel server (1 NUMA node per socket) with 128GB DRAM (64GB per socket). 1. Run a memory eater on NUMA node 1 to use 40GB memory before running pmbench. 2. Run pmbench (normal accessing pattern) with 8 processes, and 8 threads per process, so there are 64 threads in total. The working-set size of each process is 8960MB, so the total working-set size is 8 * 8960MB = 70GB. The CPU of all pmbench processes is bound to node 1. The pmbench processes will access some DRAM on node 0. 3. After the pmbench processes run for 10 seconds, kill the memory eater. Now, some pages will be migrated from node 0 to node 1 via NUMA balancing. Test results show that, with the patch, the pmbench throughput (page accesses/s) increases 5.5%. The number of the TLB shootdowns interrupts reduces 98% (from ~4.7e7 to ~9.7e5) with about 9.2e6 pages (35.8GB) migrated. From the perf profile, it can be found that the CPU cycles spent by try_to_unmap() and its callees reduces from 6.02% to 0.47%. That is, the CPU cycles spent by TLB shooting down decreases greatly. Link: https://lkml.kernel.org/r/20210408132236.1175607-1-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Reviewed-by: NMel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Matthew Wilcox" <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Michel Lespinasse <walken@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Hellwig 提交于
Add a helper that calls remap_pfn_range for an struct io_mapping, relying on the pgprot pre-validation done when creating the mapping instead of doing it at runtime. Link: https://lkml.kernel.org/r/20210326055505.1424432-3-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Hellwig 提交于
Patch series "add remap_pfn_range_notrack instead of reinventing it in i915", v2. i915 has some reason to want to avoid the track_pfn_remap overhead in remap_pfn_range. Add a function to the core VM to do just that rather than reinventing the functionality poorly in the driver. Note that the remap_io_sg path does get exercises when using Xorg on my Thinkpad X1, so this should be considered lightly tested, I've not managed to hit the remap_io_mapping path at all. This patch (of 4): Add a version of remap_pfn_range that does not call track_pfn_range. This will be used to fix horrible abuses of VM internals in the i915 driver. Link: https://lkml.kernel.org/r/20210326055505.1424432-1-hch@lst.de Link: https://lkml.kernel.org/r/20210326055505.1424432-2-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NDaniel Vetter <daniel.vetter@ffwll.ch> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhiyuan Dai 提交于
Add a comment explaining the value of the ISSTATIC parameter, Inform the reader that this is not a coding style issue. Link: https://lkml.kernel.org/r/1613964695-17614-1-git-send-email-daizhiyuan@phytium.com.cnSigned-off-by: NZhiyuan Dai <daizhiyuan@phytium.com.cn> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wang Qing 提交于
Smatch gives the warning: do_numa_page() warn: assigning (-11) to unsigned variable 'migrated' Link: https://lkml.kernel.org/r/1614603421-2681-1-git-send-email-wangqing@vivo.comSigned-off-by: NWang Qing <wangqing@vivo.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
When the unsigned page_counter underflows, even just by a few pages, a cgroup will not be able to run anything afterwards and trigger the OOM killer in a loop. Underflows shouldn't happen, but when they do in practice, we may just be off by a small amount that doesn't interfere with the normal operation - consequences don't need to be that dire. Reset the page_counter to 0 upon underflow. We'll issue a warning that the accounting will be off and then try to keep limping along. [ We used to do this with the original res_counter, where it was a more straight-forward correction inside the spinlock section. I didn't carry it forward into the lockless page counters for simplicity, but it turns out this is quite useful in practice. ] Link: https://lkml.kernel.org/r/20210408143155.2679744-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NChris Down <chris@chrisdown.name> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
There is only one user of __memcg_kmem_charge(), so manually inline __memcg_kmem_charge() to obj_cgroup_charge_pages(). Similarly manually inline __memcg_kmem_uncharge() into obj_cgroup_uncharge_pages() and call obj_cgroup_uncharge_pages() in obj_cgroup_release(). This is just code cleanup without any functionality changes. Link: https://lkml.kernel.org/r/20210319163821.20704-7-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
Since Roman's series "The new cgroup slab memory controller" applied. All slab objects are charged via the new APIs of obj_cgroup. The new APIs introduce a struct obj_cgroup to charge slab objects. It prevents long-living objects from pinning the original memory cgroup in the memory. But there are still some corner objects (e.g. allocations larger than order-1 page on SLUB) which are not charged via the new APIs. Those objects (include the pages which are allocated from buddy allocator directly) are charged as kmem pages which still hold a reference to the memory cgroup. We want to reuse the obj_cgroup APIs to charge the kmem pages. If we do that, we should store an object cgroup pointer to page->memcg_data for the kmem pages. Finally, page->memcg_data will have 3 different meanings. 1) For the slab pages, page->memcg_data points to an object cgroups vector. 2) For the kmem pages (exclude the slab pages), page->memcg_data points to an object cgroup. 3) For the user pages (e.g. the LRU pages), page->memcg_data points to a memory cgroup. We do not change the behavior of page_memcg() and page_memcg_rcu(). They are also suitable for LRU pages and kmem pages. Why? Because memory allocations pinning memcgs for a long time - it exists at a larger scale and is causing recurring problems in the real world: page cache doesn't get reclaimed for a long time, or is used by the second, third, fourth, ... instance of the same job that was restarted into a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, and make page reclaim very inefficient. We can convert LRU pages and most other raw memcg pins to the objcg direction to fix this problem, and then the page->memcg will always point to an object cgroup pointer. At that time, LRU pages and kmem pages will be treated the same. The implementation of page_memcg() will remove the kmem page check. This patch aims to charge the kmem pages by using the new APIs of obj_cgroup. Finally, the page->memcg_data of the kmem page points to an object cgroup. We can use the __page_objcg() to get the object cgroup associated with a kmem page. Or we can use page_memcg() to get the memory cgroup associated with a kmem page, but caller must ensure that the returned memcg won't be released (e.g. acquire the rcu_read_lock or css_set_lock). Link: https://lkml.kernel.org/r/20210401030141.37061-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210319163821.20704-6-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> [songmuchun@bytedance.com: fix forget to obtain the ref to objcg in split_page_memcg] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
Just like assignment to ug->memcg, we only need to update ug->dummy_page if memcg changed. So move it to there. This is a very small optimization. Link: https://lkml.kernel.org/r/20210319163821.20704-5-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
page_memcg() is not suitable for use by page_expected_state() and page_bad_reason(). Because it can BUG_ON() for the slab pages when CONFIG_DEBUG_VM is enabled. As neither lru, nor kmem, nor slab page should have anything left in there by the time the page is freed, what we care about is whether the value of page->memcg_data is 0. So just directly access page->memcg_data here. Link: https://lkml.kernel.org/r/20210319163821.20704-4-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
We know that the unit of slab object charging is bytes, the unit of kmem page charging is PAGE_SIZE. If we want to reuse obj_cgroup APIs to charge the kmem pages, we should pass PAGE_SIZE (as third parameter) to obj_cgroup_charge(). Because the size is already PAGE_SIZE, we can skip touch the objcg stock. And obj_cgroup_{un}charge_pages() are introduced to charge in units of page level. In the latter patch, we also can reuse those two helpers to charge or uncharge a number of kernel pages to a object cgroup. This is just a code movement without any functional changes. Link: https://lkml.kernel.org/r/20210319163821.20704-3-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
Patch series "Use obj_cgroup APIs to charge kmem pages", v5. Since Roman's series "The new cgroup slab memory controller" applied. All slab objects are charged with the new APIs of obj_cgroup. The new APIs introduce a struct obj_cgroup to charge slab objects. It prevents long-living objects from pinning the original memory cgroup in the memory. But there are still some corner objects (e.g. allocations larger than order-1 page on SLUB) which are not charged with the new APIs. Those objects (include the pages which are allocated from buddy allocator directly) are charged as kmem pages which still hold a reference to the memory cgroup. E.g. We know that the kernel stack is charged as kmem pages because the size of the kernel stack can be greater than 2 pages (e.g. 16KB on x86_64 or arm64). If we create a thread (suppose the thread stack is charged to memory cgroup A) and then move it from memory cgroup A to memory cgroup B. Because the kernel stack of the thread hold a reference to the memory cgroup A. The thread can pin the memory cgroup A in the memory even if we remove the cgroup A. If we want to see this scenario by using the following script. We can see that the system has added 500 dying cgroups (This is not a real world issue, just a script to show that the large kmallocs are charged as kmem pages which can pin the memory cgroup in the memory). #!/bin/bash cat /proc/cgroups | grep memory cd /sys/fs/cgroup/memory echo 1 > memory.move_charge_at_immigrate for i in range{1..500} do mkdir kmem_test echo $$ > kmem_test/cgroup.procs sleep 3600 & echo $$ > cgroup.procs echo `cat kmem_test/cgroup.procs` > cgroup.procs rmdir kmem_test done cat /proc/cgroups | grep memory This patchset aims to make those kmem pages to drop the reference to memory cgroup by using the APIs of obj_cgroup. Finally, we can see that the number of the dying cgroups will not increase if we run the above test script. This patch (of 7): The rcu_read_lock/unlock only can guarantee that the memcg will not be freed, but it cannot guarantee the success of css_get (which is in the refill_stock when cached memcg changed) to memcg. rcu_read_lock() memcg = obj_cgroup_memcg(old) __memcg_kmem_uncharge(memcg) refill_stock(memcg) if (stock->cached != memcg) // css_get can change the ref counter from 0 back to 1. css_get(&memcg->css) rcu_read_unlock() This fix is very like the commit: eefbfa7f ("mm: memcg/slab: fix use after free in obj_cgroup_charge") Fix this by holding a reference to the memcg which is passed to the __memcg_kmem_uncharge() before calling __memcg_kmem_uncharge(). Link: https://lkml.kernel.org/r/20210319163821.20704-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210319163821.20704-2-songmuchun@bytedance.com Fixes: 3de7d4f2 ("mm: memcg/slab: optimize objcg stock draining") Signed-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shakeel Butt 提交于
Currently the kernel adds the page, allocated for swapin, to the swapcache before charging the page. This is fine but now we want a per-memcg swapcache stat which is essential for folks who wants to transparently migrate from cgroup v1's memsw to cgroup v2's memory and swap counters. In addition charging a page before exposing it to other parts of the kernel is a step in the right direction. To correctly maintain the per-memcg swapcache stat, this patch has adopted to charge the page before adding it to swapcache. One challenge in this option is the failure case of add_to_swap_cache() on which we need to undo the mem_cgroup_charge(). Specifically undoing mem_cgroup_uncharge_swap() is not simple. To resolve the issue, this patch decouples the charging for swapin pages from mem_cgroup_charge(). Two new functions are introduced, mem_cgroup_swapin_charge_page() for just charging the swapin page and mem_cgroup_swapin_uncharge_swap() for uncharging the swap slot once the page has been successfully added to the swapcache. [shakeelb@google.com: set page->private before calling swap_readpage] Link: https://lkml.kernel.org/r/20210318015959.2986837-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20210305212639.775498-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NHugh Dickins <hughd@google.com> Tested-by: NHeiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
There are two functions to flush the per-cpu data of an lruvec into the rest of the cgroup tree: when the cgroup is being freed, and when a CPU disappears during hotplug. The difference is whether all CPUs or just one is being collected, but the rest of the flushing code is the same. Merge them into one function and share the common code. Link: https://lkml.kernel.org/r/20210209163304.77088-8-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Replace the memory controller's custom hierarchical stats code with the generic rstat infrastructure provided by the cgroup core. The current implementation does batched upward propagation from the write side (i.e. as stats change). The per-cpu batches introduce an error, which is multiplied by the number of subgroups in a tree. In systems with many CPUs and sizable cgroup trees, the error can be large enough to confuse users (e.g. 32 batch pages * 32 CPUs * 32 subgroups results in an error of up to 128M per stat item). This can entirely swallow allocation bursts inside a workload that the user is expecting to see reflected in the statistics. In the past, we've done read-side aggregation, where a memory.stat read would have to walk the entire subtree and add up per-cpu counts. This became problematic with lazily-freed cgroups: we could have large subtrees where most cgroups were entirely idle. Hence the switch to change-driven upward propagation. Unfortunately, it needed to trade accuracy for speed due to the write side being so hot. Rstat combines the best of both worlds: from the write side, it cheaply maintains a queue of cgroups that have pending changes, so that the read side can do selective tree aggregation. This way the reported stats will always be precise and recent as can be, while the aggregation can skip over potentially large numbers of idle cgroups. The way rstat works is that it implements a tree for tracking cgroups with pending local changes, as well as a flush function that walks the tree upwards. The controller then drives this by 1) telling rstat when a local cgroup stat changes (e.g. mod_memcg_state) and 2) when a flush is required to get uptodate hierarchy stats for a given subtree (e.g. when memory.stat is read). The controller also provides a flush callback that is called during the rstat flush walk for each cgroup and aggregates its local per-cpu counters and propagates them upwards. This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT + NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward aggregation. It removes 3 words from the per-cpu data. It eliminates memcg_exact_page_state(), since memcg_page_state() is now exact. [akpm@linux-foundation.org: merge fix] [hannes@cmpxchg.org: fix a sleep in atomic section problem] Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Acked-by: NBalbir Singh <bsingharora@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-