- 30 6月, 2021 40 次提交
-
-
由 Kuan-Ying Lee 提交于
Patch series "kasan: add memory corruption identification support for hw tag-based kasan", v4. Add memory corruption identification for hardware tag-based KASAN mode. This patch (of 3): Rename CONFIG_KASAN_SW_TAGS_IDENTIFY to CONFIG_KASAN_TAGS_IDENTIFY in order to be compatible with hardware tag-based mode. Link: https://lkml.kernel.org/r/20210626100931.22794-1-Kuan-Ying.Lee@mediatek.com Link: https://lkml.kernel.org/r/20210626100931.22794-2-Kuan-Ying.Lee@mediatek.comSigned-off-by: NKuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Suggested-by: NMarco Elver <elver@google.com> Reviewed-by: NAlexander Potapenko <glider@google.com> Reviewed-by: NAndrey Konovalov <andreyknvl@gmail.com> Reviewed-by: NMarco Elver <elver@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Daniel Axtens 提交于
powerpc has a variable number of PTRS_PER_*, set at runtime based on the MMU that the kernel is booted under. This means the PTRS_PER_* are no longer constants, and therefore breaks the build. Switch to using MAX_PTRS_PER_*, which are constant. Link: https://lkml.kernel.org/r/20210624034050.511391-5-dja@axtens.netSigned-off-by: NDaniel Axtens <dja@axtens.net> Suggested-by: NChristophe Leroy <christophe.leroy@csgroup.eu> Suggested-by: NBalbir Singh <bsingharora@gmail.com> Reviewed-by: NChristophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: NBalbir Singh <bsingharora@gmail.com> Reviewed-by: NMarco Elver <elver@google.com> Reviewed-by: NAndrey Konovalov <andreyknvl@gmail.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Daniel Axtens 提交于
Allow architectures to define a kasan_arch_is_ready() hook that bails out of any function that's about to touch the shadow unless the arch says that it is ready for the memory to be accessed. This is fairly uninvasive and should have a negligible performance penalty. This will only work in outline mode, so an arch must specify ARCH_DISABLE_KASAN_INLINE if it requires this. Link: https://lkml.kernel.org/r/20210624034050.511391-3-dja@axtens.netSigned-off-by: NDaniel Axtens <dja@axtens.net> Reviewed-by: NMarco Elver <elver@google.com> Suggested-by: NChristophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: NAndrey Konovalov <andreyknvl@gmail.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexander Potapenko 提交于
Most of the contents of KASAN reports are printed with pr_err(), so use a consistent logging level to print the memory access stacks. Link: https://lkml.kernel.org/r/20210506105405.3535023-2-glider@google.comSigned-off-by: NAlexander Potapenko <glider@google.com> Reviewed-by: NMarco Elver <elver@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Prasad Sodagudi <psodagud@quicinc.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: he, bo <bo.he@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rafael Aquini 提交于
On non-preemptible kernel builds the watchdog can complain about soft lockups when vfree() is called against large vmalloc areas: [ 210.851798] kvmalloc-test: vmalloc(2199023255552) succeeded [ 238.654842] watchdog: BUG: soft lockup - CPU#181 stuck for 26s! [rmmod:5203] [ 238.662716] Modules linked in: kvmalloc_test(OE-) ... [ 238.772671] CPU: 181 PID: 5203 Comm: rmmod Tainted: G S OE 5.13.0-rc7+ #1 [ 238.781413] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYXCRB1.86B.0553.D01.1809190614 09/19/2018 [ 238.792383] RIP: 0010:free_unref_page+0x52/0x60 [ 238.797447] Code: 48 c1 fd 06 48 89 ee e8 9c d0 ff ff 84 c0 74 19 9c 41 5c fa 48 89 ee 48 89 df e8 b9 ea ff ff 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 29 77 [ 238.818406] RSP: 0018:ffffb4d87868fe98 EFLAGS: 00000206 [ 238.824236] RAX: 0000000000000000 RBX: 000000001da0c945 RCX: ffffb4d87868fe40 [ 238.832200] RDX: ffffd79d3beed108 RSI: ffffd7998501dc08 RDI: ffff9c6fbffd7010 [ 238.840166] RBP: 000000000d518cbd R08: ffffd7998501dc08 R09: 0000000000000001 [ 238.848131] R10: 0000000000000000 R11: ffffd79d3beee088 R12: 0000000000000202 [ 238.856095] R13: ffff9e5be3eceec0 R14: 0000000000000000 R15: 0000000000000000 [ 238.864059] FS: 00007fe082c2d740(0000) GS:ffff9f4c69b40000(0000) knlGS:0000000000000000 [ 238.873089] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 238.879503] CR2: 000055a000611128 CR3: 000000f6094f6006 CR4: 00000000007706e0 [ 238.887467] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 238.895433] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 238.903397] PKRU: 55555554 [ 238.906417] Call Trace: [ 238.909149] __vunmap+0x17c/0x220 [ 238.912851] __x64_sys_delete_module+0x13a/0x250 [ 238.918008] ? syscall_trace_enter.isra.20+0x13c/0x1b0 [ 238.923746] do_syscall_64+0x39/0x80 [ 238.927740] entry_SYSCALL_64_after_hwframe+0x44/0xae Like in other range zapping routines that iterate over a large list, lets just add cond_resched() within __vunmap()'s page-releasing loop in order to avoid the watchdog splats. Link: https://lkml.kernel.org/r/20210622225030.478384-1-aquini@redhat.comSigned-off-by: NRafael Aquini <aquini@redhat.com> Acked-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: NAaron Tomlin <atomlin@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Uladzislau Rezki 提交于
Currently for order-0 pages we use a bulk-page allocator to get set of pages. From the other hand not allocating all pages is something that might occur. In that case we should fallbak to the single-page allocator trying to get missing pages, because it is more permissive(direct reclaim, etc). Introduce a vm_area_alloc_pages() function where the described logic is implemented. Link: https://lkml.kernel.org/r/20210521130718.GA17882@pc638.lanSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Uladzislau Rezki (Sony) 提交于
A checkpatch.pl script complains on splitting a text across lines. It is because if a user wants to find an entire string he or she will not succeeded. <snip> WARNING: quoted string split across lines + "vmalloc size %lu allocation failure: " + "page order %u allocation failed", total: 0 errors, 1 warnings, 10 lines checked <snip> Link: https://lkml.kernel.org/r/20210521204359.19943-1-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Uladzislau Rezki (Sony) 提交于
When a memory allocation for array of pages are not succeed emit a warning message as a first step and then perform the further cleanup. The reason it should be done in a right order is the clean up function which is free_vm_area() can potentially also follow its error paths what can lead to confusion what was broken first. Link: https://lkml.kernel.org/r/20210516202056.2120-4-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Uladzislau Rezki (Sony) 提交于
Recently there has been introduced a page bulk allocator for users which need to get number of pages per one call request. For order-0 pages switch to an alloc_pages_bulk_array_node() instead of alloc_pages_node(), the reason is the former is not capable of allocating set of pages, thus a one call is per one page. Second, according to my tests the bulk allocator uses less cycles even for scenarios when only one page is requested. Running the "perf" on same test case shows below difference: <default> - 45.18% __vmalloc_node - __vmalloc_node_range - 35.60% __alloc_pages - get_page_from_freelist 3.36% __list_del_entry_valid 3.00% check_preemption_disabled 1.42% prep_new_page <default> <patch> - 31.00% __vmalloc_node - __vmalloc_node_range - 14.48% __alloc_pages_bulk 3.22% __list_del_entry_valid - 0.83% __alloc_pages get_page_from_freelist <patch> The "test_vmalloc.sh" also shows performance improvements: fix_size_alloc_test_4MB loops: 1000000 avg: 89105095 usec fix_size_alloc_test loops: 1000000 avg: 513672 usec full_fit_alloc_test loops: 1000000 avg: 748900 usec long_busy_list_alloc_test loops: 1000000 avg: 8043038 usec random_size_alloc_test loops: 1000000 avg: 4028582 usec fix_align_alloc_test loops: 1000000 avg: 1457671 usec fix_size_alloc_test_4MB loops: 1000000 avg: 62083711 usec fix_size_alloc_test loops: 1000000 avg: 449207 usec full_fit_alloc_test loops: 1000000 avg: 735985 usec long_busy_list_alloc_test loops: 1000000 avg: 5176052 usec random_size_alloc_test loops: 1000000 avg: 2589252 usec fix_align_alloc_test loops: 1000000 avg: 1365009 usec For example 4MB allocations illustrates ~30% gain, all the rest is also better. Link: https://lkml.kernel.org/r/20210516202056.2120-3-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 YueHaibing 提交于
Use DEVICE_ATTR_RO() helper instead of plain DEVICE_ATTR(), which makes the code a bit shorter and easier to read. Link: https://lkml.kernel.org/r/20210524112852.34716-1-yuehaibing@huawei.comSigned-off-by: NYueHaibing <yuehaibing@huawei.com> Reviewed-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Liam Howlett 提交于
vma_lookup() finds the vma of a specific address with a cleaner interface and is more readable. Link: https://lkml.kernel.org/r/20210521174745.2219620-23-Liam.Howlett@Oracle.comSigned-off-by: NLiam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: NLaurent Dufour <ldufour@linux.ibm.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NDavidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Liam Howlett 提交于
Use vma_lookup() to find the VMA at a specific address. As vma_lookup() will return NULL if the address is not within any VMA, the start address no longer needs to be validated. Link: https://lkml.kernel.org/r/20210521174745.2219620-22-Liam.Howlett@Oracle.comSigned-off-by: NLiam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: NLaurent Dufour <ldufour@linux.ibm.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NDavidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Liam Howlett 提交于
Use vma_lookup() to find the VMA at a specific address. As vma_lookup() will return NULL if the address is not within any VMA, the start address no longer needs to be validated. Link: https://lkml.kernel.org/r/20210521174745.2219620-21-Liam.Howlett@Oracle.comSigned-off-by: NLiam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: NLaurent Dufour <ldufour@linux.ibm.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NDavidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Liam Howlett 提交于
Use vma_lookup() to find the VMA at a specific address. As vma_lookup() will return NULL if the address is not within any VMA, the start address no longer needs to be validated. Link: https://lkml.kernel.org/r/20210521174745.2219620-20-Liam.Howlett@Oracle.comSigned-off-by: NLiam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: NLaurent Dufour <ldufour@linux.ibm.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NDavidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Liam Howlett 提交于
Use vma_lookup() to find the VMA at a specific address. As vma_lookup() will return NULL if the address is not within any VMA, the start address no longer needs to be validated. Link: https://lkml.kernel.org/r/20210521174745.2219620-19-Liam.Howlett@Oracle.comSigned-off-by: NLiam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: NLaurent Dufour <ldufour@linux.ibm.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NDavidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Liu Xiang 提交于
Fix the return value in comment of finish_mkwrite_fault(). Link: https://lkml.kernel.org/r/20210513093931.15234-1-liu.xiang@zlingsmart.comSigned-off-by: NLiu Xiang <liu.xiang@zlingsmart.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Liam Howlett 提交于
Using find_vma_intersection() avoids the need for a temporary variable and makes the code cleaner. Link: https://lkml.kernel.org/r/20210511014328.2902782-1-Liam.Howlett@Oracle.comSigned-off-by: NLiam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Liam Howlett 提交于
Both __do_munmap() and exit_mmap() unlock a range of VMAs using almost identical code blocks. Replace both blocks by a static inline function. [akpm@linux-foundation.org: tweak code layout] Link: https://lkml.kernel.org/r/20210510211021.2797427-1-Liam.Howlett@Oracle.comSigned-off-by: NLiam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NDavidlohr Bueso <dbueso@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Logic of find_vma_intersection() is repeated in __do_munmap(). Also, prev is assigned a value before checking vma->vm_start >= end which might end up on a return statement making that assignment useless. Calling find_vma_intersection() checks that condition and returns NULL if no vma is found, hence only the !vma check is needed in __do_munmap(). Link: https://lkml.kernel.org/r/20210409162129.18313-1-gmjuareztello@gmail.comSigned-off-by: NGonzalo Matias Juarez Tello <gmjuareztello@gmail.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Hildenbrand 提交于
Let's also remove masking off MAP_EXECUTABLE from ksys_mmap_pgoff(): the last in-tree occurrence of MAP_EXECUTABLE is now in LEGACY_MAP_MASK, which accepts the flag e.g., for MAP_SHARED_VALIDATE; however, the flag is ignored throughout the kernel now. Add a comment to LEGACY_MAP_MASK stating that MAP_EXECUTABLE is ignored. Link: https://lkml.kernel.org/r/20210421093453.6904-4-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: NKees Cook <keescook@chromium.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Schatzberg 提交于
The current code only associates with the existing blkcg when aio is used to access the backing file. This patch covers all types of i/o to the backing file and also associates the memcg so if the backing file is on tmpfs, memory is charged appropriately. This patch also exports cgroup_get_e_css and int_active_memcg so it can be used by the loop module. Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.comSigned-off-by: NDan Schatzberg <schatzberg.dan@gmail.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NJens Axboe <axboe@kernel.dk> Cc: Chris Down <chris@chrisdown.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Schatzberg 提交于
set_active_memcg() worked for kernel allocations but was silently ignored for user pages. This patch establishes a precedence order for who gets charged: 1. If there is a memcg associated with the page already, that memcg is charged. This happens during swapin. 2. If an explicit mm is passed, mm->memcg is charged. This happens during page faults, which can be triggered in remote VMs (eg gup). 3. Otherwise consult the current process context. If there is an active_memcg, use that. Otherwise, current->mm->memcg. Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it would always charge the root cgroup. Now it looks up the active_memcg first (falling back to charging the root cgroup if not set). Link: https://lkml.kernel.org/r/20210610173944.1203706-3-schatzberg.dan@gmail.comSigned-off-by: NDan Schatzberg <schatzberg.dan@gmail.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NChris Down <chris@chrisdown.name> Acked-by: NJens Axboe <axboe@kernel.dk> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NMichal Koutný <mkoutny@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
The noinline_for_stack is introduced by commit 66635629 ("vmscan: set up pagevec as late as possible in shrink_inactive_list()"), its purpose is to delay the allocation of pagevec as late as possible to save stack memory. But the commit 2bcf8879 ("mm: take pagevecs off reclaim stack") replace pagevecs by lists of pages_to_free. So we do not need noinline_for_stack, just remove it (let the compiler decide whether to inline). Link: https://lkml.kernel.org/r/20210417043538.9793-9-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
The css_set_lock is used to guard the list of inherited objcgs. So there is no need to uncharge kernel memory under css_set_lock. Just move it out of the lock. Link: https://lkml.kernel.org/r/20210417043538.9793-8-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by the css_set_lock. We do not need to care about objcg->memcg being released in the process of obj_cgroup_release(). So there is no need to pin memcg before releasing objcg. Remove those pinning logic to simplfy the code. There are only two places that modifies the objcg->memcg. One is the initialization to objcg->memcg in the memcg_online_kmem(), another is objcgs reparenting in the memcg_reparent_objcgs(). It is also impossible for the two to run in parallel. So xchg() is unnecessary and it is enough to use WRITE_ONCE(). Link: https://lkml.kernel.org/r/20210417043538.9793-7-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
lruvec_holds_page_lru_lock() doesn't check anything about locking and is used to check whether the page belongs to the lruvec. So rename it to page_matches_lruvec(). Link: https://lkml.kernel.org/r/20210417043538.9793-6-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as the 2nd parameter to it (except isolate_migratepages_block()). But for isolate_migratepages_block(), the page_pgdat(page) is also equal to the local variable of @pgdat. So mem_cgroup_page_lruvec() do not need the pgdat parameter. Just remove it to simplify the code. Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
When mm is NULL, we do not need to hold rcu lock and call css_tryget for the root memcg. And we also do not need to check !mm in every loop of while. So bail out early when !mm. Link: https://lkml.kernel.org/r/20210417043538.9793-3-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
Patch series "memcontrol code cleanup and simplification", v3. This patch (of 8): The pages aren't accounted at the root level, so do not charge the page to the root memcg in page replacement. Although we do not display the value (mem_cgroup_usage) so there shouldn't be any actual problem, but there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it will trigger? So it is better to fix it. Link: https://lkml.kernel.org/r/20210417043538.9793-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210417043538.9793-2-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Muchun Song 提交于
The below scenario can cause the page counters of the root_mem_cgroup to be out of balance. CPU0: CPU1: objcg = get_obj_cgroup_from_current() obj_cgroup_charge_pages(objcg) memcg_reparent_objcgs() // reparent to root_mem_cgroup WRITE_ONCE(iter->memcg, parent) // memcg == root_mem_cgroup memcg = get_mem_cgroup_from_objcg(objcg) // do not charge to the root_mem_cgroup try_charge(memcg) obj_cgroup_uncharge_pages(objcg) memcg = get_mem_cgroup_from_objcg(objcg) // uncharge from the root_mem_cgroup refill_stock(memcg) drain_stock(memcg) page_counter_uncharge(&memcg->memory) get_obj_cgroup_from_current() never returns a root_mem_cgroup's objcg, so we never explicitly charge the root_mem_cgroup. And it's not going to change. It's all about a race when we got an obj_cgroup pointing at some non-root memcg, but before we were able to charge it, the cgroup was gone, objcg was reparented to the root and so we're skipping the charging. Then we store the objcg pointer and later use to uncharge the root_mem_cgroup. This can cause the page counter to be less than the actual value. Although we do not display the value (mem_cgroup_usage) so there shouldn't be any actual problem, but there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it will trigger? So it is better to fix it. Link: https://lkml.kernel.org/r/20210425075410.19255-1-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
The KMALLOC_NORMAL (kmalloc-<n>) caches are for unaccounted objects only when CONFIG_MEMCG_KMEM is enabled. To make sure that this condition remains true, we will have to prevent KMALOC_NORMAL caches to merge with other kmem caches. This is now done by setting its refcount to -1 right after its creation. Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com> Suggested-by: NRoman Gushchin <guro@fb.com> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
There are currently two problems in the way the objcg pointer array (memcg_data) in the page structure is being allocated and freed. On its allocation, it is possible that the allocated objcg pointer array comes from the same slab that requires memory accounting. If this happens, the slab will never become empty again as there is at least one object left (the obj_cgroup array) in the slab. When it is freed, the objcg pointer array object may be the last one in its slab and hence causes kfree() to be called again. With the right workload, the slab cache may be set up in a way that allows the recursive kfree() calling loop to nest deep enough to cause a kernel stack overflow and panic the system. One way to solve this problem is to split the kmalloc-<n> caches (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc-<n> (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of kmalloc-cg-<n> (KMALLOC_CGROUP) caches for accounted objects only. All the other caches can still allow a mix of accounted and unaccounted objects. With this change, all the objcg pointer array objects will come from KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So both the recursive kfree() problem and non-freeable slab problem are gone. Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have mixed accounted and unaccounted objects, this will slightly reduce the number of objcg pointer arrays that need to be allocated and save a bit of memory. On the other hand, creating a new set of kmalloc caches does have the effect of reducing cache utilization. So it is properly a wash. The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches() will include the newly added caches without change. [vbabka@suse.cz: don't create kmalloc-cg caches with cgroup.memory=nokmem] Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com [akpm@linux-foundation.org: un-fat-finger v5 delta creation] [longman@redhat.com: disable cache merging for KMALLOC_NORMAL caches] Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210505200610.13943-3-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Suggested-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> [longman@redhat.com: fix for CONFIG_ZONE_DMA=n] Suggested-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
Patch series "mm: memcg/slab: Fix objcg pointer array handling problem", v4. Since the merging of the new slab memory controller in v5.9, the page structure stores a pointer to objcg pointer array for slab pages. When the slab has no used objects, it can be freed in free_slab() which will call kfree() to free the objcg pointer array in memcg_alloc_page_obj_cgroups(). If it happens that the objcg pointer array is the last used object in its slab, that slab may then be freed which may caused kfree() to be called again. With the right workload, the slab cache may be set up in a way that allows the recursive kfree() calling loop to nest deep enough to cause a kernel stack overflow and panic the system. In fact, we have a reproducer that can cause kernel stack overflow on a s390 system involving kmalloc-rcl-256 and kmalloc-rcl-128 slabs with the following kfree() loop recursively called 74 times: [ 285.520739] [<000000000ec432fc>] kfree+0x4bc/0x560 [ 285.520740] [<000000000ec43466>] __free_slab+0xc6/0x228 [ 285.520741] [<000000000ec41fc2>] __slab_free+0x3c2/0x3e0 [ 285.520742] [<000000000ec432fc>] kfree+0x4bc/0x560 : While investigating this issue, I also found an issue on the allocation side. If the objcg pointer array happen to come from the same slab or a circular dependency linkage is formed with multiple slabs, those affected slabs can never be freed again. This patch series addresses these two issues by introducing a new set of kmalloc-cg-<n> caches split from kmalloc-<n> caches. The new set will only contain non-reclaimable and non-dma objects that are accounted in memory cgroups whereas the old set are now for unaccounted objects only. By making this split, all the objcg pointer arrays will come from the kmalloc-<n> caches, but those caches will never hold any objcg pointer array. As a result, deeply nested kfree() call and the unfreeable slab problems are now gone. This patch (of 4): Since the merging of the new slab memory controller in v5.9, the page structure may store a pointer to obj_cgroup pointer array for slab pages. Currently, only the __GFP_ACCOUNT bit is masked off. However, the array is not readily reclaimable and doesn't need to come from the DMA buffer. So those GFP bits should be masked off as well. Do the flag bit clearing at memcg_alloc_page_obj_cgroups() to make sure that it is consistently applied no matter where it is called. Link: https://lkml.kernel.org/r/20210505200610.13943-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210505200610.13943-2-longman@redhat.com Fixes: 286e04b8 ("mm: memcg/slab: allocate obj_cgroups for non-root slab pages") Signed-off-by: NWaiman Long <longman@redhat.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured amount of kmem_cache_alloc() calls from non-task context was about 0.01% of the total. The irq disable/enable sequence used in this case to access content from object stock is slow. To optimize for user context access, there are now two sets of object stocks (in the new obj_stock structure) for task context and interrupt context access respectively. The task context object stock can be accessed after disabling preemption which is cheap in non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupt. User context code can access interrupt object stock, but not vice versa. The downside of this change is that there are more data stored in local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance. [longman@redhat.com: fix potential uninitialized variable warning] Link: https://lkml.kernel.org/r/20210526193602.8742-1-longman@redhat.com [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20210506150007.16288-5-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com> Acked-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <guro@fb.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
There are two issues with the current refill_obj_stock() code. First of all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to atomically flush out remaining bytes to obj_cgroup, clear cached_objcg and do a obj_cgroup_put(). It is likely that the same obj_cgroup will be used again which leads to another call to drain_obj_stock() and obj_cgroup_get() as well as atomically retrieve the available byte from obj_cgroup. That is costly. Instead, we should just uncharge the excess pages, reduce the stock bytes and be done with it. The drain_obj_stock() function should only be called when obj_cgroup changes. Secondly, when charging an object of size not less than a page in obj_cgroup_charge(), it is possible that the remaining bytes to be refilled to the stock will overflow a page and cause refill_obj_stock() to uncharge 1 page. To avoid the additional uncharge in this case, a new allow_uncharge flag is added to refill_obj_stock() which will be set to false when called from obj_cgroup_charge() so that an uncharge_pages() call won't be issued right after a charge_pages() call unless the objcg changes. A multithreaded kmalloc+kfree microbenchmark on a 2-socket 48-core 96-thread x86-64 system with 96 testing threads were run. Before this patch, the total number of kilo kmalloc+kfree operations done for a 4k large object by all the testing threads per second were 4,304 kops/s (cgroup v1) and 8,478 kops/s (cgroup v2). After applying this patch, the number were 4,731 (cgroup v1) and 418,142 (cgroup v2) respectively. This represents a performance improvement of 1.10X (cgroup v1) and 49.3X (cgroup v2). Link: https://lkml.kernel.org/r/20210506150007.16288-4-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
Before the new slab memory controller with per object byte charging, charging and vmstat data update happen only when new slab pages are allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls. The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup to reduce that overhead. To further reducing it, this patch makes the vmstat data cached in the memcg_stock_pcp structure as well until it accumulates a page size worth of update or when other cached data change. Caching the vmstat data in the per-cpu stock eliminates two writes to non-hot cachelines for memcg specific as well as memcg-lruvecs specific vmstat data by a write to a hot local stock cacheline. On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, it was found that about 20% (634400 out of 3243830) of the time when mod_objcg_state() is called leads to an actual call to __mod_objcg_state() after initial boot. When doing parallel kernel build, the figure was about 17% (24329265 out of 142512465). So caching the vmstat data reduces the number of calls to __mod_objcg_state() by more than 80%. Link: https://lkml.kernel.org/r/20210506150007.16288-3-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
Patch series "mm/memcg: Reduce kmemcache memory accounting overhead", v6. With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free(). For workloads that require a lot of kmemcache allocations and de-allocations, they may experience performance regression as illustrated in [1] and [2]. A simple kernel module that performs repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() of either a small 32-byte object or a big 4k object at module init time with a batch size of 4 (4 kmalloc's followed by 4 kfree's) is used for benchmarking. The benchmarking tool was run on a kernel based on linux-next-20210419. The test was run on a CascadeLake server with turbo-boosting disable to reduce run-to-run variation. The small object test exercises mainly the object stock charging and vmstat update code paths. The large object test also exercises the refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code paths. With memory accounting disabled, the run time was 3.130s with both small object big object tests. With memory accounting enabled, both cgroup v1 and v2 showed similar results in the small object test. The performance results of the large object test, however, differed between cgroup v1 and v2. The execution times with the application of various patches in the patchset were: Applied patches Run time Accounting overhead %age 1 %age 2 --------------- -------- ------------------- ------ ------ Small 32-byte object: None 11.634s 8.504s 100.0% 271.7% 1-2 9.425s 6.295s 74.0% 201.1% 1-3 9.708s 6.578s 77.4% 210.2% 1-4 8.062s 4.932s 58.0% 157.6% Large 4k object (v2): None 22.107s 18.977s 100.0% 606.3% 1-2 20.960s 17.830s 94.0% 569.6% 1-3 14.238s 11.108s 58.5% 354.9% 1-4 11.329s 8.199s 43.2% 261.9% Large 4k object (v1): None 36.807s 33.677s 100.0% 1075.9% 1-2 36.648s 33.518s 99.5% 1070.9% 1-3 22.345s 19.215s 57.1% 613.9% 1-4 18.662s 15.532s 46.1% 496.2% N.B. %age 1 = overhead/unpatched overhead %age 2 = overhead/accounting disabled time Patch 2 (vmstat data stock caching) helps in both the small object test and the large v2 object test. It doesn't help much in v1 big object test. Patch 3 (refill_obj_stock improvement) does help the small object test but offer significant performance improvement for the large object test (both v1 and v2). Patch 4 (eliminating irq disable/enable) helps in all test cases. To test for the extreme case, a multi-threaded kmalloc/kfree microbenchmark was run on the 2-socket 48-core 96-thread system with 96 testing threads in the same memcg doing kmalloc+kfree of a 4k object with accounting enabled for 10s. The total number of kmalloc+kfree done in kilo operations per second (kops/s) were as follows: Applied patches v1 kops/s v1 change v2 kops/s v2 change --------------- --------- --------- --------- --------- None 3,520 1.00X 6,242 1.00X 1-2 4,304 1.22X 8,478 1.36X 1-3 4,731 1.34X 418,142 66.99X 1-4 4,587 1.30X 438,838 70.30X With memory accounting disabled, the kmalloc/kfree rate was 1,481,291 kop/s. This test shows how significant the memory accouting overhead can be in some extreme situations. For this multithreaded test, the improvement from patch 2 mainly comes from the conditional atomic xchg of objcg->nr_charged_bytes in mod_objcg_state(). By using an unconditional xchg, the operation rates were similar to the unpatched kernel. Patch 3 elminates the single highly contended cacheline of objcg->nr_charged_bytes for cgroup v2 leading to a huge performance improvement. Cgroup v1, however, still has another highly contended cacheline in the shared page counter &memcg->kmem. So the improvement is only modest. Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as eliminating the irq_disable/irq_enable overhead seems to aggravate the cacheline contention. [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/ This patch (of 4): mod_objcg_state() is moved from mm/slab.h to mm/memcontrol.c so that further optimization can be done to it in later patches without exposing unnecessary details to other mm components. Link: https://lkml.kernel.org/r/20210506150007.16288-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210506150007.16288-2-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
To check whether all pages and shadow entries in swap cache has been removed before swap cache is freed. Link: https://lkml.kernel.org/r/20210608005121.511140-1-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
With commit 09854ba9 ("mm: do_wp_page() simplification"), after COW, the idle swap cache page (neither the page nor the corresponding swap entry is mapped by any process) will be left in the LRU list, even if it's in the active list or the head of the inactive list. So, the page reclaimer may take quite some overhead to reclaim these actually unused pages. To help the page reclaiming, in this patch, after COW, the idle swap cache page will be tried to be freed. To avoid to introduce much overhead to the hot COW code path, a) there's almost zero overhead for non-swap case via checking PageSwapCache() firstly. b) the page lock is acquired via trylock only. To test the patch, we used pmbench memory accessing benchmark with working-set larger than available memory on a 2-socket Intel server with a NVMe SSD as swap device. Test results shows that the pmbench score increases up to 23.8% with the decreased size of swap cache and swapin throughput. Link: https://lkml.kernel.org/r/20210601053143.1380078-1-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> [use free_swap_cache()] Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
Before commit c10d38cc ("mm, swap: bounds check swap_info array accesses to avoid NULL derefs"), the typical code to reference the swap_info[] is as follows, type = swp_type(swp_entry); if (type >= nr_swapfiles) /* handle invalid swp_entry */; p = swap_info[type]; /* access fields of *p. OOPS! p may be NULL! */ Because the ordering isn't guaranteed, it's possible that swap_info[type] is read before "nr_swapfiles". And that may result in NULL pointer dereference. So after commit c10d38cc, the code becomes, struct swap_info_struct *swap_type_to_swap_info(int type) { if (type >= READ_ONCE(nr_swapfiles)) return NULL; smp_rmb(); return READ_ONCE(swap_info[type]); } /* users */ type = swp_type(swp_entry); p = swap_type_to_swap_info(type); if (!p) /* handle invalid swp_entry */; /* dereference p */ Where the value of swap_info[type] (that is, "p") is checked to be non-zero before being dereferenced. So, the NULL deferencing becomes impossible even if "nr_swapfiles" is read after swap_info[type]. Therefore, the "smp_rmb()" becomes unnecessary. And, we don't even need to read "nr_swapfiles" here. Because the non-zero checking for "p" is sufficient. We just need to make sure we will not access out of the boundary of the array. With the change, nr_swapfiles will only be accessed with swap_lock held, except in swapcache_free_entries(). Where the absolute correctness of the value isn't needed, as described in the comments. We still need to guarantee swap_info[type] is read before being dereferenced. That can be satisfied via the data dependency ordering enforced by READ_ONCE(swap_info[type]). This needs to be paired with proper write barriers. So smp_store_release() is used in alloc_swap_info() to guarantee the fields of *swap_info[type] is initialized before swap_info[type] itself being written. Note that the fields of *swap_info[type] is initialized to be 0 via kvzalloc() firstly. The assignment and deferencing of swap_info[type] is like rcu_assign_pointer() and rcu_dereference(). Link: https://lkml.kernel.org/r/20210520073301.1676294-1-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Paul McKenney <paulmck@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-