- Feb 7, 2018: 3 commits
-
-
Submitted by Dmitry Vyukov
Patch series "kasan: detect invalid frees". KASAN detects double-frees, but does not detect invalid-frees (when a pointer into a middle of heap object is passed to free). We recently had a very unpleasant case in crypto code which freed an inner object inside of a heap allocation. This left unnoticed during free, but totally corrupted heap and later lead to a bunch of random crashes all over kernel code. Detect invalid frees. This patch (of 5): Detect frees of pointers into middle of large heap objects. I dropped const from kasan_kfree_large() because it starts propagating through a bunch of functions in kasan_report.c, slab/slub nearest_obj(), all of their local variables, fixup_red_left(), etc. Link: http://lkml.kernel.org/r/1b45b4fe1d20fc0de1329aab674c1dd973fee723.1514378558.git.dvyukov@google.comSigned-off-by: NDmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>a Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Alexander Potapenko
As a code-size optimization, LLVM builds since r279383 may bulk-manipulate the shadow region when (un)poisoning large memory blocks. This requires new callbacks that simply do an uninstrumented memset(). This fixes linking the Clang-built kernel when using KASAN.

[arnd@arndb.de: add declarations for internal functions]
Link: http://lkml.kernel.org/r/20180105094112.2690475-1-arnd@arndb.de
[fengguang.wu@intel.com: __asan_set_shadow_00 can be static]
Link: http://lkml.kernel.org/r/20171223125943.GA74341@lkp-ib03
[ghackmann@google.com: fix memset() parameters, and tweak commit message to describe new callbacks]
Link: http://lkml.kernel.org/r/20171204191735.132544-6-paullawrence@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Greg Hackmann <ghackmann@google.com>
Signed-off-by: Paul Lawrence <paullawrence@google.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
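A sketch of what one of these callbacks amounts to (simplified; the in-tree mm/kasan code generates the whole __asan_set_shadow_XX family from a macro, so treat this as illustrative):

    void __asan_set_shadow_00(const void *addr, size_t size)
    {
    	/* plain, uninstrumented memset of the shadow range;
    	 * 0x00 marks the region as fully addressable */
    	__memset((void *)addr, 0x00, size);
    }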
-
Submitted by Paul Lawrence
Clang's AddressSanitizer implementation adds redzones on either side of alloca()ed buffers. These redzones are 32-byte aligned and at least 32 bytes long.

__asan_alloca_poison() is passed the size and address of the allocated buffer, *excluding* the redzones on either side. The left redzone will always be to the immediate left of this buffer; but AddressSanitizer may need to add padding between the end of the buffer and the right redzone. If there are any 8-byte chunks inside this padding, we should poison those too.

__asan_allocas_unpoison() is just passed the top and bottom of the dynamic stack area, so unpoisoning is simpler.

Link: http://lkml.kernel.org/r/20171204191735.132544-4-paullawrence@google.com
Signed-off-by: Greg Hackmann <ghackmann@google.com>
Signed-off-by: Paul Lawrence <paullawrence@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- Feb 3, 2018: 1 commit
-
-
Submitted by Roman Gushchin
This patch effectively reverts commit 9f1c2674 ("net: memcontrol: defer call to mem_cgroup_sk_alloc()"). Moving mem_cgroup_sk_alloc() to inet_csk_accept() completely breaks memcg socket memory accounting, as packets received before memcg pointer initialization are not accounted and are causing refcounting underflow on socket release.

Actually the free-after-use problem was fixed by commit c0576e39 ("net: call cgroup_sk_alloc() earlier in sk_clone_lock()") for the cgroup pointer.

So, let's revert it and call mem_cgroup_sk_alloc() just before cgroup_sk_alloc(). This is safe, as we hold a reference to the socket we're cloning, and it holds a reference to the memcg.

Also, let's drop the BUG_ON(mem_cgroup_is_root()) check from mem_cgroup_sk_alloc(). I see no reason why bumping the root memcg counter is a good reason to panic, and there are no realistic ways to hit it.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- Feb 1, 2018: 36 commits
-
-
Submitted by Randy Dunlap
Fix some basic kernel-doc notation in mm/swap.c:

- for function lru_cache_add_anon(), make its kernel-doc function name match its function name and change colon to hyphen following the function name
- for function pagevec_lookup_entries(), change the function parameter name from nr_pages to nr_entries since that is more descriptive of what the parameter actually is and then it matches the kernel-doc comments also

Fix function kernel-doc to match the change in commit 67fd707f:

- drop the kernel-doc notation for @nr_pages from pagevec_lookup_range() and correct the function description for that change

Link: http://lkml.kernel.org/r/3b42ee3e-04a9-a6ca-6be4-f00752a114fe@infradead.org
Fixes: 67fd707f ("mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Michal Hocko
Bharata has noticed that onlining newly added memory doesn't increase the total memory, pointing to commit f7f99100 ("mm: stop zeroing memory during allocation in vmemmap") as the culprit. This commit changed the way the memory for memmaps is initialized and moved it from the allocation time to the initialization time. This works properly for the early memmap init path.

It doesn't work for memory hotplug, though, because we need to mark the page as reserved when the sparsemem section is created and later initialize it completely during onlining. memmap_init_zone is called in the early stage of onlining. With the current code it calls __init_single_page and as such it clears up the whole state and therefore online_pages_range skips those pages.

Fix this by skipping mm_zero_struct_page in __init_single_page for the memory hotplug path. This is quite ugly, but unifying both the early init and memory hotplug init paths is a large project. Make sure we plug the regression at least.

Link: http://lkml.kernel.org/r/20180130101141.GW21609@dhcp22.suse.cz
Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by William Kucharski
There are multiple comments surrounding do_fault_around that mention fault_around_pages() and fault_around_mask(), two routines that do not exist. These comments should be reworded to reference fault_around_bytes, the value which is used to determine how much do_fault_around() will attempt to read when processing a fault.

These comments should have been updated when fault_around_pages() and fault_around_mask() were removed in commit aecd6f44 ("mm: close race between do_fault_around() and fault_around_bytes_set()").

Fixes: aecd6f44 ("mm: close race between do_fault_around() and fault_around_bytes_set()")
Link: http://lkml.kernel.org/r/302D0B14-C7E9-44C6-8BED-033F9ACBD030@oracle.com
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Henry Willard
Workloads consisting of a large number of processes running the same program with a very large shared data segment may experience performance problems when numa balancing attempts to migrate the shared cow pages. This manifests itself with many processes or tasks in TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.

The program listed below simulates the conditions with these results when run with 288 processes on a 144 core/8 socket machine.

    Average throughput      Average throughput      Average throughput
    with numa_balancing=0   with numa_balancing=1   with numa_balancing=1
                            without the patch       with the patch
    ---------------------   ---------------------   ---------------------
    2118782                 2021534                 2107979

Complex production environments show less variability and fewer poorly performing outliers accompanied with a smaller number of processes waiting on NUMA page migration with this patch applied. In some cases, %iowait drops from 16%-26% to 0.

    // SPDX-License-Identifier: GPL-2.0
    /*
     * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
     */
    #include <sys/time.h>
    #include <stdio.h>
    #include <wait.h>
    #include <sys/mman.h>

    int a[1000000] = {13};

    int main(int argc, const char **argv)
    {
    	int n = 0;
    	int i;
    	pid_t pid;
    	int stat;
    	int *count_array;
    	int cpu_count = 288;
    	long total = 0;
    	struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};

    	if (argc > 2)
    		cpu_count = atoi(argv[2]);

    	count_array = mmap(NULL, cpu_count * sizeof(int),
    			   (PROT_READ|PROT_WRITE), (MAP_SHARED|MAP_ANONYMOUS),
    			   0, 0);
    	if (count_array == MAP_FAILED) {
    		perror("mmap:");
    		return 0;
    	}

    	for (i = 0; i < cpu_count; ++i) {
    		pid = fork();
    		if (pid <= 0)
    			break;
    		if ((i & 0xf) == 0)
    			usleep(2);
    	}

    	if (pid != 0) {
    		if (i == 0) {
    			perror("fork:");
    			return 0;
    		}

    		for (;;) {
    			pid = wait(&stat);
    			if (pid < 0)
    				break;
    		}

    		for (i = 0; i < cpu_count; ++i)
    			total += count_array[i];

    		printf("Total %ld\n", total);
    		munmap(count_array, cpu_count * sizeof(int));
    		return 0;
    	}

    	gettimeofday(&t1, 0);
    	timeradd(&t1, &t2, &t1);
    	while (timercmp(&t2, &t1, <)) {
    		int b = 0;
    		int j;

    		for (j = 0; j < 1000000; j++)
    			b += a[j];
    		gettimeofday(&t2, 0);
    		n++;
    	}
    	count_array[i] = n;
    	return 0;
    }

This patch changes change_pte_range() to skip shared copy-on-write pages when called from change_prot_numa().

NOTE: change_prot_numa() is nominally called from task_numa_work() and queue_pages_test_walk(). task_numa_work() is the auto NUMA balancing path, and queue_pages_test_walk() is part of explicit NUMA policy management. However, queue_pages_test_walk() only calls change_prot_numa() when MPOL_MF_LAZY is specified and currently that is not allowed, so change_prot_numa() is only called from auto NUMA balancing.

In the case of explicit NUMA policy management, shared pages are not migrated unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL depends on CAP_SYS_NICE. Currently, there is no way to pass information about MPOL_MF_MOVE_ALL to change_pte_range. This will have to be fixed if MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy migration mode.

task_numa_work() skips the read-only VMAs of programs and shared libraries.

Link: http://lkml.kernel.org/r/1516751617-7369-1-git-send-email-henry.willard@oracle.com
Signed-off-by: Henry Willard <henry.willard@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
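For reference, the heart of the change is a small filter in change_pte_range() on the NUMA-balancing path; roughly (a simplified sketch, see the patch for the exact hunk):

    if (prot_numa) {
    	struct page *page = vm_normal_page(vma, addr, oldpte);

    	if (!page || PageKsm(page))
    		continue;

    	/* Skip shared copy-on-write pages: migrating them only causes
    	 * the TASK_UNINTERRUPTIBLE pile-ups described above. */
    	if (is_cow_mapping(vma->vm_flags) && page_mapcount(page) != 1)
    		continue;
    }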
-
Submitted by Michal Hocko
Dan Carpenter has noticed that the mbind migration callback (new_page) can get a NULL vma pointer and choke on it inside alloc_huge_page_vma, which relies on the VMA to get the hstate. We used to BUG_ON this case, but the BUG_ON has been removed recently by "hugetlb, mempolicy: fix the mbind hugetlb migration".

The proper way to handle this is to get the hstate from the migrated page and rely on huge_node (resp. get_vma_policy) to do the right thing with a NULL VMA. We are currently falling back to the default mempolicy in that case, which is in line with what the THP path is doing here.

Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Michal Hocko
do_mbind migration code relies on alloc_huge_page_noerr for hugetlb pages. alloc_huge_page_noerr uses alloc_huge_page, which is a high-level allocation function that has to take care of reserves, overcommit and hugetlb cgroup accounting. None of that is really required for the page migration because the new page is only temporary and will either replace the original page or be dropped. This is essentially the same as other migration call paths and there shouldn't be any reason to handle mbind in a special way.

The current implementation is even suboptimal because the migration might fail just because the hugetlb cgroup limit is reached, or the overcommit is saturated.

Fix this by making mbind like other hugetlb migration paths. Add a new migration helper alloc_huge_page_vma as a wrapper around alloc_huge_page_nodemask with additional mempolicy handling. alloc_huge_page_noerr has no more users and it can go.

Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
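The new helper is roughly the following (a simplified sketch of the wrapper described above):

    struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
    				 unsigned long address)
    {
    	struct mempolicy *mpol;
    	nodemask_t *nodemask;
    	struct page *page;
    	gfp_t gfp_mask = htlb_alloc_mask(h);
    	int node;

    	/* resolve the vma's mempolicy into a preferred node and nodemask... */
    	node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
    	/* ...and defer to the common nodemask-based allocator */
    	page = alloc_huge_page_nodemask(h, node, nodemask);
    	mpol_cond_put(mpol);

    	return page;
    }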
-
Submitted by Michal Hocko
The hugetlb allocator has several layers of allocation functions depending on the purpose of the allocation. There are two allocators depending on whether the page can be allocated from the page allocator or whether we need a contiguous allocator. This is currently opencoded in alloc_fresh_huge_page, which is the only path that might allocate giga pages, which require the latter allocator. Create alloc_fresh_huge_page, which hides this implementation detail, and use it in all callers which hardcoded the buddy allocator path (__hugetlb_alloc_buddy_huge_page). This shouldn't introduce any functional change because both migration and surplus allocators exclude giga pages explicitly.

While we are at it let's do some renaming. The current scheme is not consistent and overly painful to read and understand. Get rid of prefix underscores from most functions. There is no real reason to make names longer.

* alloc_fresh_huge_page is the new layer to abstract the underlying allocator
* __hugetlb_alloc_buddy_huge_page becomes the shorter and neater alloc_buddy_huge_page
* the former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put the new page directly into the pool
* alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code as it uses alloc_fresh_huge_page now
* others lose their excessive prefix underscores to make names shorter

[dan.carpenter@oracle.com: fix double unlock bug in alloc_surplus_huge_page()]
Link: http://lkml.kernel.org/r/20180109200559.g3iz5kvbdrz7yydp@mwanda
Link: http://lkml.kernel.org/r/20180103093213.26329-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Michal Hocko
alloc_surplus_huge_page increases the pool size and the number of surplus pages opportunistically to prevent races with pool size changes. See commit d1c3fb1f ("hugetlb: introduce nr_overcommit_hugepages sysctl") for more details.

The resulting code is unnecessarily hairy, causes code duplication and doesn't allow the allocation paths to be shared. Moreover, pool size changes tend to be very seldom, so optimizing for them is not really reasonable. Simplify the code and allow a fresh surplus page to be allocated as long as we are under the overcommit limit, then recheck the condition after the allocation and drop the new page if the situation has changed. This should provide a reasonable guarantee that abrupt allocation requests will not go way off the limit.

If we consider races with the pool shrinking and enlarging then we should be reasonably safe as well. In the first case we are off by one in the worst case and the second case should work OK because the page is not yet visible. We can waste CPU cycles for the allocation but that should be acceptable for a relatively rare condition.

Link: http://lkml.kernel.org/r/20180103093213.26329-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
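The allocate-then-recheck flow described above looks roughly like this (a simplified sketch, not the exact function body):

    spin_lock(&hugetlb_lock);
    if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
    	spin_unlock(&hugetlb_lock);
    	return NULL;			/* already at the overcommit limit */
    }
    spin_unlock(&hugetlb_lock);

    page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
    if (!page)
    	return NULL;

    spin_lock(&hugetlb_lock);
    /* Recheck: the limit may have been lowered while we allocated. */
    if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
    	spin_unlock(&hugetlb_lock);
    	put_page(page);			/* drop the no-longer-wanted page */
    	return NULL;
    }
    h->surplus_huge_pages++;
    h->surplus_huge_pages_node[page_to_nid(page)]++;
    spin_unlock(&hugetlb_lock);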
-
Submitted by Michal Hocko
Hugepage migration relies on __alloc_buddy_huge_page to get a new page. This has two main disadvantages.

1) it doesn't allow any huge page to be migrated if the pool is used completely, which is not an exceptional case as the pool is static and unused memory is just wasted.

2) it leads to a weird semantic when migration between two numa nodes might increase the pool size of the destination NUMA node while the page is in use. The issue is caused by per-NUMA-node surplus page tracking (see free_huge_page).

Address both issues by changing the way we allocate and account pages allocated for migration. Those should be temporary by definition. So we mark them that way (we will abuse page flags in the 3rd page) and update free_huge_page to free such pages to the page allocator. The page migration path then just transfers the temporary status from the new page to the old one, which will be freed on the last reference. The global surplus count will never change during this path but we still have to be careful when migrating a per-node surplus page. This is now handled in move_hugetlb_state, which is called from the migration path; it copies the hugetlb-specific page state and fixes up the accounting when needed.

Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better reflect its purpose. The new allocation routine for the migration path is __alloc_migrate_huge_page.

The user visible effect of this patch is that migrated pages are really temporary and they travel between NUMA nodes as per the migration request:

Before migration

    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

After

    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

With the previous implementation, both nodes would have nr_hugepages:1 until the page is freed.

Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Michal Hocko
Gigantic hugetlb pages were grafted onto the hugetlb code as an alien species with a lot of special casing. The allocation path is not an exception. Unnecessarily so, to be honest. It is true that the underlying allocator is different, but that is an implementation detail.

This patch unifies the hugetlb allocation paths that prepare fresh pool pages. alloc_fresh_gigantic_page basically copies the alloc_fresh_huge_page logic, so we can move everything there. This will simplify set_max_huge_pages, which doesn't have to care about what kind of huge page we allocate.

Link: http://lkml.kernel.org/r/20180103093213.26329-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Michal Hocko
Patch series "mm, hugetlb: allocation API and migration improvements" Motivation: this is a follow up for [3] for the allocation API and [4] for the hugetlb migration. It wasn't really easy to split those into two separate patch series as they share some code. My primary motivation to touch this code is to make the gigantic pages migration working. The giga pages allocation code is just too fragile and hacked into the hugetlb code now. This series tries to move giga pages closer to the first class citizen. We are not there yet but having 5 patches is quite a lot already and it will already make the code much easier to follow. I will come with other changes on top after this sees some review. The first two patches should be trivial to review. The third patch changes the way how we migrate huge pages. Newly allocated pages are a subject of the overcommit check and they participate surplus accounting which is quite unfortunate as the changelog explains. This patch doesn't change anything wrt. giga pages. Patch #4 removes the surplus accounting hack from __alloc_surplus_huge_page. I hope I didn't miss anything there and a deeper review is really due there. Patch #5 finally unifies allocation paths and giga pages shouldn't be any special anymore. There is also some renaming going on as well. This patch (of 6): hugetlb allocator has two entry points to the page allocator - alloc_fresh_huge_page_node - __hugetlb_alloc_buddy_huge_page The two differ very subtly in two aspects. The first one doesn't care about HTLB_BUDDY_* stats and it doesn't initialize the huge page. prep_new_huge_page is not used because it not only initializes hugetlb specific stuff but because it also put_page and releases the page to the hugetlb pool which is not what is required in some contexts. This makes things more complicated than necessary. Simplify things by a) removing the page allocator entry point duplicity and only keep __hugetlb_alloc_buddy_huge_page and b) make prep_new_huge_page more reusable by removing the put_page which moves the page to the allocator pool. All current callers are updated to call put_page explicitly. Later patches will add new callers which won't need it. This patch shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20180103093213.26329-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Reale <ar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Andrey Ryabinin
mem_cgroup_resize_[memsw]_limit() tries to free only 32 (SWAP_CLUSTER_MAX) pages on each iteration. This makes it practically impossible to decrease the limit of a memory cgroup. Tasks could easily allocate back 32 pages, so we can't reduce memory usage, and once retry_count reaches zero we return -EBUSY.

It is easy to reproduce the problem by running the following commands:

    mkdir /sys/fs/cgroup/memory/test
    echo $$ >> /sys/fs/cgroup/memory/test/tasks
    cat big_file > /dev/null &
    sleep 1 && echo $((100*1024*1024)) > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    -bash: echo: write error: Device or resource busy

Instead of relying on retry_count, keep retrying the reclaim until the desired limit is reached, or fail if the reclaim doesn't make any progress or a signal is pending.

Link: http://lkml.kernel.org/r/20180119132544.19569-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Christopher Díaz Riveros
Fix the following sparse warning:

    mm/memcontrol.c:1097:14: warning: symbol 'memcg1_stats' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20180118193327.14200-1-chrisadr@gentoo.org
Signed-off-by: Christopher Díaz Riveros <chrisadr@gentoo.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Ralph Campbell
The variable 'entry' is used before being initialized in hmm_vma_walk_pmd(). There is no bad effect (besides a performance hit): !non_swap_entry(0) evaluates to true, which triggers a fault as if the CPU were trying to access migrated memory, and memory is migrated back from device memory to regular memory.

This function (hmm_vma_walk_pmd()) is called when a device driver tries to populate its own page table. For migrated memory it should not happen, as the device driver should already have populated its page table correctly during the migration. The only case I can think of is multi-GPU, where a second GPU triggers migration back to regular memory. Again, this would just result in a performance hit; nothing bad would happen.

Link: http://lkml.kernel.org/r/20180122185759.26286-1-jglisse@redhat.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Petr Tesarik
The comment is confusing. On the one hand, it refers to 32-bit alignment (struct page alignment on 32-bit platforms), but this would only guarantee that the 2 lowest bits must be zero. On the other hand, it claims that at least 3 bits are available, and 3 bits are actually used.

This is not broken, because there is a stronger alignment guarantee, just less obvious. Let's fix the comment to make it clear how many bits are available and why.

Although memmap arrays are allocated in various places, the resulting pointer is encoded eventually, so I am adding a BUG_ON() here to enforce at runtime that all expected bits are indeed available. I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is sufficient, because this part of the calculation can be easily checked at build time.

[ptesarik@suse.com: v2]
Link: http://lkml.kernel.org/r/20180125100516.589ea6af@ezekiel.suse.cz
Link: http://lkml.kernel.org/r/20180119080908.3a662e6f@ezekiel.suse.cz
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Yang Shi
"mode" argument is not used by try_to_compact_pages() and sub functions anymore, it has been replaced by "prio". Fix the comment to explain the use of "prio" argument. Link: http://lkml.kernel.org/r/1515801336-20611-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Oscar Salvador
static struct page_ext_operations *page_ext_ops[] always contains debug_guardpage_ops:

    static struct page_ext_operations *page_ext_ops[] = {
    	&debug_guardpage_ops,
    #ifdef CONFIG_PAGE_OWNER
    	&page_owner_ops,
    #endif
    	...
    }

but for it to work, CONFIG_DEBUG_PAGEALLOC must be enabled first. If someone has CONFIG_PAGE_EXTENSION, but has none of its users, e.g. (CONFIG_PAGE_OWNER, CONFIG_DEBUG_PAGEALLOC, CONFIG_IDLE_PAGE_TRACKING), we can shrink page_ext_init() to a simple retq.

    $ size vmlinux (before patch)
        text      data      bss       dec       hex   filename
    14356698   5681582  1687748  21726028   14b834c   vmlinux

    $ size vmlinux (after patch)
        text      data      bss       dec       hex   filename
    14356008   5681538  1687748  21725294   14b806e   vmlinux

On the other hand, it might not even make sense, since if someone enables CONFIG_PAGE_EXTENSION, I would expect them to also enable at least one of its users.

Link: http://lkml.kernel.org/r/20180105130235.GA21241@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jaewon Kim <jaewon31.kim@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Nick Desaulniers
Fix warning about shifting unsigned literals being undefined behavior.

Link: http://lkml.kernel.org/r/1515642078-4259-1-git-send-email-nick.desaulniers@gmail.com
Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nick Desaulniers <nick.desaulniers@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Oscar Salvador
Remove two redundant assignments in init_pages_in_zone().

[osalvador@techadventures.net: v3]
Link: http://lkml.kernel.org/r/20180117124513.GA876@techadventures.net
[akpm@linux-foundation.org: coding style tweaks]
Link: http://lkml.kernel.org/r/20180110084355.GA22822@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Shile Zhang
Link: http://lkml.kernel.org/r/1515485774-4768-1-git-send-email-zhangshile@gmail.com
Signed-off-by: Shile Zhang <zhangshile@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Yu Zhao
mem_cgroup_resize_limit() and mem_cgroup_resize_memsw_limit() have identical logic. Refactor the code so we don't need to keep two pieces of code that do the same thing.

Link: http://lkml.kernel.org/r/20180108224238.14583-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Yu Zhao
We waste sizeof(swp_entry_t) for the zswap header when using zsmalloc as the zpool driver because zsmalloc doesn't support eviction.

Add zpool_evictable() to detect if a zpool is potentially evictable, and use it in zswap to avoid wasting memory on the zswap header.

[yuzhao@google.com: The "zpool->" prefix is a result of copy & paste]
Link: http://lkml.kernel.org/r/20180110225626.110330-1-yuzhao@google.com
Link: http://lkml.kernel.org/r/20180110224741.83751-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
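The effect in zswap's store path can be sketched as follows (simplified; variable names such as zhdr, compressed_buf and compressed_len are illustrative, not the exact in-tree identifiers):

    /* only evictable pools need the swp_entry_t header in front of the
     * compressed data; zsmalloc-backed pools skip it entirely */
    unsigned long hlen = zpool_evictable(pool) ? sizeof(struct zswap_header) : 0;
    char *dst = zpool_map_handle(pool, handle, ZPOOL_MM_WO);

    if (hlen)
    	memcpy(dst, &zhdr, hlen);
    memcpy(dst + hlen, compressed_buf, compressed_len);
    zpool_unmap_handle(pool, handle);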
-
Submitted by shidao.ytt
During our recent testing with fadvise(FADV_DONTNEED), we found that if the given offset/length is not page-aligned, the last page will not be discarded. The tool we use is vmtouch (https://hoytech.com/vmtouch/): we map a 10KB-sized file into memory and then try to run this tool to evict the whole file mapping, but the last single page always remains in memory:

    $ ./vmtouch -e test_10K
      Files: 1
      Directories: 0
      Evicted Pages: 3 (12K)
      Elapsed: 2.1e-05 seconds

    $ ./vmtouch test_10K
      Files: 1
      Directories: 0
      Resident Pages: 1/3  4K/12K  33.3%
      Elapsed: 5.5e-05 seconds

However when we test with an older kernel, say 3.10, this problem is gone. So we wondered if this is a regression:

    $ ./vmtouch -e test_10K
      Files: 1
      Directories: 0
      Evicted Pages: 3 (12K)
      Elapsed: 8.2e-05 seconds

    $ ./vmtouch test_10K
      Files: 1
      Directories: 0
      Resident Pages: 0/3  0/12K  0%    <-- partial page also discarded
      Elapsed: 5e-05 seconds

After digging a little bit into this problem, we find it seems not to be a regression. Not discarding the partial page is likely to be on purpose, according to commit 441c228f ("mm: fadvise: document the fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel Gorman. He explained why partial pages should be preserved instead of being discarded when using fadvise(FADV_DONTNEED).

However, the interesting part is that the actual code did NOT work the same way as described: the partial page was still discarded anyway, due to a calculation mistake of `end_index' passed to invalidate_mapping_pages(). This mistake was not fixed until recently, which is why we failed to reproduce our problem on old kernels. The fix was done in commit 18aba41c ("mm/fadvise.c: do not discard partial pages with POSIX_FADV_DONTNEED") by Oleg Drokin.

Back to the original testing, our problem becomes that there is a special case: if the page-unaligned `endbyte' is also the end of file, it is not necessary at all to preserve the last partial page, as we all know no one else will use the rest of it. It should be safe enough to just discard the whole page. So we add an EOF check in this patch.

We also found a possible real world issue in the mainline kernel. Assume such a scenario: a userspace backup application wants to back up a huge number of small files (<4k) at once; the developer might (I guess) want to use fadvise(FADV_DONTNEED) to save memory. However, FADV_DONTNEED won't really happen since the only page mapped is a partial page, and the kernel will preserve it. Our patch also fixes this problem, since we know the endbyte is EOF, so we discard it.

Here is a simple reproducer to reproduce and verify each scenario we described above:

test_fadvise.c
==============================

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
    	int i, fd, ret, len;
    	struct stat buf;
    	void *addr;
    	unsigned char *vec;
    	char *strbuf;
    	ssize_t pagesize = getpagesize();
    	ssize_t filesize;

    	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
    	if (fd < 0)
    		return -1;
    	filesize = strtoul(argv[2], NULL, 10);

    	strbuf = malloc(filesize);
    	memset(strbuf, 42, filesize);
    	write(fd, strbuf, filesize);
    	free(strbuf);
    	fsync(fd);

    	len = (filesize + pagesize - 1) / pagesize;
    	printf("length of pages: %d\n", len);

    	addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
    	if (addr == MAP_FAILED)
    		return -1;

    	ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
    	if (ret < 0)
    		return -1;

    	vec = malloc(len);
    	ret = mincore(addr, filesize, (void *)vec);
    	if (ret < 0)
    		return -1;

    	for (i = 0; i < len; i++)
    		printf("pages[%d]: %x\n", i, vec[i] & 0x1);

    	free(vec);
    	close(fd);

    	return 0;
    }
==============================

Test 1: running on a kernel with commit 18aba41c reverted:

    [root@caspar ~]# uname -r
    4.15.0-rc6.revert+
    [root@caspar ~]# ./test_fadvise file1 1024
    length of pages: 1
    pages[0]: 0    # <-- partial page discarded
    [root@caspar ~]# ./test_fadvise file2 8192
    length of pages: 2
    pages[0]: 0
    pages[1]: 0
    [root@caspar ~]# ./test_fadvise file3 10240
    length of pages: 3
    pages[0]: 0
    pages[1]: 0
    pages[2]: 0    # <-- partial page discarded

Test 2: running on the mainline kernel:

    [root@caspar ~]# uname -r
    4.15.0-rc6+
    [root@caspar ~]# ./test_fadvise test1 1024
    length of pages: 1
    pages[0]: 1    # <-- partial and the only page not discarded
    [root@caspar ~]# ./test_fadvise test2 8192
    length of pages: 2
    pages[0]: 0
    pages[1]: 0
    [root@caspar ~]# ./test_fadvise test3 10240
    length of pages: 3
    pages[0]: 0
    pages[1]: 0
    pages[2]: 1    # <-- partial page not discarded

Test 3: running on a kernel with this patch:

    [root@caspar ~]# uname -r
    4.15.0-rc6.patched+
    [root@caspar ~]# ./test_fadvise test1 1024
    length of pages: 1
    pages[0]: 0    # <-- partial page and EOF, discarded
    [root@caspar ~]# ./test_fadvise test2 8192
    length of pages: 2
    pages[0]: 0
    pages[1]: 0
    [root@caspar ~]# ./test_fadvise test3 10240
    length of pages: 3
    pages[0]: 0
    pages[1]: 0
    pages[2]: 0    # <-- partial page and EOF, discarded

[akpm@linux-foundation.org: tweak code comment]
Link: http://lkml.kernel.org/r/5222da9ee20e1695eaabb69f631f200d6e6b8876.1515132470.git.jinli.zjl@alibaba-inc.com
Signed-off-by: shidao.ytt <shidao.ytt@alibaba-inc.com>
Signed-off-by: Caspar Zhang <jinli.zjl@alibaba-inc.com>
Reviewed-by: Oliver Yang <zhiche.yy@alibaba-inc.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Mel Gorman
Minchan Kim asked the following question: what lock protects the address_space from being destroyed when a race happens between inode truncation and __isolate_lru_page? Jan Kara clarified by describing the race as follows:

    CPU1                                          CPU2

    truncate(inode)                               __isolate_lru_page()
      ...
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                               mapping = page_mapping(page);
                page->mapping = NULL;
                ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false

    - inode now has no pages and can be freed including embedded address_space

                                                  if (mapping && !mapping->a_ops->migratepage)

    - we've dereferenced mapping which is potentially already free.

The race is theoretically possible but unlikely. Before the delete_from_page_cache, truncate_cleanup_page is called, so the page is likely to be !PageDirty or PageWriteback, which gets skipped by the only caller that checks the mapping in __isolate_lru_page. Even if the race occurs, a substantial amount of work has to happen during a tiny window with no preemption, but it could potentially be done using a virtual machine to artificially slow one CPU or halt it during the critical window.

This patch should eliminate the race with truncation by try-locking the page before dereferencing mapping and aborting if the lock was not acquired. There was a suggestion from Huang Ying to use RCU as a side-effect to prevent mapping being freed. However, I do not like the solution as it's an unconventional means of preserving a mapping and it's not a context where rcu_read_lock is obviously protecting rcu data.

Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
Fixes: c8244935 ("mm: compaction: make isolate_lru_page() filter-aware again")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
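The fix in __isolate_lru_page() therefore takes roughly the shape below (a simplified sketch): only dereference page->mapping while holding the page lock, and give up on isolating the page if the lock cannot be taken.

    if (mode & ISOLATE_ASYNC_MIGRATE) {
    	bool migrate_dirty;

    	/* pin the page so truncation cannot free its address_space while
    	 * we look at it; if it is already locked, just skip the page */
    	if (!trylock_page(page))
    		return ret;

    	mapping = page_mapping(page);
    	migrate_dirty = !mapping || mapping->a_ops->migratepage;
    	unlock_page(page);
    	if (!migrate_dirty)
    		return ret;
    }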
-
Submitted by Marc-André Lureau
Adapt add_seals()/get_seals() to work with hugetlbfs-backed memory.

Teach memfd_create() to allow sealing operations on MFD_HUGETLB.

Link: http://lkml.kernel.org/r/20171107122800.25517-6-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Marc-André Lureau
Those functions are called for memfd files, backed by shmem or hugetlb (the next patches will handle hugetlb).

Link: http://lkml.kernel.org/r/20171107122800.25517-3-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Marc-André Lureau
Patch series "memfd: add sealing to hugetlb-backed memory", v3. Recently, Mike Kravetz added hugetlbfs support to memfd. However, he didn't add sealing support. One of the reasons to use memfd is to have shared memory sealing when doing IPC or sharing memory with another process with some extra safety. qemu uses shared memory & hugetables with vhost-user (used by dpdk), so it is reasonable to use memfd now instead for convenience and security reasons. This patch (of 9): The functions are called through shmem_fcntl() only. And no danger in removing the EXPORTs as the routines only work with shmem file structs. Link: http://lkml.kernel.org/r/20171107122800.25517-2-marcandre.lureau@redhat.comSigned-off-by: NMarc-André Lureau <marcandre.lureau@redhat.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Aliaksei Karaliou
Structure zs_pool has a special flag to indicate success of shrinker initialization. unregister_shrinker() has been improved and can detect by itself whether actual deinitialization should be performed or not, so the extra flag becomes redundant.

[akpm@linux-foundation.org: update comment (Aliaksei), remove unneeded cast]
Link: http://lkml.kernel.org/r/1513680552-9798-1-git-send-email-akaraliou.dev@gmail.com
Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by David Rientjes
This uses the new annotation to determine if an mm has mmu notifiers with blockable invalidate range callbacks to avoid oom reaping. Otherwise, the callbacks are used around unmap_page_range().

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141330120.74052@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dimitri Sivanich <sivanich@hpe.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by David Rientjes
Commit 4d4bbd85 ("mm, oom_reaper: skip mm structs with mmu notifiers") prevented the oom reaper from unmapping private anonymous memory with the oom reaper when the oom victim mm had mmu notifiers registered.

The rationale is that doing mmu_notifier_invalidate_range_{start,end}() around the unmap_page_range(), which is needed, can block and the oom killer will stall forever waiting for the victim to exit, which may not be possible without reaping.

That concern is real, but only true for mmu notifiers that have blockable invalidate_range_{start,end}() callbacks. This patch adds a "flags" field to mmu notifier ops that can set a bit to indicate that these callbacks do not block.

The implementation is steered toward an expensive slowpath, such as after the oom reaper has grabbed mm->mmap_sem of a still alive oom victim.

[rientjes@google.com: mmu_notifier_invalidate_range_end() can also call the invalidate_range() must not block, fix comment]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801091339570.240101@chino.kir.corp.google.com
[akpm@linux-foundation.org: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141329500.74052@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Dimitri Sivanich <sivanich@hpe.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
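The helper used by the oom reaper patch above can be sketched like this: iterate the registered notifiers and report whether any of them left the new flag unset (a simplified sketch of the idea described in the changelog, not the verbatim implementation):

    bool mm_has_blockable_invalidate_notifiers(struct mm_struct *mm)
    {
    	struct mmu_notifier *mn;
    	bool ret = false;
    	int id;

    	id = srcu_read_lock(&srcu);
    	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
    		if (!mn->ops->invalidate_range &&
    		    !mn->ops->invalidate_range_start &&
    		    !mn->ops->invalidate_range_end)
    			continue;

    		/* a notifier that did not declare itself non-blocking
    		 * makes reaping this mm unsafe */
    		if (!(mn->ops->flags & MMU_INVALIDATE_DOES_NOT_BLOCK)) {
    			ret = true;
    			break;
    		}
    	}
    	srcu_read_unlock(&srcu, id);

    	return ret;
    }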
-
Submitted by Yang Shi
In the current design, khugepaged needs to acquire mmap_sem before scanning an mm. But in some corner cases, khugepaged may scan a process which is modifying its memory mapping, so khugepaged blocks in uninterruptible state. The process might hold the mmap_sem for a long time when modifying a huge memory space, and it may trigger the below khugepaged hang issue:

    INFO: task khugepaged:270 blocked for more than 120 seconds.
    Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    khugepaged D 0 270 2 0x00000000
     ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
     ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
     0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
    Call Trace:
     schedule+0x36/0x80
     rwsem_down_read_failed+0xf0/0x150
     call_rwsem_down_read_failed+0x18/0x30
     down_read+0x20/0x40
     khugepaged+0x476/0x11d0
     kthread+0xe6/0x100
     ret_from_fork+0x25/0x30

So it sounds pointless to just block khugepaged waiting for the semaphore. Replace down_read() with down_read_trylock() to move on to scanning the next mm quickly instead of blocking on the semaphore, so that other processes can get more chances to install THP. Then khugepaged can come back to scan the skipped mm when it has finished the current round of full_scan.

And it appears that the change can improve khugepaged efficiency a little bit.

Below is the test result when running LTP on a 24 cores 4GB memory 2 nodes NUMA VM:

                                  pristine    w/ trylock
    full_scan                          197           187
    pages_collapsed                     21            26
    thp_fault_alloc                  40818         44466
    thp_fault_fallback               18413         16679
    thp_collapse_alloc                  21           150
    thp_collapse_alloc_failed           14            16
    thp_file_alloc                     369           369

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: tweak comment]
[arnd@arndb.de: avoid uninitialized variable use]
Link: http://lkml.kernel.org/r/20171215125129.2948634-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1513281203-54878-1-git-send-email-yang.s@alibaba-inc.com
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
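The change itself is small; in the scan loop it amounts to roughly the following (a simplified sketch):

    /* instead of sleeping on a contended mmap_sem with down_read(), bail
     * out and let khugepaged revisit this mm on a later full scan */
    if (unlikely(!down_read_trylock(&mm->mmap_sem)))
    	goto breakouterloop_mmap_sem;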
-
Submitted by Aneesh Kumar K.V
Instead of marking the pmd ready for split, invalidate the pmd. This should take care of the powerpc requirement. The only side effect is that we mark the pmd invalid early. This can result in us blocking access to the page a bit longer if we race against a thp split.

[kirill.shutemov@linux.intel.com: rebased, dirty THP once]
Link: http://lkml.kernel.org/r/20171213105756.69879-13-kirill.shutemov@linux.intel.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Daney <david.daney@cavium.com>
Cc: David Miller <davem@davemloft.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Kirill A. Shutemov
Use the modified pmdp_invalidate(), which returns the previous value of the pmd, to transfer dirty and accessed bits.

Link: http://lkml.kernel.org/r/20171213105756.69879-12-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Daney <david.daney@cavium.com>
Cc: David Miller <davem@davemloft.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Kirill A. Shutemov
Vlastimil noted that pmdp_invalidate() is not atomic and we can lose dirty and access bits if the CPU sets them after the pmdp dereference, but before set_pmd_at().

The patch changes pmdp_invalidate() to make the entry non-present atomically and return the previous value of the entry. This value can be used to check if the CPU set dirty/accessed bits under us.

The race window is very small and I haven't seen any reports that can be attributed to the bug. For this reason, I don't think backporting to stable trees is needed.

Link: http://lkml.kernel.org/r/20171213105756.69879-11-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Daney <david.daney@cavium.com>
Cc: David Miller <davem@davemloft.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
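With this change the generic fallback looks roughly as follows (a simplified sketch; architectures with stronger requirements provide their own versions):

    pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
    		      pmd_t *pmdp)
    {
    	/* swap in a non-present entry atomically and keep the old value,
    	 * so dirty/accessed bits set by other CPUs are not lost */
    	pmd_t old = pmdp_establish(vma->vm_mm, address, pmdp,
    				   pmd_mknotpresent(*pmdp));
    	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
    	return old;
    }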
-
Submitted by Matthew Wilcox
Several users of unmap_mapping_range() would prefer to express their range in pages rather than bytes. Unfortunately, on a 32-bit kernel, you have to remember to cast your page number to a 64-bit type before shifting it, and four places in the current tree didn't remember to do that. That's a sign of a bad interface.

Conveniently, unmap_mapping_range() actually converts from bytes into pages, so hoist the guts of unmap_mapping_range() into a new function unmap_mapping_pages() and convert the callers which want to use pages.

Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: "zhangyi (F)" <yi.zhang@huawei.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
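With the guts hoisted out, unmap_mapping_range() becomes a thin byte-to-page conversion wrapper, roughly (a simplified sketch): callers that already think in pages can call unmap_mapping_pages() directly and avoid the "forgot to cast before shifting" trap.

    void unmap_mapping_range(struct address_space *mapping,
    			 loff_t const holebegin, loff_t const holelen,
    			 int even_cows)
    {
    	pgoff_t hba = holebegin >> PAGE_SHIFT;
    	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

    	/* guard against pgoff_t overflow on 32-bit kernels */
    	if (sizeof(holelen) > sizeof(hlen)) {
    		long long holeend =
    			(holebegin + holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
    		if (holeend & ~(long long)ULONG_MAX)
    			hlen = ULONG_MAX - hba + 1;
    	}

    	unmap_mapping_pages(mapping, hba, hlen, even_cows);
    }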
-
Submitted by Yisheng Xie
pmd_trans_splitting() was removed after the THP refcounting redesign; therefore, the related comment should be updated.

Link: http://lkml.kernel.org/r/1512625745-59451-1-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-