- 11 2月, 2015 8 次提交
-
-
由 Kirill A. Shutemov 提交于
Nobody uses it anymore. [akpm@linux-foundation.org: fix filemap_xip.c] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
We don't create non-linear mappings anymore. Let's drop code which handles them on page fault. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
We have remap_file_pages(2) emulation in -mm tree for few release cycles and we plan to have it mainline in v3.20. This patchset removes rest of VM_NONLINEAR infrastructure. Patches 1-8 take care about generic code. They are pretty straight-forward and can be applied without other of patches. Rest patches removes pte_file()-related stuff from architecture-specific code. It usually frees up one bit in non-present pte. I've tried to reuse that bit for swap offset, where I was able to figure out how to do that. For obvious reason I cannot test all that arch-specific code and would like to see acks from maintainers. In total, remap_file_pages(2) required about 1.4K lines of not-so-trivial kernel code. That's too much for functionality nobody uses. Tested-by: NFelipe Balbi <balbi@ti.com> This patch (of 38): We don't create non-linear mappings anymore. Let's drop code which handles them on unmap/zap. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
remap_file_pages(2) was invented to be able efficiently map parts of huge file into limited 32-bit virtual address space such as in database workloads. Nonlinear mappings are pain to support and it seems there's no legitimate use-cases nowadays since 64-bit systems are widely available. Let's drop it and get rid of all these special-cased code. The patch replaces the syscall with emulation which creates new VMA on each remap_file_pages(), unless they it can be merged with an adjacent one. I didn't find *any* real code that uses remap_file_pages(2) to test emulation impact on. I've checked Debian code search and source of all packages in ALT Linux. No real users: libc wrappers, mentions in strace, gdb, valgrind and this kind of stuff. There are few basic tests in LTP for the syscall. They work just fine with emulation. To test performance impact, I've written small test case which demonstrate pretty much worst case scenario: map 4G shmfs file, write to begin of every page pgoff of the page, remap pages in reverse order, read every page. The test creates 1 million of VMAs if emulation is in use, so I had to set vm.max_map_count to 1100000 to avoid -ENOMEM. Before: 23.3 ( +- 4.31% ) seconds After: 43.9 ( +- 0.85% ) seconds Slowdown: 1.88x I believe we can live with that. Test case: #define _GNU_SOURCE #include <assert.h> #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #define MB (1024UL * 1024) #define SIZE (4096 * MB) int main(int argc, char **argv) { unsigned long *p; long i, pass; for (pass = 0; pass < 10; pass++) { p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); if (p == MAP_FAILED) { perror("mmap"); return -1; } for (i = 0; i < SIZE / 4096; i++) p[i * 4096 / sizeof(*p)] = i; for (i = 0; i < SIZE / 4096; i++) { if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096, 0, (SIZE - 4096 * (i + 1)) >> 12, 0)) { perror("remap_file_pages"); return -1; } } for (i = SIZE / 4096 - 1; i >= 0; i--) assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1); munmap(p, SIZE); } return 0; } [akpm@linux-foundation.org: fix spello] [sasha.levin@oracle.com: initialize populate before usage] [sasha.levin@oracle.com: grab file ref to prevent race while mmaping] Signed-off-by: N"Kirill A. Shutemov" <kirill@shutemov.name> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Armin Rigo <arigo@tunes.org> Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
CONFIG_COMPACTION=y, CONFIG_DEBUG_FS=n: mm/vmstat.c:690: warning: 'frag_start' defined but not used mm/vmstat.c:702: warning: 'frag_next' defined but not used mm/vmstat.c:710: warning: 'frag_stop' defined but not used mm/vmstat.c:715: warning: 'walk_zones_in_node' defined but not used It's all a bit of a tangly mess and it's unclear why CONFIG_COMPACTION figures in there at all. Move frag_start/frag_next/frag_stop and migratetype_names[] into the existing CONFIG_PROC_FS block. walk_zones_in_node() gets a special ifdef. Also move the #include lines up to where #include lines live. [axel.lin@ingics.com: fix build error when !CONFIG_PROC_FS] Signed-off-by: NAxel Lin <axel.lin@ingics.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vaishali Thakkar 提交于
Here, free memory is allocated using kmem_cache_zalloc. So, use kmem_cache_free instead of kfree. This is done using Coccinelle and semantic patch used is as follows: @@ expression x,E,c; @@ x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...) ... when != x = E when != &x ?-kfree(x) +kmem_cache_free(c,x) Signed-off-by: NVaishali Thakkar <vthakkar1994@gmail.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kim Phillips 提交于
Signed-off-by: NKim Phillips <kim.phillips@freescale.com> Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
We had to insert a preempt enable/disable in the fastpath a while ago in order to guarantee that tid and kmem_cache_cpu are retrieved on the same cpu. It is the problem only for CONFIG_PREEMPT in which scheduler can move the process to other cpu during retrieving data. Now, I reach the solution to remove preempt enable/disable in the fastpath. If tid is matched with kmem_cache_cpu's tid after tid and kmem_cache_cpu are retrieved by separate this_cpu operation, it means that they are retrieved on the same cpu. If not matched, we just have to retry it. With this guarantee, preemption enable/disable isn't need at all even if CONFIG_PREEMPT, so this patch removes it. I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns) Below is the result of Christoph's slab_test reported by Jesper Dangaard Brouer. * Before Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles 2. Kmalloc: alloc/free test 10000 times kmalloc(8)/kfree -> 68 cycles 10000 times kmalloc(16)/kfree -> 68 cycles 10000 times kmalloc(32)/kfree -> 69 cycles 10000 times kmalloc(64)/kfree -> 68 cycles 10000 times kmalloc(128)/kfree -> 68 cycles 10000 times kmalloc(256)/kfree -> 68 cycles 10000 times kmalloc(512)/kfree -> 74 cycles 10000 times kmalloc(1024)/kfree -> 75 cycles 10000 times kmalloc(2048)/kfree -> 74 cycles 10000 times kmalloc(4096)/kfree -> 74 cycles 10000 times kmalloc(8192)/kfree -> 75 cycles 10000 times kmalloc(16384)/kfree -> 510 cycles * After Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles 2. Kmalloc: alloc/free test 10000 times kmalloc(8)/kfree -> 65 cycles 10000 times kmalloc(16)/kfree -> 66 cycles 10000 times kmalloc(32)/kfree -> 65 cycles 10000 times kmalloc(64)/kfree -> 66 cycles 10000 times kmalloc(128)/kfree -> 66 cycles 10000 times kmalloc(256)/kfree -> 71 cycles 10000 times kmalloc(512)/kfree -> 72 cycles 10000 times kmalloc(1024)/kfree -> 71 cycles 10000 times kmalloc(2048)/kfree -> 71 cycles 10000 times kmalloc(4096)/kfree -> 71 cycles 10000 times kmalloc(8192)/kfree -> 65 cycles 10000 times kmalloc(16384)/kfree -> 511 cycles Most of the results are better than before. Note that this change slightly worses performance in !CONFIG_PREEMPT, roughly 0.3%. Implementing each case separately would help performance, but, since it's so marginal, I didn't do that. This would help maintanance since we have same code for all cases. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NChristoph Lameter <cl@linux.com> Tested-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 2月, 2015 3 次提交
-
-
由 Michal Hocko 提交于
It has been reported that 965GM might trigger VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage) in mem_cgroup_migrate when shmem wants to replace a swap cache page because of shmem_should_replace_page (the page is allocated from an inappropriate zone). shmem_replace_page expects that the oldpage is not on LRU list and calls mem_cgroup_migrate without lrucare. This is obviously incorrect because swapcache pages might be on the LRU list (e.g. swapin readahead page). Fix this by enabling lrucare for the migration in shmem_replace_page. Also clarify that lrucare should be used even if one of the pages might be on LRU list. The BUG_ON will trigger only when CONFIG_DEBUG_VM is enabled but even without that the migration code might leave the old page on an inappropriate memcg' LRU which is not that critical because the page would get removed with its last reference but it is still confusing. Fixes: 0a31bc97 ("mm: memcontrol: rewrite uncharge API") Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reported-by: NChris Wilson <chris@chris-wilson.co.uk> Reported-by: NDave Airlie <airlied@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.17+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Arnd Bergmann 提交于
The symbol 'high_memory' is provided on both MMU- and NOMMU-kernels, but only one of them is exported, which leads to module build errors in drivers that work fine built-in: ERROR: "high_memory" [drivers/net/virtio_net.ko] undefined! ERROR: "high_memory" [drivers/net/ppp/ppp_mppe.ko] undefined! ERROR: "high_memory" [drivers/mtd/nand/nand.ko] undefined! ERROR: "high_memory" [crypto/tcrypt.ko] undefined! ERROR: "high_memory" [crypto/cts.ko] undefined! This exports the symbol to get these to work on NOMMU as well. Signed-off-by: NArnd Bergmann <arnd@arndb.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NGreg Ungerer <gerg@uclinux.org> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shiraz Hashim 提交于
walk_page_range() silently skips vma having VM_PFNMAP set, which leads to undesirable behaviour at client end (who called walk_page_range). Userspace applications get the wrong data, so the effect is like just confusing users (if the applications just display the data) or sometimes killing the processes (if the applications do something with misunderstanding virtual addresses due to the wrong data.) For example for pagemap_read, when no callbacks are called against VM_PFNMAP vma, pagemap_read may prepare pagemap data for next virtual address range at wrong index. Eventually userspace may get wrong pagemap data for a task. Corresponding to a VM_PFNMAP marked vma region, kernel may report mappings from subsequent vma regions. User space in turn may account more pages (than really are) to the task. In my case I was using procmem, procrack (Android utility) which uses pagemap interface to account RSS pages of a task. Due to this bug it was giving a wrong picture for vmas (with VM_PFNMAP set). Fixes: a9ff785e ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas") Signed-off-by: NShiraz Hashim <shashim@codeaurora.org> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 1月, 2015 2 次提交
-
-
由 Linus Torvalds 提交于
The stack guard page error case has long incorrectly caused a SIGBUS rather than a SIGSEGV, but nobody actually noticed until commit fee7e49d ("mm: propagate error from stack expansion even for guard page") because that error case was never actually triggered in any normal situations. Now that we actually report the error, people noticed the wrong signal that resulted. So far, only the test suite of libsigsegv seems to have actually cared, but there are real applications that use libsigsegv, so let's not wait for any of those to break. Reported-and-tested-by: NTakashi Iwai <tiwai@suse.de> Tested-by: NJan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a "you should SIGSEGV" error, because the SIGSEGV case was generally handled by the caller - usually the architecture fault handler. That results in lots of duplication - all the architecture fault handlers end up doing very similar "look up vma, check permissions, do retries etc" - but it generally works. However, there are cases where the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV. In particular, when accessing the stack guard page, libsigsegv expects a SIGSEGV. And it usually got one, because the stack growth is handled by that duplicated architecture fault handler. However, when the generic VM layer started propagating the error return from the stack expansion in commit fee7e49d ("mm: propagate error from stack expansion even for guard page"), that now exposed the existing VM_FAULT_SIGBUS result to user space. And user space really expected SIGSEGV, not SIGBUS. To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those duplicate architecture fault handlers about it. They all already have the code to handle SIGSEGV, so it's about just tying that new return value to the existing code, but it's all a bit annoying. This is the mindless minimal patch to do this. A more extensive patch would be to try to gather up the mostly shared fault handling logic into one generic helper routine, and long-term we really should do that cleanup. Just from this patch, you can generally see that most architectures just copied (directly or indirectly) the old x86 way of doing things, but in the meantime that original x86 model has been improved to hold the VM semaphore for shorter times etc and to handle VM_FAULT_RETRY and other "newer" things, so it would be a good idea to bring all those improvements to the generic case and teach other architectures about them too. Reported-and-tested-by: NTakashi Iwai <tiwai@suse.de> Tested-by: NJan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 1月, 2015 3 次提交
-
-
由 Michael S. Tsirkin 提交于
for_each_zone_zonelist_nodemask wants an enum zone_type argument, but is passed gfp_t: mm/vmscan.c:2658:9: expected int enum zone_type [signed] highest_zoneidx mm/vmscan.c:2658:9: got restricted gfp_t [usertype] gfp_mask mm/vmscan.c:2658:9: warning: incorrect type in argument 2 (different base types) mm/vmscan.c:2658:9: expected int enum zone_type [signed] highest_zoneidx mm/vmscan.c:2658:9: got restricted gfp_t [usertype] gfp_mask convert argument to the correct type. Signed-off-by: NMichael S. Tsirkin <mst@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Greg Thelen 提交于
Commit e61734c5 ("cgroup: remove cgroup->name") added two extra newlines to memcg oom kill log messages. This makes dmesg hard to read and parse. The issue affects 3.15+. Example: Task in /t <<< extra #1 killed as a result of limit of /t <<< extra #2 memory: usage 102400kB, limit 102400kB, failcnt 274712 Remove the extra newlines from memcg oom kill messages, so the messages look like: Task in /t killed as a result of limit of /t memory: usage 102400kB, limit 102400kB, failcnt 240649 Fixes: e61734c5 ("cgroup: remove cgroup->name") Signed-off-by: NGreg Thelen <gthelen@google.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The OOM killing invocation does a lot of duplicative checks against the task's allocation context. Rework it to take advantage of the existing checks in the allocator slowpath. The OOM killer is invoked when the allocator is unable to reclaim any pages but the allocation has to keep looping. Instead of having a check for __GFP_NORETRY hidden in oom_gfp_allowed(), just move the OOM invocation to the true branch of should_alloc_retry(). The __GFP_FS check from oom_gfp_allowed() can then be moved into the OOM avoidance branch in __alloc_pages_may_oom(), along with the PF_DUMPCORE test. __alloc_pages_may_oom() can then signal to the caller whether the OOM killer was invoked, instead of requiring it to duplicate the order and high_zoneidx checks to guess this when deciding whether to continue. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 1月, 2015 1 次提交
-
-
由 Will Deacon 提交于
When batching up address ranges for TLB invalidation, we check tlb->end != 0 to indicate that some pages have actually been unmapped. As of commit f045bbb9 ("mmu_gather: fix over-eager tlb_flush_mmu_free() calling"), we use the same check for freeing these pages in order to avoid a performance regression where we call free_pages_and_swap_cache even when no pages are actually queued up. Unfortunately, the range could have been reset (tlb->end = 0) by tlb_end_vma, which has been shown to cause memory leaks on arm64. Furthermore, investigation into these leaks revealed that the fullmm case on task exit no longer invalidates the TLB, by virtue of tlb->end == 0 (in 3.18, need_flush would have been set). This patch resolves the problem by reverting commit f045bbb9, using instead tlb->local.nr as the predicate for page freeing in tlb_flush_mmu_free and ensuring that tlb->end is initialised to a non-zero value in the fullmm case. Tested-by: NMark Langsdorf <mlangsdo@redhat.com> Tested-by: NDave Hansen <dave@sr71.net> Signed-off-by: NWill Deacon <will.deacon@arm.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 1月, 2015 2 次提交
-
-
由 Konstantin Khlebnikov 提交于
Fix for BUG_ON(anon_vma->degree) splashes in unlink_anon_vmas() ("kernel BUG at mm/rmap.c:399!") caused by commit 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy") Anon_vma_clone() is usually called for a copy of source vma in destination argument. If source vma has anon_vma it should be already in dst->anon_vma. NULL in dst->anon_vma is used as a sign that it's called from anon_vma_fork(). In this case anon_vma_clone() finds anon_vma for reusing. Vma_adjust() calls it differently and this breaks anon_vma reusing logic: anon_vma_clone() links vma to old anon_vma and updates degree counters but vma_adjust() overrides vma->anon_vma right after that. As a result final unlink_anon_vmas() decrements degree for wrong anon_vma. This patch assigns ->anon_vma before calling anon_vma_clone(). Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com> Reported-and-tested-by: NChris Clayton <chris2553@googlemail.com> Reported-and-tested-by: NOded Gabbay <oded.gabbay@amd.com> Reported-and-tested-by: NChih-Wei Huang <cwhuang@android-x86.org> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Daniel Forrest <dan.forrest@ssec.wisc.edu> Cc: Michal Hocko <mhocko@suse.cz> Cc: stable@vger.kernel.org # to match back-porting of 7a3ef208Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
Commit fee7e49d ("mm: propagate error from stack expansion even for guard page") made sure that we return the error properly for stack growth conditions. It also theorized that counting the guard page towards the stack limit might break something, but also said "Let's see if anybody notices". Somebody did notice. Apparently android-x86 sets the stack limit very close to the limit indeed, and including the guard page in the rlimit check causes the android 'zygote' process problems. So this adds the (fairly trivial) code to make the stack rlimit check be against the actual real stack size, rather than the size of the vma that includes the guard page. Reported-and-tested-by: NChih-Wei Huang <cwhuang@android-x86.org> Cc: Jay Foad <jay.foad@gmail.com> Cc: stable@kernel.org # to match back-porting of fee7e49dSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 1月, 2015 6 次提交
-
-
由 Vlastimil Babka 提交于
Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing to sleep. Their analysis found the cause to be a combination of several factors: 1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait 2. The process has been killed (by OOM in this case), but has not yet been scheduled to remove itself from the waitqueue and die. 3. kswapd checks for throttled processes in prepare_kswapd_sleep(): if (waitqueue_active(&pgdat->pfmemalloc_wait)) { wake_up(&pgdat->pfmemalloc_wait); return false; // kswapd will not go to sleep } However, for a process that was already killed, wake_up() does not remove the process from the waitqueue, since try_to_wake_up() checks its state first and returns false when the process is no longer waiting. 4. kswapd is running on the same CPU as the only CPU that the process is allowed to run on (through cpus_allowed, or possibly single-cpu system). 5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd encounters no voluntary preemption points and repeatedly fails prepare_kswapd_sleep(), blocking the process from running and removing itself from the waitqueue, which would let kswapd sleep. So, the source of the problem is that we prevent kswapd from going to sleep until there are processes waiting on the pfmemalloc_wait queue, and a process waiting on a queue is guaranteed to be removed from the queue only when it gets scheduled. This was done to make sure that no process is left sleeping on pfmemalloc_wait when kswapd itself goes to sleep. However, it isn't necessary to postpone kswapd sleep until the pfmemalloc_wait queue actually empties. To prevent processes from being left sleeping, it's actually enough to guarantee that all processes waiting on pfmemalloc_wait queue have been woken up by the time we put kswapd to sleep. This patch therefore fixes this issue by substituting 'wake_up' with 'wake_up_all' and removing 'return false' in the code snippet from prepare_kswapd_sleep() above. Note that if any process puts itself in the queue after this waitqueue_active() check, or after the wake up itself, it means that the process will also wake up kswapd - and since we are under prepare_to_wait(), the wake up won't be missed. Also we update the comment prepare_kswapd_sleep() to hopefully more clearly describe the races it is preventing. Fixes: 5515061d ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage") Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NRik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [3.6+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
We are supposed to take one css reference per each memory page and per each swap entry accounted to a memory cgroup. However, during task charges migration we take a reference to the destination cgroup twice per each swap entry: first in mem_cgroup_do_precharge()->try_charge() and then in mem_cgroup_move_swap_account(), permanently leaking the destination cgroup. The hunk taking the second reference seems to be a leftover from the pre-00501b53 ("mm: memcontrol: rewrite charge API") era. Remove it to fix the leak. Fixes: e8ea14cc (mm: memcontrol: take a css reference for each charged page) Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Commit 3e32cb2e ("mm: memcontrol: lockless page counters") accidentally switched the soft limit default from infinity to zero, which turns all memcgs with even a single page into soft limit excessors and engages soft limit reclaim on all of them during global memory pressure. This makes global reclaim generally more aggressive, but also inverts the meaning of existing soft limit configurations where unset soft limits are usually more generous than set ones. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NVladimir Davydov <vdavydov@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
These are obsolete since commit e30825f1 ("mm/debug-pagealloc: prepare boottime configurable") was merged. So remove them. [pebolle@tiscali.nl: find obsolete Kconfig options] Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Paul Bolle <pebolle@tiscali.nl> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Tejun, while reviewing the code, spotted the following race condition between the dirtying and truncation of a page: __set_page_dirty_nobuffers() __delete_from_page_cache() if (TestSetPageDirty(page)) page->mapping = NULL if (PageDirty()) dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); if (page->mapping) account_page_dirtied(page) __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE. Dirtiers usually lock out truncation, either by holding the page lock directly, or in case of zap_pte_range(), by pinning the mapcount with the page table lock held. The notable exception to this rule, though, is do_wp_page(), for which this race exists. However, do_wp_page() already waits for a locked page to unlock before setting the dirty bit, in order to prevent a race where clear_page_dirty() misses the page bit in the presence of dirty ptes. Upgrade that wait to a fully locked set_page_dirty() to also cover the situation explained above. Afterwards, the code in set_page_dirty() dealing with a truncation race is no longer needed. Remove it. Reported-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
Constantly forking task causes unlimited grow of anon_vma chain. Each next child allocates new level of anon_vmas and links vma to all previous levels because pages might be inherited from any level. This patch adds heuristic which decides to reuse existing anon_vma instead of forking new one. It adds counter anon_vma->degree which counts linked vmas and directly descending anon_vmas and reuses anon_vma if counter is lower than two. As a result each anon_vma has either vma or at least two descending anon_vmas. In such trees half of nodes are leafs with alive vmas, thus count of anon_vmas is no more than two times bigger than count of vmas. This heuristic reuses anon_vmas as few as possible because each reuse adds false aliasing among vmas and rmap walker ought to scan more ptes when it searches where page is might be mapped. Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu Fixes: 5beb4930 ("mm: change anon_vma linking to fix multi-process server scalability issue") [akpm@linux-foundation.org: fix typo, per Rik] Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com> Reported-by: NDaniel Forrest <dan.forrest@ssec.wisc.edu> Tested-by: NMichal Hocko <mhocko@suse.cz> Tested-by: NJerome Marchand <jmarchan@redhat.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.34+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 1月, 2015 2 次提交
-
-
由 Linus Torvalds 提交于
Jay Foad reports that the address sanitizer test (asan) sometimes gets confused by a stack pointer that ends up being outside the stack vma that is reported by /proc/maps. This happens due to an interaction between RLIMIT_STACK and the guard page: when we do the guard page check, we ignore the potential error from the stack expansion, which effectively results in a missing guard page, since the expected stack expansion won't have been done. And since /proc/maps explicitly ignores the guard page (commit d7824370: "mm: fix up some user-visible effects of the stack guard page"), the stack pointer ends up being outside the reported stack area. This is the minimal patch: it just propagates the error. It also effectively makes the guard page part of the stack limit, which in turn measn that the actual real stack is one page less than the stack limit. Let's see if anybody notices. We could teach acct_stack_growth() to allow an extra page for a grow-up/grow-down stack in the rlimit test, but I don't want to add more complexity if it isn't needed. Reported-and-tested-by: NJay Foad <jay.foad@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pranith Kumar 提交于
SRCU is not necessary to be compiled by default in all cases. For tinification efforts not compiling SRCU unless necessary is desirable. The current patch tries to make compiling SRCU optional by introducing a new Kconfig option CONFIG_SRCU which is selected when any of the components making use of SRCU are selected. If we do not select CONFIG_SRCU, srcu.o will not be compiled at all. text data bss dec hex filename 2007 0 0 2007 7d7 kernel/rcu/srcu.o Size of arch/powerpc/boot/zImage changes from text data bss dec hex filename 831552 64180 23944 919676 e087c arch/powerpc/boot/zImage : before 829504 64180 23952 917636 e0084 arch/powerpc/boot/zImage : after so the savings are about ~2000 bytes. Signed-off-by: NPranith Kumar <bobby.prani@gmail.com> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com> CC: Josh Triplett <josh@joshtriplett.org> CC: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
-
- 30 12月, 2014 1 次提交
-
-
由 Michal Hocko 提交于
Commit 2457aec6 ("mm: non-atomically mark page accessed during page cache allocation where possible") has added a separate parameter for specifying gfp mask for radix tree allocations. Not only this is less than optimal from the API point of view because it is error prone, it is also buggy currently because grab_cache_page_write_begin is using GFP_KERNEL for radix tree and if fgp_flags doesn't contain FGP_NOFS (mostly controlled by fs by AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared then the radix tree allocation wouldn't obey the restriction and might recurse into filesystem and cause deadlocks. This is the case for most filesystems unfortunately because only ext4 and gfs2 are using AOP_FLAG_NOFS. Let's simply remove radix_gfp_mask parameter because the allocation context is same for both page cache and for the radix tree. Just make sure that the radix tree gets only the sane subset of the mask (e.g. do not pass __GFP_WRITE). Long term it is more preferable to convert remaining users of AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this interface even further. Reported-by: NDave Chinner <david@fromorbit.com> Signed-off-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 12月, 2014 1 次提交
-
-
由 Kirill A. Shutemov 提交于
This reverts commit c8475d14. There are several[1][2] of bug reports which points to this commit as potential cause[3]. Let's revert it until we figure out what's going on. [1] https://lkml.org/lkml/2014/11/14/342 [2] https://lkml.org/lkml/2014/12/22/213 [3] https://lkml.org/lkml/2014/12/9/741Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: NSasha Levin <sasha.levin@oracle.com> Acked-by: NDavidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 12月, 2014 4 次提交
-
-
由 Ganesh Mahendran 提交于
Currently functions in zsmalloc.c does not arranged in a readable and reasonable sequence. With the more and more functions added, we may meet below inconvenience. For example: Current functions: void zs_init() { } static void get_maxobj_per_zspage() { } Then I want to add a func_1() which is called from zs_init(), and this new added function func_1() will used get_maxobj_per_zspage() which is defined below zs_init(). void func_1() { get_maxobj_per_zspage() } void zs_init() { func_1() } static void get_maxobj_per_zspage() { } This will cause compiling issue. So we must add a declaration: static void get_maxobj_per_zspage(); before func_1() if we do not put get_maxobj_per_zspage() before func_1(). In addition, puting module_[init|exit] functions at the bottom of the file conforms to our habit. So, this patch ajusts function sequence as: /* helper functions */ ... obj_location_to_handle() ... /* Some exported functions */ ... zs_map_object() zs_unmap_object() zs_malloc() zs_free() zs_init() zs_exit() Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com> Cc: Nitin Gupta <ngupta@vflare.org> Acked-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
Belatedly document the changes in commit f0c6d4d2 ("mm: introduce do_shared_fault() and drop do_fault()"). Cc: Andi Kleen <ak@linux.intel.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pintu Kumar 提交于
When the system boots up, in the dmesg logs we can see the memory statistics along with total reserved as below. Memory: 458840k/458840k available, 65448k reserved, 0K highmem When CMA is enabled, still the total reserved memory remains the same. However, the CMA memory is not considered as reserved. But, when we see /proc/meminfo, the CMA memory is part of free memory. This creates confusion. This patch corrects the problem by properly subtracting the CMA reserved memory from the total reserved memory in dmesg logs. Below is the dmesg snapshot from an arm based device with 512MB RAM and 12MB single CMA region. Before this change: Memory: 458840k/458840k available, 65448k reserved, 0K highmem After this change: Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem Signed-off-by: NPintu Kumar <pintu.k@samsung.com> Signed-off-by: NVishnu Pratap Singh <vishnu.ps@samsung.com> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhihui Zhang 提交于
When nodes is true, nsc->mask2 has already been filtered by nsc->mask1, which has already factored in node_states[N_MEMORY]. Signed-off-by: NZhihui Zhang <zzhsuny@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 12月, 2014 2 次提交
-
-
由 Christian Borntraeger 提交于
ACCESS_ONCE does not work reliably on non-scalar types. For example gcc 4.6 and 4.7 might remove the volatile tag for such accesses during the SRA (scalar replacement of aggregates) step (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145) Let's change the code to access the page table elements with READ_ONCE that does implicit scalar accesses for the gup code. mm_find_pmd is tricky, because m68k and sparc(32bit) define pmd_t as array of longs. This code requires just that the pmd_present and pmd_trans_huge check are done on the same value, so a barrier is sufficent. A similar case is in handle_pte_fault. On ppc44x the word size is 32 bit, but a pte is 64 bit. A barrier is ok as well. Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Cc: linux-mm@kvack.org Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Linus Torvalds 提交于
Dave Hansen reports that commit fb7332a9 ("mmu_gather: move minimal range calculations into generic code") caused a performance problem: "tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch applied, but does not show up at all on the commit before" and the reason is that Will moved the test for whether we need to flush from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(). But that meant that tlb_flush_mmu_free() basically lost that check. Move it back into tlb_flush_mmu() where it belongs, so that it covers both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free(). Reported-and-tested-by: NDave Hansen <dave@sr71.net> Acked-by: NWill Deacon <will.deacon@arm.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 12月, 2014 2 次提交
-
-
由 Al Viro 提交于
the only instance this method has ever grown was one in kernfs - one that call ->migrate() of another vm_ops if it exists. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 14 12月, 2014 3 次提交
-
-
由 Pavel Emelyanov 提交于
There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning. So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag). To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data. So, to make restore work we need to make sure that a) ring is mapped at desired virtual address b) kioctx->user_id matches this value Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values. Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped. What do you think? Signed-off-by: NPavel Emelyanov <xemul@parallels.com> Acked-by: NDmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
-
由 Thierry Reding 提交于
kmemleak will add allocations as objects to a pool. The memory allocated for each object in this pool is periodically searched for pointers to other allocated objects. This only works for memory that is mapped into the kernel's virtual address space, which happens not to be the case for most CMA regions. Furthermore, CMA regions are typically used to store data transferred to or from a device and therefore don't contain pointers to other objects. Without this, the kernel crashes on the first execution of the scan_gray_list() because it tries to access highmem. Perhaps a more appropriate fix would be to reject any object that can't map to a kernel virtual address? [akpm@linux-foundation.org: add comment] [akpm@linux-foundation.org: fix comment, per Catalin] [sfr@canb.auug.org.au: include linux/io.h for phys_to_virt()] Signed-off-by: NThierry Reding <treding@nvidia.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
If we fail to allocate from the current node's stock, we look for free objects on other nodes before calling the page allocator (see get_any_partial). While checking other nodes we respect cpuset constraints by calling cpuset_zone_allowed. We enforce hardwall check. As a result, we will fallback to the page allocator even if there are some pages cached on other nodes, but the current cpuset doesn't have them set. However, the page allocator uses softwall check for kernel allocations, so it may allocate from one of the other nodes in this case. Therefore we should use softwall cpuset check in get_any_partial to conform with the cpuset check in the page allocator. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NZefan Li <lizefan@huawei.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-