- 08 4月, 2014 15 次提交
-
-
由 Kirill A. Shutemov 提交于
There's no reason to enable split page table lock if don't have page tables. It also triggers build error at least on ARM since we don't define pmd_page() for !MMU. In file included from arch/arm/kernel/asm-offsets.c:14:0: include/linux/mm.h: In function 'pte_lockptr': include/linux/mm.h:1392:2: error: implicit declaration of function 'pmd_page' [-Werror=implicit-function-declaration] include/linux/mm.h:1392:2: warning: passing argument 1 of 'ptlock_ptr' makes pointer from integer without a cast [enabled by default] include/linux/mm.h:1384:27: note: expected 'struct page *' but argument is of type 'int' Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alex Thorlton 提交于
The main motivation behind this patch is to provide a way to disable THP for jobs where the code cannot be modified, and using a malloc hook with madvise is not an option (i.e. statically allocated data). This patch allows us to do just that, without affecting other jobs running on the system. We need to do this sort of thing for jobs where THP hurts performance, due to the possibility of increased remote memory accesses that can be created by situations such as the following: When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will be handed out, and the THP will be stuck on whatever node the chunk was originally referenced from. If many remote nodes need to do work on that same chunk, they'll be making remote accesses. With THP disabled, 4K pages can be handed out to separate nodes as they're needed, greatly reducing the amount of remote accesses to memory. This patch is based on some of my work combined with some suggestions/patches given by Oleg Nesterov. The main goal here is to add a prctl switch to allow us to disable to THP on a per mm_struct basis. Here's a bit of test data with the new patch in place... First with the flag unset: # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g Setting thp_disabled for this task... thp_disable: 0 Set thp_disabled state to 0 Process pid = 18027 PF/ MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/ TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES 512 1.120 0.060 0.000 0.110 0.110 0.000 28571428864 -9223372036854775808 55803572 23 Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g': 273719072.841402 task-clock # 641.026 CPUs utilized [100.00%] 1,008,986 context-switches # 0.000 M/sec [100.00%] 7,717 CPU-migrations # 0.000 M/sec [100.00%] 1,698,932 page-faults # 0.000 M/sec 355,222,544,890,379 cycles # 1.298 GHz [100.00%] 536,445,412,234,588 stalled-cycles-frontend # 151.02% frontend cycles idle [100.00%] 409,110,531,310,223 stalled-cycles-backend # 115.17% backend cycles idle [100.00%] 148,286,797,266,411 instructions # 0.42 insns per cycle # 3.62 stalled cycles per insn [100.00%] 27,061,793,159,503 branches # 98.867 M/sec [100.00%] 1,188,655,196 branch-misses # 0.00% of all branches 427.001706337 seconds time elapsed Now with the flag set: # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g Setting thp_disabled for this task... thp_disable: 1 Set thp_disabled state to 1 Process pid = 144957 PF/ MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/ TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES 512 0.620 0.260 0.250 0.320 0.570 0.001 51612901376 128000000000 100806448 23 Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g': 138789390.540183 task-clock # 641.959 CPUs utilized [100.00%] 534,205 context-switches # 0.000 M/sec [100.00%] 4,595 CPU-migrations # 0.000 M/sec [100.00%] 63,133,119 page-faults # 0.000 M/sec 147,977,747,269,768 cycles # 1.066 GHz [100.00%] 200,524,196,493,108 stalled-cycles-frontend # 135.51% frontend cycles idle [100.00%] 105,175,163,716,388 stalled-cycles-backend # 71.07% backend cycles idle [100.00%] 180,916,213,503,160 instructions # 1.22 insns per cycle # 1.11 stalled cycles per insn [100.00%] 26,999,511,005,868 branches # 194.536 M/sec [100.00%] 714,066,351 branch-misses # 0.00% of all branches 216.196778807 seconds time elapsed As with previous versions of the patch, We're getting about a 2x performance increase here. Here's a link to the test case I used, along with the little wrapper to activate the flag: http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz This patch (of 3): Revert commit 8e72033f and add in code to fix up any issues caused by the revert. The revert is necessary because hugepage_madvise would return -EINVAL when VM_NOHUGEPAGE is set, which will break subsequent chunks of this patch set. Here's a snip of an e-mail from Gerald detailing the original purpose of this code, and providing justification for the revert: "The intent of commit 8e72033f was to guard against any future programming errors that may result in an madvice(MADV_HUGEPAGE) on guest mappings, which would crash the kernel. Martin suggested adding the bit to arch/s390/mm/pgtable.c, if 8e72033f was to be reverted, because that check will also prevent a kernel crash in the case described above, it will now send a SIGSEGV instead. This would now also allow to do the madvise on other parts, if needed, so it is a more flexible approach. One could also say that it would have been better to do it this way right from the beginning..." Signed-off-by: NAlex Thorlton <athorlton@sgi.com> Suggested-by: NOleg Nesterov <oleg@redhat.com> Tested-by: NChristian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
It is just for clean-up to reduce code size and improve readability. There is no functional change. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
isolation_suitable() and migrate_async_suitable() is used to be sure that this pageblock range is fine to be migragted. It isn't needed to call it on every page. Current code do well if not suitable, but, don't do well when suitable. 1) It re-checks isolation_suitable() on each page of a pageblock that was already estabilished as suitable. 2) It re-checks migrate_async_suitable() on each page of a pageblock that was not entered through the next_pageblock: label, because last_pageblock_nr is not otherwise updated. This patch fixes situation by 1) calling isolation_suitable() only once per pageblock and 2) always updating last_pageblock_nr to the pageblock that was just checked. Additionally, move PageBuddy() check after pageblock unit check, since pageblock check is the first thing we should do and makes things more simple. [vbabka@suse.cz: rephrase commit description] Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
It is odd to drop the spinlock when we scan (SWAP_CLUSTER_MAX - 1) th pfn page. This may results in below situation while isolating migratepage. 1. try isolate 0x0 ~ 0x200 pfn pages. 2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop the spinlock. 3. Then, to complete isolating, retry to aquire the lock. I think that it is better to use SWAP_CLUSTER_MAX th pfn for checking the criteria about dropping the lock. This has no harm 0x0 pfn, because, at this time, locked variable would be false. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
suitable_migration_target() checks that pageblock is suitable for migration target. In isolate_freepages_block(), it is called on every page and this is inefficient. So make it called once per pageblock. suitable_migration_target() also checks if page is highorder or not, but it's criteria for highorder is pageblock order. So calling it once within pageblock range has no problem. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
Purpose of compaction is to get a high order page. Currently, if we find high-order page while searching migration target page, we break it to order-0 pages and use them as migration target. It is contrary to purpose of compaction, so disallow high-order page to be used for migration target. Additionally, clean-up logic in suitable_migration_target() to simplify the code. There is no functional changes from this clean-up. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
We had a report about strange OOM killer strikes on a PPC machine although there was a lot of swap free and a tons of anonymous memory which could be swapped out. In the end it turned out that the OOM was a side effect of zone reclaim which wasn't unmapping and swapping out and so the system was pushed to the OOM. Although this sounds like a bug somewhere in the kswapd vs. zone reclaim vs. direct reclaim interaction numactl on the said hardware suggests that the zone reclaim should not have been set in the first place: node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 node 0 size: 0 MB node 0 free: 0 MB node 2 cpus: node 2 size: 7168 MB node 2 free: 6019 MB node distances: node 0 2 0: 10 40 2: 40 10 So all the CPUs are associated with Node0 which doesn't have any memory while Node2 contains all the available memory. Node distances cause an automatic zone_reclaim_mode enabling. Zone reclaim is intended to keep the allocations local but this doesn't make any sense on the memoryless nodes. So let's exclude such nodes for init_zone_allows_reclaim which evaluates zone reclaim behavior and suitable reclaim_nodes. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com> Tested-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Davidlohr Bueso 提交于
The described issue now occurs inside mmap_region(). And unfortunately is still valid. Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Weijie Yang 提交于
We abort direct reclaim if we find the zone is ready for compaction. Sometimes the zone is just a promoted highmem zone to force a scan of highmem, which is not the intended zone the caller want to allocate a page from. In this situation, setting aborted_reclaim to indicate the caller turned back to retry the allocation is waste of time and could cause a loop in __alloc_pages_slowpath(). This patch does not check compaction_ready() on promoted zones to avoid the above situation. Only set aborted_reclaim if the caller intended zone is ready for compaction. Signed-off-by: NWeijie Yang <weijie.yang@samsung.com> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Weijie Yang 提交于
We promote sc->gfp_mask to __GFP_HIGHMEM to forcibly scan highmem if there are too many buffer_heads pinning highmem. See cc715d99 ("mm: vmscan: forcibly scan highmem if there are too many buffer_heads pinning highmem"). This patch restores sc->gfp_mask to its caller original value after finishing the scan job, to avoid the impact on other invocations from its upper caller, such as vmpressure_prio(), shrink_slab(). Signed-off-by: NWeijie Yang <weijie.yang@samsung.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rik van Riel 提交于
The NUMA scanning code can end up iterating over many gigabytes of unpopulated memory, especially in the case of a freshly started KVM guest with lots of memory. This results in the mmu notifier code being called even when there are no mapped pages in a virtual address range. The amount of time wasted can be enough to trigger soft lockup warnings with very large KVM guests. This patch moves the mmu notifier call to the pmd level, which represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier code is only called from the address in the PMD where present mappings are first encountered. The hugetlbfs code is left alone for now; hugetlb mappings are not relocatable, and as such are left alone by the NUMA code, and should never trigger this problem to begin with. Signed-off-by: NRik van Riel <riel@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Reported-by: NXing Gang <gang.xing@hp.com> Tested-by: NChegu Vinod <chegu_vinod@hp.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Sasha reported the following bug using trinity kernel BUG at mm/mprotect.c:149! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G W 3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105 task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000 RIP: change_protection_range+0x3b3/0x500 Call Trace: change_protection+0x25/0x30 change_prot_numa+0x1b/0x30 task_numa_work+0x279/0x360 task_work_run+0xae/0xf0 do_notify_resume+0x8e/0xe0 retint_signal+0x4d/0x92 The VM_BUG_ON was added in -mm by the patch "mm,numa: reorganize change_pmd_range". The race existed without the patch but was just harder to hit. The problem is that a transhuge check is made without holding the PTL. It's possible at the time of the check that a parallel fault clears the pmd and inserts a new one which then triggers the VM_BUG_ON check. This patch removes the VM_BUG_ON but fixes the race by rechecking transhuge under the PTL when marking page tables for NUMA hinting and bailing if a race occurred. It is not a problem for calls to mprotect() as they hold mmap_sem for write. Signed-off-by: NMel Gorman <mgorman@suse.de> Reported-by: NSasha Levin <sasha.levin@oracle.com> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rik van Riel 提交于
Reorganize the order of ifs in change_pmd_range a little, in preparation for the next patch. [akpm@linux-foundation.org: fix indenting, per David] Signed-off-by: NRik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Reported-by: NXing Gang <gang.xing@hp.com> Tested-by: NChegu Vinod <chegu_vinod@hp.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
huge_pte_offset() could return NULL, so we need NULL check to avoid potential NULL pointer dereferences. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 4月, 2014 1 次提交
-
-
由 Hugh Dickins 提交于
get_user_pages(write=1, force=1) has always had odd behaviour on write- protected shared mappings: although it demands FMODE_WRITE-access to the underlying object (do_mmap_pgoff sets neither VM_SHARED nor VM_MAYWRITE without that), it ends up with do_wp_page substituting private anonymous Copied-On-Write pages for the shared file pages in the area. That was long ago intentional, as a safety measure to prevent ptrace setting a breakpoint (or POKETEXT or POKEDATA) from inadvertently corrupting the underlying executable. Yet exec and dynamic loaders open the file read-only, and use MAP_PRIVATE rather than MAP_SHARED. The traditional odd behaviour still causes surprises and bugs in mm, and is probably not what any caller wants - even the comment on the flag says "You do not want this" (although it's undoubtedly necessary for overriding userspace protections in some contexts, and good when !write). Let's stop doing that. But it would be dangerous to remove the long- standing safety at this stage, so just make get_user_pages(write,force) fail with EFAULT when applied to a write-protected shared area. Infiniband may in future want to force write through to underlying object: we can add another FOLL_flag later to enable that if required. Odd though the old behaviour was, there is no doubt that we may turn out to break userspace with this change, and have to revert it quickly. Issue a WARN_ON_ONCE to help debug the changed case (easily triggered by userspace, so only once to prevent spamming the logs); and delay a few associated cleanups until this change is proved. get_user_pages callers who might see trouble from this change: ptrace poking, or writing to /proc/<pid>/mem drivers/infiniband/ drivers/media/v4l2-core/ drivers/gpu/drm/exynos/exynos_drm_gem.c drivers/staging/tidspbridge/core/tiomap3430.c if they ever apply get_user_pages to write-protected shared mappings of an object which was opened for writing. I went to apply the same change to mm/nommu.c, but retreated. NOMMU has no place for COW, and its VM_flags conventions are not the same: I'd be more likely to screw up NOMMU than make an improvement there. Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 4月, 2014 24 次提交
-
-
由 Raghavendra K T 提交于
Currently max_sane_readahead() returns zero on the cpu whose NUMA node has no local memory which leads to readahead failure. Fix this readahead failure by returning minimum of (requested pages, 512). Users running applications on a memory-less cpu which needs readahead such as streaming application see considerable boost in the performance. Result: fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement. fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB testfile 32GB* 4G RAM numa machine (12 iterations) showed no impact on the normal NUMA cases w/ patch. Kernel Avg Stddev base 7.4975 3.92% patched 7.4174 3.26% [Andrew: making return value PAGE_SIZE independent] Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: NJan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
We release the slab_mutex while calling sysfs_slab_add from __kmem_cache_create since commit 66c4c35c ("slub: Do not hold slub_lock when calling sysfs_slab_add()"), because kobject_uevent called by sysfs_slab_add might block waiting for the usermode helper to exec, which would result in a deadlock if we took the slab_mutex while executing it. However, apart from complicating synchronization rules, releasing the slab_mutex on kmem cache creation can result in a kmemcg-related race. The point is that we check if the memcg cache exists before going to __kmem_cache_create, but register the new cache in memcg subsys after it. Since we can drop the mutex there, several threads can see that the memcg cache does not exist and proceed to creating it, which is wrong. Fortunately, recently kobject_uevent was patched to call the usermode helper with the UMH_NO_WAIT flag, making the deadlock impossible. Therefore there is no point in releasing the slab_mutex while calling sysfs_slab_add, so let's simplify kmem_cache_create synchronization and fix the kmemcg-race mentioned above by holding the slab_mutex during the whole cache creation path. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Greg KH <greg@kroah.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Hansen 提交于
There is plenty of anecdotal evidence and a load of blog posts suggesting that using "drop_caches" periodically keeps your system running in "tip top shape". Perhaps adding some kernel documentation will increase the amount of accurate data on its use. If we are not shrinking caches effectively, then we have real bugs. Using drop_caches will simply mask the bugs and make them harder to find, but certainly does not fix them, nor is it an appropriate "workaround" to limit the size of the caches. On the contrary, there have been bug reports on issues that turned out to be misguided use of cache dropping. Dropping caches is a very drastic and disruptive operation that is good for debugging and running tests, but if it creates bug reports from production use, kernel developers should be aware of its use. Add a bit more documentation about it, a syslog message to track down abusers, and vmstat drop counters to help analyze problem reports. [akpm@linux-foundation.org: checkpatch fixes] [hannes@cmpxchg.org: add runtime suppression control] Signed-off-by: NDave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sasha Levin 提交于
This patch removes read_cache_page_async() which wasn't really needed anywhere and simplifies the code around it a bit. read_cache_page_async() is useful when we want to read a page into the cache without waiting for it to complete. This happens when the appropriate callback 'filler' doesn't complete its read operation and releases the page lock immediately, and instead queues a different completion routine to do that. This never actually happened anywhere in the code. read_cache_page_async() had 3 different callers: - read_cache_page() which is the sync version, it would just wait for the requested read to complete using wait_on_page_read(). - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler function it supplied doesn't do any async reads, and would complete before the filler function returns - making it actually a sync read. - CRAMFS would call it using the read_mapping_page_async() wrapper, with a similar story to JFFS2 - the filler function doesn't do anything that reminds async reads and would always complete before the filler function returns. To sum it up, the code in mm/filemap.c never took advantage of having read_cache_page_async(). While there are filler callbacks that do async reads (such as the block one), we always called it with the read_cache_page(). This patch adds a mandatory wait for read to complete when adding a new page to the cache, and removes read_cache_page_async() and its wrappers. Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
I've realized that there's no need for do_huge_pmd_wp_zero_page_fallback(). We can just split zero page with split_huge_page_pmd() and return VM_FAULT_FALLBACK. handle_pte_fault() will handle write-protection fault for us. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Extract and consolidate code to setup pte from do_read_fault(), do_cow_fault() and do_shared_fault(). Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
There are two functions which need to call vm_ops->page_mkwrite(): do_shared_fault() and do_wp_page(). We can consolidate preparation code. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Introduce do_shared_fault(). The function does what do_fault() does for write faults to shared mappings Unlike do_fault(), do_shared_fault() is relatively clean and straight-forward. Old do_fault() is not needed anymore. Let it die. [lliubbo@gmail.com: fix NULL pointer dereference] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NBob Liu <bob.liu@oracle.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Introduce do_cow_fault(). The function does what do_fault() does for write page faults to private mappings. Unlike do_fault(), do_read_fault() is relatively clean and straight-forward. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Introduce do_read_fault(). The function does what do_fault() does for read page faults. Unlike do_fault(), do_read_fault() is pretty clean and straightforward. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Extract code to vm_ops->do_fault() and basic error handling to separate function. The code will be reused. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Current __do_fault() is awful and unmaintainable. These patches try to sort it out by split __do_fault() into three destinct codepaths: - to handle read page fault; - to handle write page fault to private mappings; - to handle write page fault to shared mappings; I also found page refcount leak in PageHWPoison() path of __do_fault(). This patch (of 7): do_fault() is unused: no reason for underscores. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rashika Kheria 提交于
Mark function as static in nobootmem.c because it is not used outside this file. This eliminates the following warning in mm/nobootmem.c: mm/nobootmem.c:324:15: warning: no previous prototype for `___alloc_bootmem_node' [-Wmissing-prototypes] Signed-off-by: NRashika Kheria <rashika.kheria@gmail.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rashika Kheria 提交于
Mark functions as static in page_cgroup.c because they are not used outside this file. This eliminates the following warning in mm/page_cgroup.c: mm/page_cgroup.c:177:6: warning: no previous prototype for `__free_page_cgroup' [-Wmissing-prototypes] mm/page_cgroup.c:190:15: warning: no previous prototype for `online_page_cgroup' [-Wmissing-prototypes] mm/page_cgroup.c:225:15: warning: no previous prototype for `offline_page_cgroup' [-Wmissing-prototypes] Signed-off-by: NRashika Kheria <rashika.kheria@gmail.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rashika Kheria 提交于
Mark function as static in process_vm_access.c because it is not used outside this file. This eliminates the following warning in mm/process_vm_access.c: mm/process_vm_access.c:416:1: warning: no previous prototype for `compat_process_vm_rw' [-Wmissing-prototypes] [akpm@linux-foundation.org: remove unneeded asmlinkage - compat_process_vm_rw isn't referenced from asm] Signed-off-by: NRashika Kheria <rashika.kheria@gmail.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rashika Kheria 提交于
Mark function as static in mmap.c because they are not used outside this file. This eliminates the following warning in mm/mmap.c: mm/mmap.c:407:6: warning: no previous prototype for `validate_mm' [-Wmissing-prototypes] Signed-off-by: NRashika Kheria <rashika.kheria@gmail.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rashika Kheria 提交于
mark functions as static in memory.c because they are not used outside this file. This eliminates the following warnings in mm/memory.c: mm/memory.c:3530:5: warning: no previous prototype for `numa_migrate_prep' [-Wmissing-prototypes] mm/memory.c:3545:5: warning: no previous prototype for `do_numa_page' [-Wmissing-prototypes] Signed-off-by: NRashika Kheria <rashika.kheria@gmail.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rashika Kheria 提交于
Mark function as static in compaction.c because it is not used outside this file. This eliminates the following warning from mm/compaction.c: mm/compaction.c:1190:9: warning: no previous prototype for `sysfs_compact_node' [-Wmissing-prototypes Signed-off-by: NRashika Kheria <rashika.kheria@gmail.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
Page migration will fail for memory that is pinned in memory with, for example, get_user_pages(). In this case, it is unnecessary to take zone->lru_lock or isolating the page and passing it to page migration which will ultimately fail. This is a racy check, the page can still change from under us, but in that case we'll just fail later when attempting to move the page. This avoids very expensive memory compaction when faulting transparent hugepages after pinning a lot of memory with a Mellanox driver. On a 128GB machine and pinning ~120GB of memory, before this patch we see the enormous disparity in the number of page migration failures because of the pinning (from /proc/vmstat): compact_pages_moved 8450 compact_pagemigrate_failed 15614415 0.05% of pages isolated are successfully migrated and explicitly triggering memory compaction takes 102 seconds. After the patch: compact_pages_moved 9197 compact_pagemigrate_failed 7 99.9% of pages isolated are now successfully migrated in this configuration and memory compaction takes less than one second. Signed-off-by: NDavid Rientjes <rientjes@google.com> Acked-by: NHugh Dickins <hughd@google.com> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rik van Riel <riel@redhat.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
Both prep_compound_huge_page() and prep_compound_gigantic_page() are only called at bootstrap and can be marked as __init. The __SetPageTail(page) in prep_compound_gigantic_page() happening before page->first_page is initialized is not concerning since this is bootstrap. Signed-off-by: NDavid Rientjes <rientjes@google.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: NDavidlohr Bueso <davidlohr@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have shown to benefit from caching in the past. We call the recently usedbut ultimately was not significantly better than a FIFO policy and still thrashed cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size of 50%. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NBob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-