- 28 8月, 2010 1 次提交
-
-
由 Yinghai Lu 提交于
According to node range in early_node_map[] with __memblock_find_in_range to find free range. Will be used by memblock_x86_find_in_range_node() memblock_x86_find_in_range_node will be used to find right buffer for NODE_DATA Signed-off-by: NYinghai Lu <yinghai@kernel.org> Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
- 19 7月, 2010 1 次提交
-
-
由 Dave Chinner 提交于
The current shrinker implementation requires the registered callback to have global state to work from. This makes it difficult to shrink caches that are not global (e.g. per-filesystem caches). Pass the shrinker structure to the callback so that users can embed the shrinker structure in the context the shrinker needs to operate on and get back to it in the callback via container_of(). Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
- 25 5月, 2010 3 次提交
-
-
由 Chris Metcalf 提交于
This ensures that platforms with lowmem PAs above 32 bits work correctly by avoiding truncating the PA during a left shift. Signed-off-by: NChris Metcalf <cmetcalf@tilera.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
This patch is the core of a mechanism which compacts memory in a zone by relocating movable pages towards the end of the zone. A single compaction run involves a migration scanner and a free scanner. Both scanners operate on pageblock-sized areas in the zone. The migration scanner starts at the bottom of the zone and searches for all movable pages within each area, isolating them onto a private list called migratelist. The free scanner starts at the top of the zone and searches for suitable areas and consumes the free pages within making them available for the migration scanner. The pages isolated for migration are then migrated to the newly isolated free pages. [aarcange@redhat.com: Fix unsafe optimisation] [mel@csn.ul.ie: do not schedule work on other CPUs for compaction] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Page migration requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it is possible to leave one behind if the second rmap_walk fails to find the page. If this page is later faulted, migration_entry_to_page() will call BUG because the page is locked indicating the page was migrated by the migration PTE not cleaned up. For example kernel BUG at include/linux/swapops.h:105! invalid opcode: 0000 [#1] PREEMPT SMP ... Call Trace: [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e [<ffffffff813099b5>] page_fault+0x25/0x30 [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b [<ffffffff8111329b>] search_binary_handler+0x173/0x313 [<ffffffff81114896>] do_execve+0x219/0x30a [<ffffffff8100a5c6>] sys_execve+0x43/0x5e [<ffffffff8100320a>] stub_execve+0x6a/0xc0 RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129 There is a race between shift_arg_pages and migration that triggers this bug. A temporary stack is setup during exec and later moved. If migration moves a page in the temporary stack and the VMA is then removed before migration completes, the migration PTE may not be found leading to a BUG when the stack is faulted. This patch causes pages within the temporary stack during exec to be skipped by migration. It does this by marking the VMA covering the temporary stack with an otherwise impossible combination of VMA flags. These flags are cleared when the temporary stack is moved to its final location. [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 4月, 2010 1 次提交
-
-
由 Naoya Horiguchi 提交于
When we look into pagemap using page-types with option -p, the value of pfn for hugepages looks wrong (see below.) This is because pte was evaluated only once for one vma although it should be updated for each hugepage. This patch fixes it. $ page-types -p 3277 -Nl -b huge voffset offset len flags 7f21e8a00 11e400 1 ___U___________H_G________________ 7f21e8a01 11e401 1ff ________________TG________________ ^^^ 7f21e8c00 11e400 1 ___U___________H_G________________ 7f21e8c01 11e401 1ff ________________TG________________ ^^^ One hugepage contains 1 head page and 511 tail pages in x86_64 and each two lines represent each hugepage. Voffset and offset mean virtual address and physical address in the page unit, respectively. The different hugepages should not have the same offset value. With this patch applied: $ page-types -p 3386 -Nl -b huge voffset offset len flags 7fec7a600 112c00 1 ___UD__________H_G________________ 7fec7a601 112c01 1ff ________________TG________________ ^^^ 7fec7a800 113200 1 ___UD__________H_G________________ 7fec7a801 113201 1ff ________________TG________________ ^^^ OK More info: - This patch modifies walk_page_range()'s hugepage walker. But the change only affects pagemap_read(), which is the only caller of hugepage callback. - Without this patch, hugetlb_entry() callback is called per vma, that doesn't match the natural expectation from its name. - With this patch, hugetlb_entry() is called per hugepte entry and the callback can become much simpler. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMatt Mackall <mpm@selenic.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 3月, 2010 1 次提交
-
-
由 Peter Zijlstra 提交于
Support for the PMU's BTS features has been upstreamed in v2.6.32, but we still have the old and disabled ptrace-BTS, as Linus noticed it not so long ago. It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without regard for other uses (perf) and doesn't provide the flexibility needed for perf either. Its users are ptrace-block-step and ptrace-bts, since ptrace-bts was never used and ptrace-block-step can be implemented using a much simpler approach. So axe all 3000 lines of it. That includes the *locked_memory*() APIs in mm/mlock.c as well. Reported-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Markus Metzger <markus.t.metzger@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <20100325135413.938004390@chello.nl> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 13 3月, 2010 2 次提交
-
-
由 Wu Fengguang 提交于
- introduce dump_page() to print the page info for debugging some error condition. - convert three mm users: bad_page(), print_bad_pte() and memory offline failure. - print an extra field: the symbolic names of page->flags Example dump_page() output: [ 157.521694] page:ffffea0000a7cba8 count:2 mapcount:1 mapping:ffff88001c901791 index:0x147 [ 157.525570] page flags: 0x100000000100068(uptodate|lru|active|swapbacked) Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alex Chiang <achiang@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Mel Gorman <mel@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Commit 34e55232 ("mm: avoid false sharing of mm_counter") added sync_mm_rss() for syncing loosely accounted rss counters. It's for CONFIG_MMU but sync_mm_rss is called even in NOMMU enviroment (kerne/exit.c, fs/exec.c). Above commit doesn't handle it well. This patch changes SPLIT_RSS_COUNTING depends on SPLIT_PTLOCKS && CONFIG_MMU And for avoid unnecessary function calls, sync_mm_rss changed to be inlined noop function in header file. Reported-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NMike Frysinger <vapier@gentoo.org> Signed-off-by: NMichal Simek <monstr@monstr.eu> Signed-off-by: NDavid Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 3月, 2010 4 次提交
-
-
由 Rik van Riel 提交于
When a VMA is in an inconsistent state during setup or teardown, the worst that can happen is that the rmap code will not be able to find the page. The mapping is in the process of being torn down (PTEs just got invalidated by munmap), or set up (no PTEs have been instantiated yet). It is also impossible for the rmap code to follow a pointer to an already freed VMA, because the rmap code holds the anon_vma->lock, which the VMA teardown code needs to take before the VMA is removed from the anon_vma chain. Hence, we should not need the VM_LOCK_RMAP locking at all. Signed-off-by: NRik van Riel <riel@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rik van Riel 提交于
The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [akpm@linux-foundation.org: cleanups] Signed-off-by: NRik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Considering the nature of per mm stats, it's the shared object among threads and can be a cache-miss point in the page fault path. This patch adds per-thread cache for mm_counter. RSS value will be counted into a struct in task_struct and synchronized with mm's one at events. Now, in this patch, the event is the number of calls to handle_mm_fault. Per-thread value is added to mm at each 64 calls. rough estimation with small benchmark on parallel thread (2threads) shows [before] 4.5 cache-miss/faults [after] 4.0 cache-miss/faults Anyway, the most contended object is mmap_sem if the number of threads grows. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Presently, per-mm statistics counter is defined by macro in sched.h This patch modifies it to - defined in mm.h as inlinf functions - use array instead of macro's name creation. This patch is for reducing patch size in future patch to modify implementation of per-mm counter. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 2月, 2010 2 次提交
-
-
由 Yinghai Lu 提交于
Add vmemmap_alloc_block_buf for mem map only. It will fallback to the old way if it cannot get a block that big. Before this patch, when a node have 128g ram installed, memmap are split into two parts or more. [ 0.000000] [ffffea0000000000-ffffea003fffffff] PMD -> [ffff880100600000-ffff88013e9fffff] on node 1 [ 0.000000] [ffffea0040000000-ffffea006fffffff] PMD -> [ffff88013ec00000-ffff88016ebfffff] on node 1 [ 0.000000] [ffffea0070000000-ffffea007fffffff] PMD -> [ffff882000600000-ffff8820105fffff] on node 0 [ 0.000000] [ffffea0080000000-ffffea00bfffffff] PMD -> [ffff882010800000-ffff8820507fffff] on node 0 [ 0.000000] [ffffea00c0000000-ffffea00dfffffff] PMD -> [ffff882050a00000-ffff8820709fffff] on node 0 [ 0.000000] [ffffea00e0000000-ffffea00ffffffff] PMD -> [ffff884000600000-ffff8840205fffff] on node 2 [ 0.000000] [ffffea0100000000-ffffea013fffffff] PMD -> [ffff884020800000-ffff8840607fffff] on node 2 [ 0.000000] [ffffea0140000000-ffffea014fffffff] PMD -> [ffff884060a00000-ffff8840709fffff] on node 2 [ 0.000000] [ffffea0150000000-ffffea017fffffff] PMD -> [ffff886000600000-ffff8860305fffff] on node 3 [ 0.000000] [ffffea0180000000-ffffea01bfffffff] PMD -> [ffff886030800000-ffff8860707fffff] on node 3 [ 0.000000] [ffffea01c0000000-ffffea01ffffffff] PMD -> [ffff888000600000-ffff8880405fffff] on node 4 [ 0.000000] [ffffea0200000000-ffffea022fffffff] PMD -> [ffff888040800000-ffff8880707fffff] on node 4 [ 0.000000] [ffffea0230000000-ffffea023fffffff] PMD -> [ffff88a000600000-ffff88a0105fffff] on node 5 [ 0.000000] [ffffea0240000000-ffffea027fffffff] PMD -> [ffff88a010800000-ffff88a0507fffff] on node 5 [ 0.000000] [ffffea0280000000-ffffea029fffffff] PMD -> [ffff88a050a00000-ffff88a0709fffff] on node 5 [ 0.000000] [ffffea02a0000000-ffffea02bfffffff] PMD -> [ffff88c000600000-ffff88c0205fffff] on node 6 [ 0.000000] [ffffea02c0000000-ffffea02ffffffff] PMD -> [ffff88c020800000-ffff88c0607fffff] on node 6 [ 0.000000] [ffffea0300000000-ffffea030fffffff] PMD -> [ffff88c060a00000-ffff88c0709fffff] on node 6 [ 0.000000] [ffffea0310000000-ffffea033fffffff] PMD -> [ffff88e000600000-ffff88e0305fffff] on node 7 [ 0.000000] [ffffea0340000000-ffffea037fffffff] PMD -> [ffff88e030800000-ffff88e0707fffff] on node 7 after patch will get [ 0.000000] [ffffea0000000000-ffffea006fffffff] PMD -> [ffff880100200000-ffff88016e5fffff] on node 0 [ 0.000000] [ffffea0070000000-ffffea00dfffffff] PMD -> [ffff882000200000-ffff8820701fffff] on node 1 [ 0.000000] [ffffea00e0000000-ffffea014fffffff] PMD -> [ffff884000200000-ffff8840701fffff] on node 2 [ 0.000000] [ffffea0150000000-ffffea01bfffffff] PMD -> [ffff886000200000-ffff8860701fffff] on node 3 [ 0.000000] [ffffea01c0000000-ffffea022fffffff] PMD -> [ffff888000200000-ffff8880701fffff] on node 4 [ 0.000000] [ffffea0230000000-ffffea029fffffff] PMD -> [ffff88a000200000-ffff88a0701fffff] on node 5 [ 0.000000] [ffffea02a0000000-ffffea030fffffff] PMD -> [ffff88c000200000-ffff88c0701fffff] on node 6 [ 0.000000] [ffffea0310000000-ffffea037fffffff] PMD -> [ffff88e000200000-ffff88e0701fffff] on node 7 -v2: change buf to vmemmap_buf instead according to Ingo also add CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER according to Ingo -v3: according to Andrew, use sizeof(name) instead of hard coded 15 Signed-off-by: NYinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-19-git-send-email-yinghai@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: NChristoph Lameter <cl@linux-foundation.org> Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
由 Yinghai Lu 提交于
Finally we can use early_res to replace bootmem for x86_64 now. Still can use CONFIG_NO_BOOTMEM to enable it or not. -v2: fix 32bit compiling about MAX_DMA32_PFN -v3: folded bug fix from LKML message below Signed-off-by: NYinghai Lu <yinghai@kernel.org> LKML-Reference: <4B747239.4070907@kernel.org> Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
- 02 2月, 2010 1 次提交
-
-
由 Wu Fengguang 提交于
Move page_is_ram() declaration to mm.h, it makes no sense in <linux/ioport.h>. Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> LKML-Reference: <20100127030639.GD8132@localhost> Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
- 17 1月, 2010 1 次提交
-
-
由 David Howells 提交于
Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen over the end of a truncation. The problem is that ramfs_nommu_check_mappings() checks that the reduced file size against the VMA tree, but not the vm_region tree. The following sequence of events can cause the problem: fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600); ftruncate(fd, 32 * 1024); a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); munmap(a, 32 * 1024); ftruncate(fd, 16 * 1024); c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b' sees that the vm_region from 'a' is covering the region it wants and so shares it, pinning it in memory. Mapping 'a' then goes away and the file is truncated to the end of VMA 'b'. However, the region allocated by 'a' is still in effect, and has _not_ been reduced. Mapping 'c' is then created, and because there's a vm_region covering the desired region, get_unmapped_area() is _not_ called to repeat the check, and the mapping is granted, even though the pages from the latter half of the mapping have been discarded. However: d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'd' should work, and should end up sharing the region allocated by 'a'. To deal with this, we shrink the vm_region struct during the truncation, lest do_mmap_pgoff() take it as licence to share the full region automatically without calling the get_unmapped_area() file op again. Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NAl Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 1月, 2010 1 次提交
-
-
由 Christoph Lameter 提交于
Use the per cpu allocator functionality to avoid per cpu arrays in struct zone. This drastically reduces the size of struct zone for systems with large amounts of processors and allows placement of critical variables of struct zone in one cacheline even on very large systems. Another effect is that the pagesets of one processor are placed near one another. If multiple pagesets from different zones fit into one cacheline then additional cacheline fetches can be avoided on the hot paths when allocating memory from multiple zones. Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs are reduced and we can drop the zone_pcp macro. Hotplug handling is also simplified since cpu alloc can bring up and shut down cpu areas for a specific cpu as a whole. So there is no need to allocate or free individual pagesets. V7-V8: - Explain chicken egg dilemmna with percpu allocator. V4-V5: - Fix up cases where per_cpu_ptr is called before irq disable - Integrate the bootstrap logic that was separate before. tj: Build failure in pageset_cpuup_callback() due to missing ret variable fixed. Reviewed-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NChristoph Lameter <cl@linux-foundation.org> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 17 12月, 2009 1 次提交
-
-
由 Yinghai Lu 提交于
Found one system that boot from socket1 instead of socket0, SRAT get rejected... [ 0.000000] SRAT: Node 1 PXM 0 0-a0000 [ 0.000000] SRAT: Node 1 PXM 0 100000-80000000 [ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000 [ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000 [ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000 [ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000 [ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000 [ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000 [ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000 [ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000 ... [ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040 [ 0.000000] NUMA: Using 20 for the hash shift. [ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used [ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used [ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used [ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used [ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used [ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used [ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used [ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used [ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used [ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used [ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used. [ 0.000000] SRAT: SRAT not used. the early_node_map is not sorted because node0 with non zero start come first. so try to sort it right away after all regions are registered. also fixs refression by 8716273c (x86: Export srat physical topology) -v2: make it more solid to handle cross node case like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g) -v3: update comments. Reported-and-tested-by: NJens Axboe <jens.axboe@oracle.com> Signed-off-by: NYinghai Lu <yinghai@kernel.org> LKML-Reference: <4B2579D2.3010201@kernel.org> Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
- 16 12月, 2009 7 次提交
-
-
由 Andi Kleen 提交于
This is a simpler, gentler variant of memory_failure() for soft page offlining controlled from user space. It doesn't kill anything, just tries to invalidate and if that doesn't work migrate the page away. This is useful for predictive failure analysis, where a page has a high rate of corrected errors, but hasn't gone bad yet. Instead it can be offlined early and avoided. The offlining is controlled from sysfs, including a new generic entry point for hard page offlining for symmetry too. We use the page isolate facility to prevent re-allocation race. Normally this is only used by memory hotplug. To avoid races with memory allocation I am using lock_system_sleep(). This avoids the situation where memory hotplug is about to isolate a page range and then hwpoison undoes that work. This is a big hammer currently, but the simplest solution currently. When the page is not free or LRU we try to free pages from slab and other caches. The slab freeing is currently quite dumb and does not try to focus on the specific slab cache which might own the page. This could be potentially improved later. Thanks to Fengguang Wu and Haicheng Li for some fixes. [Added fix from Andrew Morton to adapt to new migrate_pages prototype] Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-
由 Wu Fengguang 提交于
The unpoisoning interface is useful for stress testing tools to reclaim poisoned pages (to prevent OOM) There is no hardware level unpoisioning, so this cannot be used for real memory errors, only for software injected errors. Note that it may leak pages silently - those who have been removed from LRU cache, but not isolated from page cache/swap cache at hwpoison time. Especially the stress test of dirty swap cache pages shall reboot system before exhausting memory. AK: Fix comments, add documentation, add printks, rename symbol Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-
由 Andi Kleen 提交于
Now that "ref" is just a boolean turn it into a flags argument. First step is only a single flag that makes the code's intention more clear, but more may follow. Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-
由 Andi Kleen 提交于
shake_page handles more types of page caches than lru_drain_all() - per cpu page allocator pages - per CPU LRU Stops early when the page became free. Used in followon patches. Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-
由 Naoya Horiguchi 提交于
This patch enables extraction of the pfn of a hugepage from /proc/pid/pagemap in an architecture independent manner. Details ------- My test program (leak_pagemap) works as follows: - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,) - read()/write() something on it, - call page-types with option -p, - munmap() and unlink() the file on hugetlbfs Without my patches ------------------ $ ./leak_pagemap flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 1 0 __________________________________ 0x0000000000000804 1 0 __R________M______________________ referenced,mmap 0x000000000000086c 81 0 __RU_lA____M______________________ referenced,uptodate,lru,active,mmap 0x0000000000005808 5 0 ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked 0x0000000000005868 12 0 ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 1 0 __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 101 0 The output of page-types don't show any hugepage. With my patches --------------- $ ./leak_pagemap flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000000 1 0 __________________________________ 0x0000000000030000 51100 199 ________________TG________________ compound_tail,huge 0x0000000000028018 100 0 ___UD__________H_G________________ uptodate,dirty,compound_head,huge 0x0000000000000804 1 0 __R________M______________________ referenced,mmap 0x000000000000080c 1 0 __RU_______M______________________ referenced,uptodate,mmap 0x000000000000086c 80 0 __RU_lA____M______________________ referenced,uptodate,lru,active,mmap 0x0000000000005808 4 0 ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked 0x0000000000005868 12 0 ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked 0x000000000000586c 1 0 __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked total 51300 200 The output of page-types shows 51200 pages contributing to hugepages, containing 100 head pages and 51100 tail pages as expected. [akpm@linux-foundation.org: build fix] Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Shijie 提交于
The check code for CONFIG_SWAP is redundant, because there is a non-CONFIG_SWAP version for PageSwapCache() which just returns 0. Signed-off-by: NHuang Shijie <shijie8@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set in page->mapping, with the higher bits a pointer to the anon_vma; and have defined PageKsm(page) as that with NULL anon_vma. But KSM swapping will need to store a pointer there: so in preparation for that, now define PAGE_MAPPING_FLAGS as the low two bits, including PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some other use for the bit emerges). Declare page_rmapping(page) to return the pointer part of page->mapping, and page_anon_vma(page) to return the anon_vma pointer when that's what it is. Use these in a few appropriate places: notably, unuse_vma() has been testing page->mapping, but is better to be testing page_anon_vma() (cases may be added in which flag bits are set without any pointer). Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 9月, 2009 1 次提交
-
-
由 David Howells 提交于
The NOMMU fallback for is_vmalloc_or_module_addr() should be static inline, not just static, in linux/mm.h. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 9月, 2009 2 次提交
-
-
由 Alexey Dobriyan 提交于
It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 npiggin@suse.de 提交于
Introduce new truncate helpers truncate_pagecache and inode_newsize_ok. vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and into mm/truncate.c. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 23 9月, 2009 1 次提交
-
-
由 KAMEZAWA Hiroyuki 提交于
Some archs define MODULED_VADDR/MODULES_END which is not in VMALLOC area. This is handled only in x86-64. This patch make it more generic. And we can use vread/vwrite to access the area. Fix it. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 9月, 2009 7 次提交
-
-
由 Hugh Dickins 提交于
CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make more sense of it by removing CONFIG_TMPFS altogether, always enabling its code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on CONFIG_TMPFS off that we'd better leave that as is. But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off: make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL shmem_acl.o being pointlessly built into the kernel when SHMEM is off. And a selfish change, to prevent the world from being rebuilt when I switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the header files is mm.h shmem_lock() - give that a shmem.c stub instead. Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: NMatt Mackall <mpm@selenic.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
__get_user_pages() has been taking its own GUP flags, then processing them into FOLL flags for follow_page(). Though oddly named, the FOLL flags are more widely used, so pass them to __get_user_pages() now. Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct. (The patch to __get_user_pages() looks peculiar, with both gup_flags and foll_flags: the gup_flags remain constant; but as before there's an exceptional case, out of scope of the patch, in which foll_flags per page have FOLL_WRITE masked off.) Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
The "FOLL_ANON optimization" and its use_zero_page() test have caused confusion and bugs: why does it test VM_SHARED? for the very good but unsatisfying reason that VMware crashed without. As we look to maybe reinstating anonymous use of the ZERO_PAGE, we need to sort this out. Easily done: it's silly for __get_user_pages() and follow_page() to be guessing whether it's safe to assume that they're being used for a coredump (which can take a shortcut snapshot where other uses must handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP. get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine. Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
In preparation for the next patch, add a simple get_dump_page(addr) interface for the CONFIG_ELF_CORE dumpers to use, instead of calling get_user_pages() directly. They're not interested in errors: they just want to use holes as much as possible, to save space and make sure that the data is aligned where the headers said it would be. Oh, and don't use that horrid DUMP_SEEK(off) macro! Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: NRik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Beulich 提交于
Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. Some of the calculations (i.e. those not intending to use high memory) should likely even use (totalram_pages - totalhigh_pages). Signed-off-by: NJan Beulich <jbeulich@novell.com> Acked-by: NRusty Russell <rusty@rustcorp.com.au> Acked-by: NIngo Molnar <mingo@elte.hu> Cc: Dave Airlie <airlied@linux.ie> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
This patch presents the mm interface to a dummy version of ksm.c, for better scrutiny of that interface: the real ksm.c follows later. When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and MADV_UNMERGEABLE with EINVAL, since that seems more helpful than pretending that they can be serviced. But when CONFIG_KSM=y, accept them even if KSM is not currently running, and even on areas which KSM will not touch (e.g. hugetlb or shared file or special driver mappings). Like other madvices, report ENOMEM despite success if any area in the range is unmapped, and use EAGAIN to report out of memory. Define vma flag VM_MERGEABLE to identify an area on which KSM may try merging pages: leave it to ksm_madvise() to decide whether to set it. Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain VM_MERGEABLE areas, to minimize callouts when forking or exiting. Based upon earlier patches by Chris Wright and Izik Eidus. Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: NChris Wright <chrisw@redhat.com> Signed-off-by: NIzik Eidus <ieidus@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shaohua Li 提交于
In my test, 128M memory is hot added, but zone's pcp batch is 0, which is an obvious error. When pages are onlined, zone pcp should be updated accordingly. [akpm@linux-foundation.org: fix warnings] Signed-off-by: NShaohua Li <shaohua.li@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Yakui Zhao <yakui.zhao@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 9月, 2009 3 次提交
-
-
由 Andi Kleen 提交于
Add the high level memory handler that poisons pages that got corrupted by hardware (typically by a two bit flip in a DIMM or a cache) on the Linux level. The goal is to prevent everyone from accessing these pages in the future. This done at the VM level by marking a page hwpoisoned and doing the appropriate action based on the type of page it is. The code that does this is portable and lives in mm/memory-failure.c To quote the overview comment: High level machine check handler. Handles pages reported by the hardware as being corrupted usually due to a 2bit ECC memory or cache failure. This focuses on pages detected as corrupted in the background. When the current CPU tries to consume corruption the currently running process can just be killed directly instead. This implies that if the error cannot be handled for some reason it's safe to just ignore it because no corruption has been consumed yet. Instead when that happens another machine check will happen. Handles page cache pages in various states. The tricky part here is that we can access any page asynchronous to other VM users, because memory failures could happen anytime and anywhere, possibly violating some of their assumptions. This is why this code has to be extremely careful. Generally it tries to use normal locking rules, as in get the standard locks, even if that means the error handling takes potentially a long time. Some of the operations here are somewhat inefficient and have non linear algorithmic complexity, because the data structures have not been optimized for this case. This is in particular the case for the mapping from a vma to a process. Since this case is expected to be rare we hope we can get away with this. There are in principle two strategies to kill processes on poison: - just unmap the data and wait for an actual reference before killing - kill as soon as corruption is detected. Both have advantages and disadvantages and should be used in different situations. Right now both are implemented and can be switched with a new sysctl vm.memory_failure_early_kill The default is early kill. The patch does some rmap data structure walking on its own to collect processes to kill. This is unusual because normally all rmap data structure knowledge is in rmap.c only. I put it here for now to keep everything together and rmap knowledge has been seeping out anyways Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu, Nick Piggin (who did a lot of great work) and others. Cc: npiggin@suse.de Cc: riel@redhat.com Signed-off-by: NAndi Kleen <ak@linux.intel.com> Acked-by: NRik van Riel <riel@redhat.com> Reviewed-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
-
由 Andi Kleen 提交于
Truncating metadata pages is not safe right now before we haven't audited all file systems. To enable truncation only for data address space define a new address_space callback error_remove_page. This is used for memory_failure.c memory error handling. This can be then set to truncate_inode_page() This patch just defines the new operation and adds documentation. Callers and users come in followon patches. Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-
由 Wu Fengguang 提交于
Add a simple way to invalidate a single page This is just a refactoring of the truncate.c code. Originally from Fengguang, modified by Andi Kleen. Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-