- 28 4月, 2008 11 次提交
-
-
由 Mel Gorman 提交于
Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
On NUMA, zone_statistics() is used to record events like numa hit, miss and foreign. It assumes that the first zone in a zonelist is the preferred zone. When multiple zonelists are replaced by one that is filtered, this is no longer the case. This patch records what the preferred zone is rather than assuming the first zone in the zonelist is it. This simplifies the reading of later patches in this set. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: NChristoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Introduce a node_zonelist() helper function. It is used to lookup the appropriate zonelist given a node and a GFP mask. The patch on its own is a cleanup but it helps clarify parts of the two-zonelist-per-node patchset. If necessary, it can be merged with the next patch in this set without problems. Reviewed-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The following patches replace multiple zonelists per node with two zonelists that are filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, the MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered as an alternative fix for the MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters only custom zonelists. The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask which simplifies patches later in the set. The third patch defines/remembers the "preferred zone" for numa statistics, as it is no longer always the first zone in a zonelist. The forth patch replaces multiple zonelists with two zonelists that are filtered. The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The fifth patch introduces helper macros for retrieving the zone and node indices of entries in a zonelist. The final patch introduces filtering of the zonelists based on a nodemask. Two zonelists exist per node, one for normal allocations and one for __GFP_THISNODE. Performance results varied depending on the machine configuration. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists. These are the range of performance losses/gains when running against 2.6.24-rc4-mm1. The set and these machines are a mix of i386, x86_64 and ppc64 both NUMA and non-NUMA. loss to gain Total CPU time on Kernbench: -0.86% to 1.13% Elapsed time on Kernbench: -0.79% to 0.76% page_test from aim9: -4.37% to 0.79% brk_test from aim9: -0.71% to 4.07% fork_test from aim9: -1.84% to 4.60% exec_test from aim9: -0.71% to 1.08% This patch: The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Nothing in the tree uses nopage any more. Remove support for it in the core mm code and documentation (and a few stray references to it in comments). Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
It is not easy to actually understand the "if (!file || !vma_merge())" code, turn it into "if (file && vma_merge())". This makes immediately obvious that the subsequent "if (file)" is superfluous. As Hugh Dickins pointed out, we can also factor out the ->i_writecount corrections, and add a small comment about that. Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hisashi Hifumi 提交于
DIO invalidates page cache through invalidate_inode_pages2_range(). invalidate_inode_pages2_range() sets ret=-EIO when invalidate_complete_page2() fails, but this ret is cleared if do_launder_page() succeed on a page of next index. In this case, dio is carried out even if invalidate_complete_page2() fails on some pages. This can cause inconsistency between memory and blocks on HDD because the page cache still exists. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Chuck Lever <cel@citi.umich.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jeremy Fitzhardinge 提交于
All architectures use an effectively identical definition of online_page(), so just make it common code. x86-64, ia64, powerpc and sh are actually identical; x86-32 is slightly different. x86-32's differences arise because it puts its hotplug pages in the highmem zone. We can handle this in the generic code by inspecting the page to see if its in highmem, and update the totalhigh_pages count appropriately. This leaves init_32.c:free_new_highpage with a single caller, so I folded it into add_one_highpage_init. I also removed an incorrect comment referring to the NUMA case; any NUMA details have already been dealt with by the time online_page() is called. [akpm@linux-foundation.org: fix indenting] Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: NDave Hansen <dave@linux.vnet.ibm.com> Reviewed-by: NKAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com> Tested-by: NKAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Christoph Lameter <clameter@sgi.com> Acked-by: NIngo Molnar <mingo@elte.hu> Acked-by: NYasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Badari Pulavarty 提交于
Generic helper function to remove section mappings and sysfs entries for the section of the memory we are removing. offline_pages() correctly adjusted zone and marked the pages reserved. TODO: Yasunori Goto is working on patches to free up allocations from bootmem. Signed-off-by: NBadari Pulavarty <pbadari@us.ibm.com> Acked-by: NYasunori Goto <y-goto@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
After the loop in walk_pte_range() pte might point to the first address after the pmd it walks. The pte_unmap() is then applied to something bad. Spotted by Roel Kluin and Andreas Schwab. Signed-off-by: NJohannes Weiner <hannes@saeurebad.de> Cc: Roel Kluin <12o3l@tiscali.nl> Cc: Andreas Schwab <schwab@suse.de> Acked-by: NMatt Mackall <mpm@selenic.com> Acked-by: NMikael Pettersson <mikpe@it.uu.se> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 4月, 2008 6 次提交
-
-
由 Christian Borntraeger 提交于
This patch changes the s390 memory management defintions to use the pgste field for dirty and reference bit tracking of host and guest code. Usually on s390, dirty and referenced are tracked in storage keys, which belong to the physical page. This changes with virtualization: The guest and host dirty/reference bits are defined to be the logical OR of the values for the mapping and the physical page. This patch implements the necessary changes in pgtable.h for s390. There is a common code change in mm/rmap.c, the call to page_test_and_clear_young must be moved. This is a no-op for all architecture but s390. page_referenced checks the referenced bits for the physiscal page and for all mappings: o The physical page is checked with page_test_and_clear_young. o The mappings are checked with ptep_test_and_clear_young and friends. Without pgstes (the current implementation on Linux s390) the physical page check is implemented but the mapping callbacks are no-ops because dirty and referenced are not tracked in the s390 page tables. The pgstes introduces guest and host dirty and reference bits for s390 in the host mapping. These mapping must be checked before page_test_and_clear_young resets the reference bit. Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NCarsten Otte <cotte@de.ibm.com> Signed-off-by: NAvi Kivity <avi@qumranet.com>
-
由 Yinghai Lu 提交于
On big systems with lots of memory, don't print out too much during bootup, and make it easy to find if it is continuous. on 256G 8 sockets system will get [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0 [ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0 [ffffe20038700000-ffffe200387fffff] potential offnode page_structs [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1 [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2 [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2 [ffffe20054700000-ffffe200547fffff] potential offnode page_structs [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2 [ffffe20070700000-ffffe200707fffff] potential offnode page_structs [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3 [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4 [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4 [ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4 [ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5 [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6 [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6 [ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6 [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7 instead of a very long print out... Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Yinghai Lu 提交于
split reserve_bootmem_core() into two functions, one which checks conflicts, and one which sets the bits. and make reserve_bootmem to loop bdata_list to cross the nodes. user could be crashkernel and ramdisk..., in case the range provided by those externalities crosses the nodes. Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Yinghai Lu 提交于
need offset alignment when node_boot_start's alignment is less than the alignment required. use local node_boot_start to match alignment - so don't add extra operation in search loop. Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Yinghai Lu 提交于
Make the nodes other than node 0 use bdata->last_success for fast search too. We need to use __alloc_bootmem_core() for vmemmap allocation for other nodes when numa and sparsemem/vmemmap are enabled. Also, make fail_block path increase i with incr only after ALIGN to avoid extra increase when size is larger than align. Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Yinghai Lu 提交于
vmemmap allocation currently has this layout: [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0 [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001800000 on node 0 [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001c00000 on node 0 [ffffe20000600000-ffffe200007fffff] PMD ->ffff810002000000 on node 0 [ffffe20000800000-ffffe200009fffff] PMD ->ffff810002400000 on node 0 ... note that there is a 2M hole between them - not optimal. the root cause is that usemap (24 bytes) will be allocated after every 2M mem_map, and it will push next vmemmap (2M) to the next (2M) alignment. solution: try to allocate the mem_map continously. after the patch, we get: [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0 [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001600000 on node 0 [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001800000 on node 0 [ffffe20000600000-ffffe200007fffff] PMD ->ffff810001a00000 on node 0 [ffffe20000800000-ffffe200009fffff] PMD ->ffff810001c00000 on node 0 ... which is the ideal layout. and usemap will share a page because of they are allocated continuously too: sparse_early_usemap_alloc: usemap = ffff810024e00000 size = 24 sparse_early_usemap_alloc: usemap = ffff810024e00080 size = 24 sparse_early_usemap_alloc: usemap = ffff810024e00100 size = 24 sparse_early_usemap_alloc: usemap = ffff810024e00180 size = 24 ... so we make the bootmem allocation more compact and use less memory for usemap => mission accomplished ;-) Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 24 4月, 2008 1 次提交
-
-
由 Christoph Lameter 提交于
Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 4月, 2008 1 次提交
-
-
由 Pavel Machek 提交于
These are small cleanups all over the tree. Trivial style and comment changes to fs/select.c, kernel/signal.c, kernel/stop_machine.c & mm/pdflush.c Signed-off-by: NPavel Machek <pavel@suse.cz> Signed-off-by: NJesper Juhl <jesper.juhl@gmail.com>
-
- 20 4月, 2008 4 次提交
-
-
由 Daniel Walker 提交于
Signed-off-by: NDaniel Walker <dwalker@mvista.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
-
由 Mike Travis 提交于
* Use new node_to_cpumask_ptr. This creates a pointer to the cpumask for a given node. This definition is in mm patch: asm-generic-add-node_to_cpumask_ptr-macro.patch * Use new set_cpus_allowed_ptr function. Depends on: [mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch [sched-devel]: sched: add new set_cpus_allowed_ptr function [x86/latest]: x86: add cpus_scnprintf function Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Greg Banks <gnb@melbourne.sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: NMike Travis <travis@sgi.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Mike Travis 提交于
* Modify cpuset_cpus_allowed to return the currently allowed cpuset via a pointer argument instead of as the function return value. * Use new set_cpus_allowed_ptr function. * Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses. Depends on: [sched-devel]: sched: add new set_cpus_allowed_ptr function Signed-off-by: NMike Travis <travis@sgi.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Mike Travis 提交于
* Replace usages of CPU_MASK_NONE, CPU_MASK_ALL, NODE_MASK_NONE, NODE_MASK_ALL to reduce stack requirements for large NR_CPUS and MAXNODES counts. * In some cases, the cpumask variable was initialized but then overwritten with another value. This is the case for changes like this: - cpumask_t oldmask = CPU_MASK_ALL; + cpumask_t oldmask; Signed-off-by: NMike Travis <travis@sgi.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 18 4月, 2008 2 次提交
-
-
由 Jason Wessel 提交于
Fix two regressions dealing with the kgdb core. 1) kgdb_skipexception and kgdb_post_primary_code are optional functions that are only required on archs that need special exception fixups. 2) The kernel address space scope must be set on any probe_kernel_* function or archs such as ARCH=arm will not allow access to the kernel memory space. As an example, it is required to allow the full kernel address space is when you the kernel debugger to inspect a system call. Signed-off-by: NJason Wessel <jason.wessel@windriver.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Ingo Molnar 提交于
add probe_kernel_read() and probe_kernel_write(). Uninlined and restricted to kernel range memory only, as suggested by Linus. Signed-off-by: NIngo Molnar <mingo@elte.hu> Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
-
- 16 4月, 2008 3 次提交
-
-
由 KOSAKI Motohiro 提交于
In a5d76b54 (memory unplug: page isolation by KAMEZAWA Hiroyuki), "isolate" migratetype added. but unfortunately, it doesn't treat /proc/pagetypeinfo display logic. this patch add "Isolate" to pagetype name field. /proc/pagetype before: ------------------------------------------------------------------------------------------------------------------------ Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 0, zone DMA, type Unmovable 1 2 2 2 1 2 2 1 1 0 0 Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Movable 2 3 3 1 3 3 2 0 0 0 0 Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone DMA, type <NULL> 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Unmovable 1 9 7 4 1 1 1 1 0 0 0 Node 0, zone Normal, type Reclaimable 5 2 0 0 1 1 0 0 0 1 0 Node 0, zone Normal, type Movable 0 1 1 0 0 0 1 0 0 1 60 Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone Normal, type <NULL> 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone HighMem, type Unmovable 0 0 1 1 1 0 1 1 2 2 0 Node 0, zone HighMem, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone HighMem, type Movable 236 62 6 2 2 1 1 0 1 1 16 Node 0, zone HighMem, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone HighMem, type <NULL> 0 0 0 0 0 0 0 0 0 0 0 Number of blocks type Unmovable Reclaimable Movable Reserve <NULL> Node 0, zone DMA 1 0 2 1 0 Node 0, zone Normal 10 40 169 1 0 Node 0, zone HighMem 2 0 283 1 0 after: ------------------------------------------------------------------------------------------------------------------------ Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 0, zone DMA, type Unmovable 1 2 2 2 1 2 2 1 1 0 0 Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Movable 2 3 3 1 3 3 2 0 0 0 0 Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Unmovable 0 2 1 1 0 1 0 0 0 0 0 Node 0, zone Normal, type Reclaimable 1 1 1 1 1 0 1 1 1 0 0 Node 0, zone Normal, type Movable 0 1 1 1 0 1 0 1 0 0 196 Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone HighMem, type Unmovable 0 1 0 0 0 1 1 1 2 2 0 Node 0, zone HighMem, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone HighMem, type Movable 1 0 1 1 0 0 0 0 1 0 200 Node 0, zone HighMem, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone HighMem, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Number of blocks type Unmovable Reclaimable Movable Reserve Isolate Node 0, zone DMA 1 0 2 1 0 Node 0, zone Normal 8 4 207 1 0 Node 0, zone HighMem 2 0 283 1 0 Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Li Zefan 提交于
When I used a test program to fork mass processes and immediately move them to a cgroup where the memory limit is low enough to trigger oom kill, I got oops: BUG: unable to handle kernel NULL pointer dereference at 0000000000000808 IP: [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 PGD 4c95f067 PUD 4406c067 PMD 0 Oops: 0002 [1] SMP CPU 2 Modules linked in: Pid: 11973, comm: a.out Not tainted 2.6.25-rc7 #5 RIP: 0010:[<ffffffff8045c47f>] [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 RSP: 0018:ffff8100448c7c30 EFLAGS: 00010002 RAX: 0000000000000202 RBX: 0000000000000009 RCX: 000000000001c9f3 RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000808 RBP: ffff81007e444080 R08: 0000000000000000 R09: ffff8100448c7900 R10: ffff81000105f480 R11: 00000100ffffffff R12: ffff810067c84140 R13: 0000000000000001 R14: ffff8100441d0018 R15: ffff81007da56200 FS: 00007f70eb1856f0(0000) GS:ffff81007fbad3c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000808 CR3: 000000004498a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process a.out (pid: 11973, threadinfo ffff8100448c6000, task ffff81007da533e0) Stack: ffffffff8023ef5a 00000000000000d0 ffffffff80548dc0 00000000000000d0 ffff810067c84140 ffff81007e444080 ffffffff8026cef9 00000000000000d0 ffff8100441d0000 00000000000000d0 ffff8100441d0000 ffff8100505445c0 Call Trace: [<ffffffff8023ef5a>] ? force_sig_info+0x25/0xb9 [<ffffffff8026cef9>] ? oom_kill_task+0x77/0xe2 [<ffffffff8026d696>] ? mem_cgroup_out_of_memory+0x55/0x67 [<ffffffff802910ad>] ? mem_cgroup_charge_common+0xec/0x202 [<ffffffff8027997b>] ? handle_mm_fault+0x24e/0x77f [<ffffffff8022c4af>] ? default_wake_function+0x0/0xe [<ffffffff8027a17a>] ? get_user_pages+0x2ce/0x3af [<ffffffff80290fee>] ? mem_cgroup_charge_common+0x2d/0x202 [<ffffffff8027a441>] ? make_pages_present+0x8e/0xa4 [<ffffffff8027d1ab>] ? mmap_region+0x373/0x429 [<ffffffff8027d7eb>] ? do_mmap_pgoff+0x2ff/0x364 [<ffffffff80210471>] ? sys_mmap+0xe5/0x111 [<ffffffff8020bfc9>] ? tracesys+0xdc/0xe1 Code: 00 00 01 48 8b 3c 24 e9 46 d4 dd ff f0 ff 07 48 8b 3c 24 e9 3a d4 dd ff fe 07 48 8b 3c 24 e9 2f d4 dd ff 9c 58 fa ba 00 01 00 00 <f0> 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00 RIP [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 RSP <ffff8100448c7c30> CR2: 0000000000000808 ---[ end trace c3702fa668021ea4 ]--- It's reproducable in a x86_64 box, but doesn't happen in x86_32. This is because tsk->sighand is not guarded by RCU, so we have to hold tasklist_lock, just as what out_of_memory() does. Signed-off-by: NLi Zefan <lizf@cn.fujitsu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Rientjes <rientjes@cs.washington.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ingo Molnar 提交于
Fix memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS. That causes this loop to happily walk over the limit of the sparse memory section map: for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) { unsigned long section = pfn_to_section_nr(pfn); struct mem_section *ms; sparse_index_init(section, nid); set_section_nid(section, nid); ms = __nr_to_section(section); if (!ms->section_mem_map) ms->section_mem_map = sparse_encode_early_nid(nid) | SECTION_MARKED_PRESENT; 'ms' will be out of bounds and we'll corrupt a small amount of memory by encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it. The corruption might happen when encoding a non-zero node ID, or due to the SECTION_MARKED_PRESENT which is 0x1: mmzone.h:#define SECTION_MARKED_PRESENT (1UL<<0) The fix is to sanity check anything the architecture passes to sparsemem. This bug seems to be rather old (as old as sparsemem support itself), but the exact incarnation depended on random details like configs, which made this bug more prominent in v2.6.25-to-be. An additional enhancement might be to print a warning about ignored or trimmed memory ranges. Signed-off-by: NIngo Molnar <mingo@elte.hu> Tested-by: NChristoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Yinghai Lu <Yinghai.Lu@sun.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 4月, 2008 6 次提交
-
-
由 Christoph Lameter 提交于
The per node counters are used mainly for showing data through the sysfs API. If that API is not compiled in then there is no point in keeping track of this data. Disable counters for the number of slabs and the number of total slabs if !SLUB_DEBUG. Incrementing the per node counters is also accessing a potentially contended cacheline so this could actually be a performance benefit to embedded systems. SLABINFO support is also affected. It now must depends on SLUB_DEBUG (which is on by default). Patch also avoids a check for a NULL kmem_cache_node pointer in new_slab() if the system is not compiled with NUMA support. [penberg@cs.helsinki.fi: fix oops and move ->nr_slabs into CONFIG_SLUB_DEBUG] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
-
由 Christoph Lameter 提交于
__free_slab does some diagnostics. The resetting of mapcount etc in discard_slab() can interfere with debug processing. So move the reset immediately before the page is freed. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
-
由 Christoph Lameter 提交于
Only output per cpu stats if the kernel is build for SMP. Use a capital "C" as a leading character for the processor number (same as the numa statistics that also use a capital letter "N"). Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
-
由 Christoph Lameter 提交于
count_partial() is used by both slabinfo and the sysfs proc support. Move the function directly before the beginning of the sysfs code so that it can be easily found. Rework the preprocessor conditional to take into account that slub sysfs support depends on CONFIG_SYSFS *and* CONFIG_SLUB_DEBUG. Make CONFIG_SLUB_STATS depend on CONFIG_SLUB_DEBUG and CONFIG_SYSFS. There is no point of keeping statistics if no one can restrive them. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
-
由 Christoph Lameter 提交于
Move the definition of kmalloc_caches_dma() into a later #ifdef CONFIG_ZONE_DMA. This saves one #ifdef and leaves us with a total of two #ifdefs for dma slab support. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
-
由 Pekka Enberg 提交于
As spotted by kmemcheck, we need to initialize the per-CPU ->stat array before using it. [kmem_cache_cpu structures are usually allocated from arrays defined via DEFINE_PER_CPU that are zeroed so we have not noticed this so far --cl]. Reported-by: NVegard Nossum <vegard.nossum@gmail.com> Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
-
- 09 4月, 2008 1 次提交
-
-
由 KAMEZAWA Hiroyuki 提交于
This should be N_NORMAL_MEMORY. N_NORMAL_MEMORY is "true" if a node has memory for the kernel. N_HIGH_MEMORY is "true" if a node has memory for HIGHMEM. (If CONFIG_HIGHMEM=n, always "true") This check is used for testing whether we can use kmalloc_node() on a node. Then, if there is a node which only contains HIGHMEM, the system will call kmalloc_node() which doesn't contain memory for the kernel. If it happens under SLUB, the kernel will panic. I think this only happens on x86_32-numa. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 4月, 2008 1 次提交
-
-
由 Balbir Singh 提交于
A boot option for the memory controller was discussed on lkml. It is a good idea to add it, since it saves memory for people who want to turn off the memory controller. By default the option is on for the following two reasons: 1. It provides compatibility with the current scheme where the memory controller turns on if the config option is enabled 2. It allows for wider testing of the memory controller, once the config option is enabled We still allow the create, destroy callbacks to succeed, since they are not aware of boot options. We do not populate the directory will memory resource controller specific files. Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 4月, 2008 1 次提交
-
-
由 Christoph Lameter 提交于
Small typo in the patch recently merged to avoid the unused symbol message for count_partial(). Discussion thread with confirmation of fix at http://marc.info/?t=120696854400001&r=1&w=2 Typo in the check if we need the count_partial function that was introduced by 53625b42Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 3月, 2008 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 28 3月, 2008 1 次提交
-
-
由 Linus Torvalds 提交于
This reverts commit 3811dbf6. The masking was not at all useless, and it was sensible. We handle GFP_ZERO in the caller, and passing it down to any page allocator logic is buggy and wrong. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 3月, 2008 1 次提交
-
-
由 Nishanth Aravamudan 提交于
Running the counters testcase from libhugetlbfs results in on 2.6.25-rc5 and 2.6.25-rc5-mm1: BUG: soft lockup - CPU#3 stuck for 61s! [counters:10531] NIP: c0000000000d1f3c LR: c0000000000d1f2c CTR: c0000000001b5088 REGS: c000005db12cb360 TRAP: 0901 Not tainted (2.6.25-rc5-autokern1) MSR: 8000000000009032 <EE,ME,IR,DR> CR: 48008448 XER: 20000000 TASK = c000005dbf3d6000[10531] 'counters' THREAD: c000005db12c8000 CPU: 3 GPR00: 0000000000000004 c000005db12cb5e0 c000000000879228 0000000000000004 GPR04: 0000000000000010 0000000000000000 0000000000200200 0000000000100100 GPR08: c0000000008aba10 000000000000ffff 0000000000000004 0000000000000000 GPR12: 0000000028000442 c000000000770080 NIP [c0000000000d1f3c] .return_unused_surplus_pages+0x84/0x18c LR [c0000000000d1f2c] .return_unused_surplus_pages+0x74/0x18c Call Trace: [c000005db12cb5e0] [c000005db12cb670] 0xc000005db12cb670 (unreliable) [c000005db12cb670] [c0000000000d24c4] .hugetlb_acct_memory+0x2e0/0x354 [c000005db12cb740] [c0000000001b5048] .truncate_hugepages+0x1d4/0x214 [c000005db12cb890] [c0000000001b50a4] .hugetlbfs_delete_inode+0x1c/0x3c [c000005db12cb920] [c000000000103fd8] .generic_delete_inode+0xf8/0x1c0 [c000005db12cb9b0] [c0000000001b5100] .hugetlbfs_drop_inode+0x3c/0x24c [c000005db12cba50] [c00000000010287c] .iput+0xdc/0xf8 [c000005db12cbad0] [c0000000000fee54] .dentry_iput+0x12c/0x194 [c000005db12cbb60] [c0000000000ff050] .d_kill+0x6c/0xa4 [c000005db12cbbf0] [c0000000000ffb74] .dput+0x18c/0x1b0 [c000005db12cbc70] [c0000000000e9e98] .__fput+0x1a4/0x1e8 [c000005db12cbd10] [c0000000000e61ec] .filp_close+0xb8/0xe0 [c000005db12cbda0] [c0000000000e62d0] .sys_close+0xbc/0x134 [c000005db12cbe30] [c00000000000872c] syscall_exit+0x0/0x40 Instruction dump: ebbe8038 38800010 e8bf0002 3bbd0008 7fa3eb78 38a50001 7ca507b4 4818df25 60000000 38800010 38a00000 7c601b78 <7fa3eb78> 2f800010 409d0008 38000010 This was tracked down to a potential livelock in return_unused_surplus_hugepages(). In the case where we have surplus pages on some node, but no free pages on the same node, we may never break out of the loop. To avoid this livelock, terminate the search if we iterate a number of times equal to the number of online nodes without freeing a page. Thanks to Andy Whitcroft and Adam Litke for helping with debugging and the patch. Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-