- 08 7月, 2008 1 次提交
-
-
由 Yinghai Lu 提交于
in case we have kva before ramdisk on a node, we still need to use those ranges. v2: reserve_early kva ram area, in case there are holes in highmem, to avoid those area could be treat as free high pages. Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 04 7月, 2008 1 次提交
-
-
由 Mel Gorman 提交于
The non-NUMA case of build_zonelist_cache() would initialize the zlcache_ptr for both node_zonelists[] to NULL. Which is problematic, since non-NUMA only has a single node_zonelists[] entry, and trying to zero the non-existent second one just overwrote the nr_zones field instead. As kswapd uses this value to determine what reclaim work is necessary, the result is that kswapd never reclaims. This causes processes to stall frequently in low-memory situations as they always direct reclaim. This patch initialises zlcache_ptr correctly. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Tested-by: NDan Williams <dan.j.williams@intel.com> [ Simplified patch a bit ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 6月, 2008 2 次提交
-
-
由 Yinghai Lu 提交于
Now we are using register_e820_active_regions() instead of add_active_range() directly. So end_pfn could be different between the value in early_node_map to node_end_pfn. So we need to make shrink_active_range() smarter. shrink_active_range() is a generic MM function in mm/page_alloc.c but it is only used on 32-bit x86. Should we move it back to some file in arch/x86? Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Russ Anderson 提交于
Minor source code cleanup of page flags in mm/page_alloc.c. Move the definition of the groups of bits to page-flags.h. The purpose of this clean up is that the next patch will conditionally add a page flag to the groups. Doing that in a header file is cleaner than adding #ifdefs to the C code. Signed-off-by: NRuss Anderson <rja@sgi.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 6月, 2008 1 次提交
-
-
由 Yinghai Lu 提交于
also fix the print out of node_remap_end_vaddr Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 25 5月, 2008 3 次提交
-
-
由 Heiko Carstens 提交于
Trying to add memory via add_memory() from within an initcall function results in bootmem alloc of 163840 bytes failed! Kernel panic - not syncing: Out of memory This is caused by zone_wait_table_init() which uses system_state to decide if it should use the bootmem allocator or not. When initcalls are handled the system_state is still SYSTEM_BOOTING but the bootmem allocator doesn't work anymore. So the allocation will fail. To fix this use slab_is_available() instead as indicator like we do it everywhere else. [akpm@linux-foundation.org: coding-style fix] Reviewed-by: NAndy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NYasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Whitcroft 提交于
When booting 2.6.26-rc3 on a multi-node x86_32 numa system we are seeing panics when trying node local allocations: BUG: unable to handle kernel NULL pointer dereference at 0000034c IP: [<c1042507>] get_page_from_freelist+0x4a/0x18e *pdpt = 00000000013a7001 *pde = 0000000000000000 Oops: 0000 [#1] SMP Modules linked in: Pid: 0, comm: swapper Not tainted (2.6.26-rc3-00003-g5abc28d #82) EIP: 0060:[<c1042507>] EFLAGS: 00010282 CPU: 0 EIP is at get_page_from_freelist+0x4a/0x18e EAX: c1371ed8 EBX: 00000000 ECX: 00000000 EDX: 00000000 ESI: f7801180 EDI: 00000000 EBP: 00000000 ESP: c1371ec0 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Process swapper (pid: 0, ti=c1370000 task=c12f5b40 task.ti=c1370000) Stack: 00000000 00000000 00000000 00000000 000612d0 000412d0 00000000 000412d0 f7801180 f7c0101c f7c01018 c10426e4 f7c01018 00000001 00000044 00000000 00000001 c12f5b40 00000001 00000010 00000000 000412d0 00000286 000412d0 Call Trace: [<c10426e4>] __alloc_pages_internal+0x99/0x378 [<c10429ca>] __alloc_pages+0x7/0x9 [<c105e0e8>] kmem_getpages+0x66/0xef [<c105ec55>] cache_grow+0x8f/0x123 [<c105f117>] ____cache_alloc_node+0xb9/0xe4 [<c105f427>] kmem_cache_alloc_node+0x92/0xd2 [<c122118c>] setup_cpu_cache+0xaf/0x177 [<c105e6ca>] kmem_cache_create+0x2c8/0x353 [<c13853af>] kmem_cache_init+0x1ce/0x3ad [<c13755c5>] start_kernel+0x178/0x1ee This occurs when we are scanning the zonelists looking for a ZONE_NORMAL page. In this system there is only ZONE_DMA and ZONE_NORMAL memory on node 0, all other nodes are mapped above 4GB physical. Here is a dump of the zonelists from this system: zonelists pgdat=c1400000 0: c14006c0:2 f7c006c0:2 f7e006c0:2 c1400360:1 c1400000:0 1: c14006c0:2 c1400360:1 c1400000:0 zonelists pgdat=f7c00000 0: f7c006c0:2 f7e006c0:2 c14006c0:2 c1400360:1 c1400000:0 1: f7c006c0:2 zonelists pgdat=f7e00000 0: f7e006c0:2 c14006c0:2 f7c006c0:2 c1400360:1 c1400000:0 1: f7e006c0:2 When performing a node local allocation we call get_page_from_freelist() looking for a page. It in turn calls first_zones_zonelist() which returns a preferred_zone. Where there are no applicable zones this will be NULL. However we use this unconditionally, leading to this panic. Where there are no applicable zones there is no possibility of a successful allocation, so simply fail the allocation. Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
In a zone's present pages number, account for all pages occupied by the memory map, including a partial. Signed-off-by: NJohannes Weiner <hannes@saeurebad.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 5月, 2008 1 次提交
-
-
由 Heiko Carstens 提交于
Trying to online a new memory section that was added via memory hotplug sometimes results in crashes when the new pages are added via __free_page. Reason for that is that the pageblock bitmap isn't initialized and hence contains random stuff. That means that get_pageblock_migratetype() returns also random stuff and therefore list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); in __free_one_page() tries to do a list_add to something that isn't even necessarily a list. This happens since 86051ca5 ("mm: fix usemap initialization") which makes sure that the pageblock bitmap gets only initialized for pages present in a zone. Unfortunately for hot-added memory the zones "grow" after the memmap and the pageblock memmap have been initialized. Which means that the new pages have an unitialized bitmap. To solve this the calls to grow_zone_span() and grow_pgdat_span() are moved to __add_zone() just before the initialization happens. The patch also moves the two functions since __add_zone() is the only caller and I didn't want to add a forward declaration. Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 4月, 2008 1 次提交
-
-
由 Thomas Gleixner 提交于
We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 4月, 2008 3 次提交
-
-
由 Nishanth Aravamudan 提交于
Because of page order checks in __alloc_pages(), hugepage (and similarly large order) allocations will not retry unless explicitly marked __GFP_REPEAT. However, the current retry logic is nearly an infinite loop (or until reclaim does no progress whatsoever). For these costly allocations, that seems like overkill and could potentially never terminate. Mel observed that allowing current __GFP_REPEAT semantics for hugepage allocations essentially killed the system. I believe this is because we may continue to reclaim small orders of pages all over, but never have enough to satisfy the hugepage allocation request. This is clearly only a problem for large order allocations, of which hugepages are the most obvious (to me). Modify try_to_free_pages() to indicate how many pages were reclaimed. Use that information in __alloc_pages() to eventually fail a large __GFP_REPEAT allocation when we've reclaimed an order of pages equal to or greater than the allocation's order. This relies on lumpy reclaim functioning as advertised. Due to fragmentation, lumpy reclaim may not be able to free up the order needed in one invocation, so multiple iterations may be requred. In other words, the more fragmented memory is, the more retry attempts __GFP_REPEAT will make (particularly for higher order allocations). This changes the semantics of __GFP_REPEAT subtly, but *only* for allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size allocations, we will try up to some point (at least 1<<order reclaimed pages), rather than forever (which is the case for allocations <= PAGE_ALLOC_COSTLY_ORDER). This change improves the /proc/sys/vm/nr_hugepages interface with a follow-on patch that makes pool allocations use __GFP_REPEAT. Rather than administrators repeatedly echo'ing a particular value into the sysctl, and forcing reclaim into action manually, this change allows for the sysctl to attempt a reasonable effort itself. Similarly, dynamic pool growth should be more successful under load, as lumpy reclaim can try to free up pages, rather than failing right away. Choosing to reclaim only up to the order of the requested allocation strikes a balance between not failing hugepage allocations and returning to the caller when it's unlikely to every succeed. Because of lumpy reclaim, if we have freed the order requested, hopefully it has been in big chunks and those chunks will allow our allocation to succeed. If that isn't the case after freeing up the current order, I don't think it is likely to succeed in the future, although it is possible given a particular fragmentation pattern. Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Tested-by: NMel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nishanth Aravamudan 提交于
The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the core VM have somewhat differing comments as to their actual semantics. Annoyingly, the flags definition has inline and header comments, which might be interpreted as not being equivalent. Just add references to the header comments in the inline ones so they don't go out of sync in the future. In their use in __alloc_pages() clarify that the current implementation treats low-order allocations and __GFP_REPEAT allocations as distinct cases. To clarify, the flags' semantics are: __GFP_NORETRY means try no harder than one run through __alloc_pages __GFP_REPEAT means __GFP_NOFAIL __GFP_NOFAIL means repeat forever order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
usemap must be initialized only when pfn is within zone. If not, it corrupts memory. And this patch also reduces the number of calls to set_pageblock_migratetype() from (pfn & (pageblock_nr_pages -1) to !(pfn & (pageblock_nr_pages-1) it should be called once per pageblock. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh@veritas.com> Cc: Shi Weihua <shiwh@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 28 4月, 2008 10 次提交
-
-
由 Pavel Machek 提交于
Remove hand-coded get_order() from page_alloc.c. Signed-off-by: NPavel Machek <pavel@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yasunori Goto 提交于
This patch is to free memmaps which is allocated by bootmem. Freeing usemap is not necessary. The pages of usemap may be necessary for other sections. If removing section is last section on the node, its section is the final user of usemap page. (usemaps are allocated on its section by previous patch.) But it shouldn't be freed too, because the section must be logical offline state which all pages are isolated against page allocater. If it is freed, page alloctor may use it which will be removed physically soon. It will be disaster. So, this patch keeps it as it is. Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
Remove aliases of PG_xxx. We can easily drop those now and alias by specifying the PG_xxx flag in the macro that generates the functions. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 S.Caglar Onur 提交于
zlc_setup(): handle jiffies wraparound (10ed273f) changes tab with spaces Signed-off-by: NS.Caglar Onur <caglar@pardus.org.tr> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The MPOL_BIND policy creates a zonelist that is used for allocations controlled by that mempolicy. As the per-node zonelist is already being filtered based on a zone id, this patch adds a version of __alloc_pages() that takes a nodemask for further filtering. This eliminates the need for MPOL_BIND to create a custom zonelist. A positive benefit of this is that allocations using MPOL_BIND now use the local node's distance-ordered zonelist instead of a custom node-id-ordered zonelist. I.e., pages will be allocated from the closest allowed node with available memory. [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments] [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask] [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
On NUMA, zone_statistics() is used to record events like numa hit, miss and foreign. It assumes that the first zone in a zonelist is the preferred zone. When multiple zonelists are replaced by one that is filtered, this is no longer the case. This patch records what the preferred zone is rather than assuming the first zone in the zonelist is it. This simplifies the reading of later patches in this set. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: NChristoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Introduce a node_zonelist() helper function. It is used to lookup the appropriate zonelist given a node and a GFP mask. The patch on its own is a cleanup but it helps clarify parts of the two-zonelist-per-node patchset. If necessary, it can be merged with the next patch in this set without problems. Reviewed-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The following patches replace multiple zonelists per node with two zonelists that are filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, the MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered as an alternative fix for the MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters only custom zonelists. The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask which simplifies patches later in the set. The third patch defines/remembers the "preferred zone" for numa statistics, as it is no longer always the first zone in a zonelist. The forth patch replaces multiple zonelists with two zonelists that are filtered. The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The fifth patch introduces helper macros for retrieving the zone and node indices of entries in a zonelist. The final patch introduces filtering of the zonelists based on a nodemask. Two zonelists exist per node, one for normal allocations and one for __GFP_THISNODE. Performance results varied depending on the machine configuration. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists. These are the range of performance losses/gains when running against 2.6.24-rc4-mm1. The set and these machines are a mix of i386, x86_64 and ppc64 both NUMA and non-NUMA. loss to gain Total CPU time on Kernbench: -0.86% to 1.13% Elapsed time on Kernbench: -0.79% to 0.76% page_test from aim9: -4.37% to 0.79% brk_test from aim9: -0.71% to 4.07% fork_test from aim9: -1.84% to 4.60% exec_test from aim9: -0.71% to 1.08% This patch: The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 4月, 2008 1 次提交
-
-
由 Mike Travis 提交于
* Use new node_to_cpumask_ptr. This creates a pointer to the cpumask for a given node. This definition is in mm patch: asm-generic-add-node_to_cpumask_ptr-macro.patch * Use new set_cpus_allowed_ptr function. Depends on: [mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch [sched-devel]: sched: add new set_cpus_allowed_ptr function [x86/latest]: x86: add cpus_scnprintf function Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Greg Banks <gnb@melbourne.sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: NMike Travis <travis@sgi.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 05 3月, 2008 2 次提交
-
-
由 Hugh Dickins 提交于
Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a "Bad page state" and clear: most users don't have CONFIG_DEBUG_VM on, and if it were set here, it'd likely cause corruption when the page is reused. Don't use page_assign_page_cgroup to clear it: that should be private to memcontrol.c, and always called with the lock taken; and memmap_init_zone doesn't need it either - like page->mapping and other pointers throughout the kernel, Linux assumes pointers in zeroed structures are NULL pointers. Instead use page_reset_bad_cgroup, added to memcontrol.h for this only. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
jiffies subtraction may cause an overflow problem. It should be using time_after(). [akpm@linux-foundation.org: include jiffies.h] Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Paul Jackson <pj@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 2月, 2008 1 次提交
-
-
由 Alexander van Heukelum 提交于
WARNING: vmlinux.o(.meminit.text+0x649): Section mismatch in reference from the function free_area_init_core() to the function .init.text:setup_usemap() The function __meminit free_area_init_core() references a function __init setup_usemap(). If free_area_init_core is only used by setup_usemap then annotate free_area_init_core with a matching annotation. The warning is covers this stack of functions in mm/page_alloc.c: alloc_bootmem_node must be marked __init. alloc_bootmem_node is used by setup_usemap, if !SPARSEMEM. (usemap_size is only used by setup_usemap, if !SPARSEMEM.) setup_usemap is only used by free_area_init_core. free_area_init_core is only used by free_area_init_node. free_area_init_node is used by: arch/alpha/mm/numa.c: __init paging_init() arch/arm/mm/init.c: __init bootmem_init_node() arch/avr32/mm/init.c: __init paging_init() arch/cris/arch-v10/mm/init.c: __init paging_init() arch/cris/arch-v32/mm/init.c: __init paging_init() arch/m32r/mm/discontig.c: __init zone_sizes_init() arch/m32r/mm/init.c: __init zone_sizes_init() arch/m68k/mm/motorola.c: __init paging_init() arch/m68k/mm/sun3mmu.c: __init paging_init() arch/mips/sgi-ip27/ip27-memory.c: __init paging_init() arch/parisc/mm/init.c: __init paging_init() arch/sparc/mm/srmmu.c: __init srmmu_paging_init() arch/sparc/mm/sun4c.c: __init sun4c_paging_init() arch/sparc64/mm/init.c: __init paging_init() mm/page_alloc.c: __init free_area_init_nodes() mm/page_alloc.c: __init free_area_init() and mm/memory_hotplug.c: hotadd_new_pgdat() hotadd_new_pgdat can not be an __init function, but: It is compiled for MEMORY_HOTPLUG configurations only MEMORY_HOTPLUG depends on SPARSEMEM || X86_64_ACPI_NUMA X86_64_ACPI_NUMA depends on X86_64 ARCH_FLATMEM_ENABLE depends on X86_32 ARCH_DISCONTIGMEM_ENABLE depends on X86_32 So X86_64_ACPI_NUMA implies SPARSEMEM, right? So we can mark the stack of functions __init for !SPARSEMEM, but we must mark them __meminit for SPARSEMEM configurations. This is ok, because then the calls to alloc_bootmem_node are also avoided. Compile-tested on: silly minimal config defconfig x86_32 defconfig x86_64 defconfig x86_64 -HIBERNATION +MEMORY_HOTPLUG Signed-off-by: NAlexander van Heukelum <heukelum@fastmail.fm> Reviewed-by: NSam Ravnborg <sam@ravnborg.org> Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 2月, 2008 1 次提交
-
-
由 Harvey Harrison 提交于
Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 2月, 2008 1 次提交
-
-
由 Balbir Singh 提交于
Add the accounting hooks. The accounting is carried out for RSS and Page Cache (unmapped) pages. There is now a common limit and accounting for both. The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap() time. Page cache is accounted at add_to_page_cache(), __delete_from_page_cache(). Swap cache is also accounted for. Each page's page_cgroup is protected with the last bit of the page_cgroup pointer, this makes handling of race conditions involving simultaneous mappings of a page easier. A reference count is kept in the page_cgroup to deal with cases where a page might be unmapped from the RSS of all tasks, but still lives in the page cache. Credits go to Vaidyanathan Srinivasan for helping with reference counting work of the page cgroup. Almost all of the page cache accounting code has help from Vaidyanathan Srinivasan. [hugh@veritas.com: fix swapoff breakage] [akpm@linux-foundation.org: fix locking] Signed-off-by: NVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: <Valdis.Kletnieks@vt.edu> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 2月, 2008 5 次提交
-
-
由 Larry Woodman 提交于
The show_mem() output does not include the total number of pagecache pages. This would be helpful when analyzing the debug information in the /var/log/messages file after OOM kills occur. This patch includes the total pagecache pages in that output. Signed-off-by: NLarry Woodman <lwoodman@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Harvey Harrison 提交于
fastcall is always defined to be empty, remove it [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andi Kleen 提交于
Signed-off-by: NAndi Kleen <ak@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
We have repeatedly discussed if the cold pages still have a point. There is one way to join the two lists: Use a single list and put the cold pages at the end and the hot pages at the beginning. That way a single list can serve for both types of allocations. The discussion of the RFC for this and Mel's measurements indicate that there may not be too much of a point left to having separate lists for hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2). Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Martin Bligh <mbligh@mbligh.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
- Add comments explaing how drain_pages() works. - Eliminate useless functions - Rename drain_all_local_pages to drain_all_pages(). It does drain all pages not only those of the local processor. - Eliminate useless interrupt off / on sequences. drain_pages() disables interrupts on its own. The execution thread is pinned to processor by the caller. So there is no need to disable interrupts. - Put drain_all_pages() declaration in gfp.h and remove the declarations from suspend.h and from mm/memory_hotplug.c - Make software suspend call drain_all_pages(). The draining of processor local pages is may not the right approach if software suspend wants to support SMP. If they call drain_all_pages then we can make drain_pages() static. [akpm@linux-foundation.org: fix build] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Daniel Walker <dwalker@mvista.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 1月, 2008 1 次提交
-
-
由 Sam Ravnborg 提交于
With CONFIG_HOTPLUG=n and CONFIG_HOTPLUG_CPU=y we saw following warning: WARNING: mm/built-in.o(.text+0x6864): Section mismatch: reference to .init.text: (between 'process_zones' and 'pageset_cpuup_callback') The culprit was zone_batchsize() which were annotated __devinit but used from process_zones() which is annotated __cpuinit. zone_batchsize() are used from another function annotated __meminit so the only valid option is to drop the annotation of zone_batchsize() so we know it is always valid to use it. Signed-off-by: NSam Ravnborg <sam@ravnborg.org> Acked-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 1月, 2008 1 次提交
-
-
由 Thomas Bogendoerfer 提交于
When using FLAT_MEMORY and ARCH_PFN_OFFSET is not 0, the kernel crashes in memmap_init_zone(). This bug got introduced by commit c713216dSigned-off-by: NThomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: Bob Picco <bob.picco@hp.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Andi Kleen <ak@muc.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Keith Mannthey" <kmannth@gmail.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 12月, 2007 1 次提交
-
-
由 Mel Gorman 提交于
In some cases the IO subsystem is able to merge requests if the pages are adjacent in physical memory. This was achieved in the allocator by having expand() return pages in physically contiguous order in situations were a large buddy was split. However, list-based anti-fragmentation changed the order pages were returned in to avoid searching in buffered_rmqueue() for a page of the appropriate migrate type. This patch restores behaviour of rmqueue_bulk() preserving the physical order of pages returned by the allocator without incurring increased search costs for anti-fragmentation. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Mark Lord <mlord@pobox.com Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 11月, 2007 1 次提交
-
-
由 Mel Gorman 提交于
Ordinarily the size of a pageblock is determined at compile-time based on the hugepage size. On PPC64, the hugepage size is determined at runtime based on what is supported by the machine. With legacy machines such as iSeries that do not support hugepages, HPAGE_SHIFT is 0. This results in pageblock_order being set to -PAGE_SHIFT and a crash results shortly afterwards. This patch adds a function to select a sensible value for pageblock order by default when HUGETLB_PAGE_SIZE_VARIABLE is set. It checks that HPAGE_SHIFT is a sensible value before using the hugepage size; if it is not MAX_ORDER-1 is used. This is a fix for 2.6.24. Credit goes to Stephen Rothwell for identifying the bug and testing candidate patches. Additional credit goes to Andy Whitcroft for spotting a problem with respects to IA-64 before releasing. Additional credit to David Gibson for testing with the libhugetlbfs test suite. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Tested-by: NStephen Rothwell <sfr@canb.auug.org.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 11月, 2007 1 次提交
-
-
由 Hugh Dickins 提交于
2.6.11 gave __GFP_ZERO's prep_zero_page a bogus "highmem may have to wait" assertion. Presumably added under the misconception that clear_highpage uses nonatomic kmap; but then and now it uses kmap_atomic, so no problem. Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 11月, 2007 1 次提交
-
-
由 Linus Torvalds 提交于
This reverts commit 5adc5be7. Alexey Dobriyan reports that it causes huge slowdowns under some loads, in his case a "mkfs.ext2" on a 30G partition. With the placement bias, the mkfs took over four minutes, with it reverted it's back to about ten seconds for Alexey. Reported-and-tested-by: NAlexey Dobriyan <adobriyan@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-