- 29 4月, 2008 1 次提交
-
-
由 Nishanth Aravamudan 提交于
Because of page order checks in __alloc_pages(), hugepage (and similarly large order) allocations will not retry unless explicitly marked __GFP_REPEAT. However, the current retry logic is nearly an infinite loop (or until reclaim does no progress whatsoever). For these costly allocations, that seems like overkill and could potentially never terminate. Mel observed that allowing current __GFP_REPEAT semantics for hugepage allocations essentially killed the system. I believe this is because we may continue to reclaim small orders of pages all over, but never have enough to satisfy the hugepage allocation request. This is clearly only a problem for large order allocations, of which hugepages are the most obvious (to me). Modify try_to_free_pages() to indicate how many pages were reclaimed. Use that information in __alloc_pages() to eventually fail a large __GFP_REPEAT allocation when we've reclaimed an order of pages equal to or greater than the allocation's order. This relies on lumpy reclaim functioning as advertised. Due to fragmentation, lumpy reclaim may not be able to free up the order needed in one invocation, so multiple iterations may be requred. In other words, the more fragmented memory is, the more retry attempts __GFP_REPEAT will make (particularly for higher order allocations). This changes the semantics of __GFP_REPEAT subtly, but *only* for allocations > PAGE_ALLOC_COSTLY_ORDER. With this patch, for those size allocations, we will try up to some point (at least 1<<order reclaimed pages), rather than forever (which is the case for allocations <= PAGE_ALLOC_COSTLY_ORDER). This change improves the /proc/sys/vm/nr_hugepages interface with a follow-on patch that makes pool allocations use __GFP_REPEAT. Rather than administrators repeatedly echo'ing a particular value into the sysctl, and forcing reclaim into action manually, this change allows for the sysctl to attempt a reasonable effort itself. Similarly, dynamic pool growth should be more successful under load, as lumpy reclaim can try to free up pages, rather than failing right away. Choosing to reclaim only up to the order of the requested allocation strikes a balance between not failing hugepage allocations and returning to the caller when it's unlikely to every succeed. Because of lumpy reclaim, if we have freed the order requested, hopefully it has been in big chunks and those chunks will allow our allocation to succeed. If that isn't the case after freeing up the current order, I don't think it is likely to succeed in the future, although it is possible given a particular fragmentation pattern. Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Tested-by: NMel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 28 4月, 2008 3 次提交
-
-
由 Mel Gorman 提交于
Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Currently a node has two sets of zonelists, one for each zone type in the system and a second set for GFP_THISNODE allocations. Based on the zones allowed by a gfp mask, one of these zonelists is selected. All of these zonelists consume memory and occupy cache lines. This patch replaces the multiple zonelists per-node with two zonelists. The first contains all populated zones in the system, ordered by distance, for fallback allocations when the target/preferred node has no free pages. The second contains all populated zones in the node suitable for GFP_THISNODE allocations. An iterator macro is introduced called for_each_zone_zonelist() that interates through each zone allowed by the GFP flags in the selected zonelist. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The following patches replace multiple zonelists per node with two zonelists that are filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, the MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered as an alternative fix for the MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters only custom zonelists. The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist. The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask which simplifies patches later in the set. The third patch defines/remembers the "preferred zone" for numa statistics, as it is no longer always the first zone in a zonelist. The forth patch replaces multiple zonelists with two zonelists that are filtered. The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE. The fifth patch introduces helper macros for retrieving the zone and node indices of entries in a zonelist. The final patch introduces filtering of the zonelists based on a nodemask. Two zonelists exist per node, one for normal allocations and one for __GFP_THISNODE. Performance results varied depending on the machine configuration. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists. These are the range of performance losses/gains when running against 2.6.24-rc4-mm1. The set and these machines are a mix of i386, x86_64 and ppc64 both NUMA and non-NUMA. loss to gain Total CPU time on Kernbench: -0.86% to 1.13% Elapsed time on Kernbench: -0.79% to 0.76% page_test from aim9: -4.37% to 0.79% brk_test from aim9: -0.71% to 4.07% fork_test from aim9: -1.84% to 4.60% exec_test from aim9: -0.71% to 1.08% This patch: The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 4月, 2008 1 次提交
-
-
由 Mike Travis 提交于
* Use new node_to_cpumask_ptr. This creates a pointer to the cpumask for a given node. This definition is in mm patch: asm-generic-add-node_to_cpumask_ptr-macro.patch * Use new set_cpus_allowed_ptr function. Depends on: [mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch [sched-devel]: sched: add new set_cpus_allowed_ptr function [x86/latest]: x86: add cpus_scnprintf function Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Greg Banks <gnb@melbourne.sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: NMike Travis <travis@sgi.com> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 25 3月, 2008 1 次提交
-
-
由 Andrew Morton 提交于
Revert commit f1a9ee75: Author: Rik van Riel <riel@redhat.com> Date: Thu Feb 7 00:14:08 2008 -0800 kswapd should only wait on IO if there is IO The current kswapd (and try_to_free_pages) code has an oddity where the code will wait on IO, even if there is no IO in flight. This problem is notable especially when the system scans through many unfreeable pages, causing unnecessary stalls in the VM. Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path will sleep if a significant number of pages are encountered that should be written out. This gives kswapd a chance to write out those pages, while the direct reclaim task sleeps. Signed-off-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Because of large latencies and interactivity problems reported by Carlos, here: http://lkml.org/lkml/2008/3/22/211 Cc: Rik van Riel <riel@redhat.com> Cc: "Carlos R. Mafra" <crmafra2@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 3月, 2008 2 次提交
-
-
由 Hugh Dickins 提交于
Each caller of mem_cgroup_move_lists is having to use page_get_page_cgroup: it's more convenient if it acts upon the page itself not the page_cgroup; and in a later patch this becomes important to handle within memcontrol.c. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Balbir Singh 提交于
Rename Memory Controller to Memory Resource Controller. Reflect the same changes in the CONFIG definition for the Memory Resource Controller. Group together the config options for Resource Counters and Memory Resource Controller. Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 2月, 2008 7 次提交
-
-
由 KAMEZAWA Hiroyuki 提交于
per-zone and reclaim enhancements for memory controller: modifies vmscan.c for isolate globa/cgroup lru activity When using memory controller, there are 2 levels of memory reclaim. 1. zone memory reclaim because of system/zone memory shortage. 2. memory cgroup memory reclaim because of hitting limit. These two can be distinguished by sc->mem_cgroup parameter. (scan_global_lru() macro) This patch tries to make memory cgroup reclaim routine avoid affecting system/zone memory reclaim. This patch inserts if (scan_global_lru()) and hook to memory_cgroup reclaim support functions. This patch can be a help for isolating system lru activity and group lru activity and shows what additional functions are necessary. * mem_cgroup_calc_mapped_ratio() ... calculate mapped ratio for cgroup. * mem_cgroup_reclaim_imbalance() ... calculate active/inactive balance in cgroup. * mem_cgroup_calc_reclaim_active() ... calculate the number of active pages to be scanned in this priority in mem_cgroup. * mem_cgroup_calc_reclaim_inactive() ... calculate the number of inactive pages to be scanned in this priority in mem_cgroup. * mem_cgroup_all_unreclaimable() .. checks cgroup's page is all unreclaimable or not. * mem_cgroup_get_reclaim_priority() ... * mem_cgroup_note_reclaim_priority() ... record reclaim priority (temporal) * mem_cgroup_remember_reclaim_priority() .... record reclaim priority as zone->prev_priority. This value is used for calc reclaim_mapped. [akpm@linux-foundation.org: fix unused var warning] Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
This is used to detect which scan_control scans global lru or mem_cgroup lru. And compiled to be static value (1) when memory controller is not configured. This may make the meaning obvious. Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Because NODE_DATA(node)->node_zonelists[] is guaranteed to contain all necessary zones, it is not necessary to use for_each_online_node. And this for_each_online_node() makes reclaim routine start always from node 0. This is not good. This patch makes reclaim start from caller's node and just use usual (default) zonelist order. [akpm@linux-foundation.org: fix warning] Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rik van Riel 提交于
The current kswapd (and try_to_free_pages) code has an oddity where the code will wait on IO, even if there is no IO in flight. This problem is notable especially when the system scans through many unfreeable pages, causing unnecessary stalls in the VM. Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path will sleep if a significant number of pages are encountered that should be written out. This gives kswapd a chance to write out those pages, while the direct reclaim task sleeps. Signed-off-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Balbir Singh 提交于
Nick Piggin pointed out that swap cache and page cache addition routines could be called from non GFP_KERNEL contexts. This patch makes the charging routine aware of the gfp context. Charging might fail if the cgroup is over it's limit, in which case a suitable error is returned. This patch was tested on a Powerpc box. I am still looking at being able to test the path, through which allocations happen in non GFP_KERNEL contexts. [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE] Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Balbir Singh 提交于
Make page_referenced() cgroup aware. Without this patch, page_referenced() can cause a page to be skipped while reclaiming pages. This patch ensures that other cgroups do not hold pages in a particular cgroup hostage. It is required to ensure that shared pages are freed from a cgroup when they are not actively referenced from the cgroup that brought them in Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Balbir Singh 提交于
Add the page_cgroup to the per cgroup LRU. The reclaim algorithm has been modified to make the isolate_lru_pages() as a pluggable component. The scan_control data structure now accepts the cgroup on behalf of which reclaims are carried out. try_to_free_pages() has been extended to become cgroup aware. [akpm@linux-foundation.org: fix warning] [Lee.Schermerhorn@hp.com: initialize all scan_control's isolate_pages member] [bunk@kernel.org: make do_try_to_free_pages() static] [hugh@veritas.com: memcgroup: fix try_to_free order] [kamezawa.hiroyu@jp.fujitsu.com: this unlock_page_cgroup() is unnecessary] Signed-off-by: NPavel Emelianov <xemul@openvz.org> Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 10月, 2007 1 次提交
-
-
由 Simon Arlott 提交于
Spelling fixes in mm/. Signed-off-by: NSimon Arlott <simon@fire.lp0.eu> Signed-off-by: NAdrian Bunk <bunk@kernel.org>
-
- 19 10月, 2007 1 次提交
-
-
由 Stephen Hemminger 提交于
Get rid of sparse related warnings from places that use integer as NULL pointer. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NStephen Hemminger <shemminger@linux-foundation.org> Cc: Andi Kleen <ak@suse.de> Cc: Jeff Garzik <jeff@garzik.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Ian Kent <raven@themaw.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 10月, 2007 7 次提交
-
-
由 David Rientjes 提交于
Introduces new zone flag interface for testing and setting flags: int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag) Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone() is called, this flag is test and set before starting zone reclaim. Zone reclaim starts in __alloc_pages() when a zone's watermark fails and the system is in zone_reclaim_mode. If it's already in reclaim, there's no need to start again so it is simply considered full for that allocation attempt. There is a change of behavior with regard to concurrent zone shrinking. It is now possible for try_to_free_pages() or kswapd to already be shrinking a particular zone when __alloc_pages() starts zone reclaim. In this case, it is possible for two concurrent threads to invoke shrink_zone() for a single zone. This change forbids a zone to be in zone reclaim twice, which was always the behavior, but allows for concurrent try_to_free_pages() or kswapd shrinking when starting zone reclaim. Cc: Andrea Arcangeli <andrea@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
Convert the int all_unreclaimable member of struct zone to unsigned long flags. This can now be used to specify several different zone flags such as all_unreclaimable and reclaim_in_progress, which can now be removed and converted to a per-zone flag. Flags are set and cleared as follows: zone_set_flag(struct zone *zone, zone_flags_t flag) zone_clear_flag(struct zone *zone, zone_flags_t flag) Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED, which have the same semantics as the old zone->all_unreclaimable and zone->reclaim_in_progress, respectively. Also converts all current users that set or clear either flag to use the new interface. Helper functions are defined to test the flags: int zone_is_all_unreclaimable(const struct zone *zone) int zone_is_reclaim_locked(const struct zone *zone) All flag operators are of the atomic variety because there are currently readers that are implemented that do not take zone->lock. [akpm@linux-foundation.org: add needed include] Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Swappiness isn't a safe sysctl. Setting it to 0 for example can hang a system. That's a corner case but even setting it to 10 or lower can waste enormous amounts of cpu without making much progress. We've customers who wants to use swappiness but they can't because of the current implementation (if you change it so the system stops swapping it really stops swapping and nothing works sane anymore if you really had to swap something to make progress). This patch from Kurt Garloff makes swappiness safer to use (no more huge cpu usage or hangs with low swappiness values). I think the prev_priority can also be nuked since it wastes 4 bytes per zone (that would be an incremental patch but I wait the nr_scan_[in]active to be nuked first for similar reasons). Clearly somebody at some point noticed how broken that thing was and they had to add min(priority, prev_priority) to give it some reliability, but they didn't go the last mile to nuke prev_priority too. Calculating distress only in function of not-racy priority is correct and sure more than enough without having to add randomness into the equation. Patch is tested on older kernels but it compiles and it's quite simple so... Overall I'm not very satisified by the swappiness tweak, since it doesn't rally do anything with the dirty pagecache that may be inactive. We need another kind of tweak that controls the inactive scan and tunes the can_writepage feature (not yet in mainline despite having submitted it a few times), not only the active one. That new tweak will tell the kernel how hard to scan the inactive list for pure clean pagecache (something the mainline kernel isn't capable of yet). We already have that feature working in all our enterprise kernels with the default reasonable tune, or they can't even run a readonly backup with tar without triggering huge write I/O. I think it should be available also in mainline later. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: NKurt Garloff <garloff@suse.de> Signed-off-by: NAndrea Arcangeli <andrea@suse.de> Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yasunori Goto 提交于
When a cpu is onlined on memory-less-node box, kernel panics due to touch NULL pointer of pgdat->kswapd. Current kswapd runs only nodes which have memory. So, calling of set_cpus_allowed() is not necessary for memory-less node. This is fix for it. Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
We need the check for a node with cpu in zone reclaim. Zone reclaim will not allow remote zone reclaim if a node has a cpu. [Lee.Schermerhorn@hp.com: Move setup of N_CPU node state mask] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
A node without memory does not need a kswapd. So use the memory map instead of the online map when starting kswapd. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NNishanth Aravamudan <nacc@us.ibm.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rik van Riel 提交于
The current VM can get itself into trouble fairly easily on systems with a small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory. On one side, page_alloc() will allocate down to zone->pages_low, while on the other side, kswapd() and balance_pgdat() will try to free memory from every zone, until every zone has more free pages than zone->pages_high. Highmem can be filled up to zone->pages_low with page tables, ramfs, vmalloc allocations and other unswappable things quite easily and without many bad side effects, since we still have a huge ZONE_NORMAL to do future allocations from. However, as long as the number of free pages in the highmem zone is below zone->pages_high, kswapd will continue swapping things out from ZONE_NORMAL, too! Sami Farin managed to get his system into a stage where kswapd had freed about 700MB of low memory and was still "going strong". The attached patch will make kswapd stop paging out data from zones when there is more than enough memory free. We do go above zone->pages_high in order to keep pressure between zones equal in normal circumstances, but the patch should prevent the kind of excesses that made Sami's computer totally unusable. Signed-off-by: NRik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 8月, 2007 2 次提交
-
-
由 Andy Whitcroft 提交于
Lumpy reclaim works by selecting a lead page from the LRU list and then selecting pages for reclaim from the order-aligned area of pages. In the situation were all pages in that region are inactive and not referenced by any process over time, it works well. In the situation where there is even light load on the system, the pages may not free quickly. Out of a area of 1024 pages, maybe only 950 of them are freed when the allocation attempt occurs because lumpy reclaim returned early. This patch alters the behaviour of direct reclaim for large contiguous blocks. The first attempt to call shrink_page_list() is asynchronous but if it fails, the pages are submitted a second time and the calling process waits for the IO to complete. This may stall allocators waiting for contiguous memory but that should be expected behaviour for high-order users. It is preferable behaviour to potentially queueing unnecessary areas for IO. Note that kswapd will not stall in this fashion. [apw@shadowen.org: update to version 2] [apw@shadowen.org: update to version 3] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Whitcroft 提交于
As pointed out by Mel when reclaim is applied at higher orders a significant amount of IO may be started. As this takes finite time to drain reclaim will consider more areas than ultimatly needed to satisfy the request. This leads to more reclaim than strictly required and reduced success rates. I was able to confirm Mel's test results on systems locally. These show that even under light load the success rates drop off far more than expected. Testing with a modified version of his patch (which follows) I was able to allocate almost all of ZONE_MOVABLE with a near idle system. I ran 5 test passes sequentially following system boot (the system has 29 hugepages in ZONE_MOVABLE): 2.6.23-rc1 11 8 6 7 7 sync_lumpy 28 28 29 29 26 These show that although hugely better than the near 0% success normally expected we can only allocate about a 1/4 of the zone. Using synchronous reclaim for these allocations we get close to 100% as expected. I have also run our standard high order tests and these show no regressions in allocation success rates at rest, and some significant improvements under load. This patch: We are transitioning pages from active to inactive in clear_active_flags, those need counting as PGDEACTIVATE vm events. Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 7月, 2007 3 次提交
-
-
由 Rafael J. Wysocki 提交于
Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl> Acked-by: NNigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rusty Russell 提交于
I can never remember what the function to register to receive VM pressure is called. I have to trace down from __alloc_pages() to find it. It's called "set_shrinker()", and it needs Your Help. 1) Don't hide struct shrinker. It contains no magic. 2) Don't allocate "struct shrinker". It's not helpful. 3) Call them "register_shrinker" and "unregister_shrinker". 4) Call the function "shrink" not "shrinker". 5) Reduce the 17 lines of waffly comments to 13, but document it properly. Signed-off-by: NRusty Russell <rusty@rustcorp.com.au> Cc: David Chinner <dgc@sgi.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Whitcroft 提交于
When we are out of memory of a suitable size we enter reclaim. The current reclaim algorithm targets pages in LRU order, which is great for fairness at order-0 but highly unsuitable if you desire pages at higher orders. To get pages of higher order we must shoot down a very high proportion of memory; >95% in a lot of cases. This patch set adds a lumpy reclaim algorithm to the allocator. It targets groups of pages at the specified order anchored at the end of the active and inactive lists. This encourages groups of pages at the requested orders to move from active to inactive, and active to free lists. This behaviour is only triggered out of direct reclaim when higher order pages have been requested. This patch set is particularly effective when utilised with an anti-fragmentation scheme which groups pages of similar reclaimability together. This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms the foundation. Credit to Mel Gorman for sanitity checking. Mel said: The patches have an application with hugepage pool resizing. When lumpy-reclaim is used used with ZONE_MOVABLE, the hugepages pool can be resized with greater reliability. Testing on a desktop machine with 2GB of RAM showed that growing the hugepage pool with ZONE_MOVABLE on it's own was very slow as the success rate was quite low. Without lumpy-reclaim, each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages. With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical. [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup] [bunk@stusta.de: static declarations for internal functions] [a.p.zijlstra@chello.nl: initial lumpy V2 implementation] Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: Bob Picco <bob.picco@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 5月, 2007 1 次提交
-
-
由 Rafael J. Wysocki 提交于
Since nonboot CPUs are now disabled after tasks and devices have been frozen and the CPU hotplug infrastructure is used for this purpose, we need special CPU hotplug notifications that will help the CPU-hotplug-aware subsystems distinguish normal CPU hotplug events from CPU hotplug events related to a system-wide suspend or resume operation in progress. This patch introduces such notifications and causes them to be used during suspend and resume transitions. It also changes all of the CPU-hotplug-aware subsystems to take these notifications into consideration (for now they are handled in the same way as the corresponding "normal" ones). [oleg@tv-sign.ru: cleanups] Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 5月, 2007 1 次提交
-
-
由 Guillaume Chazarain 提交于
Cleanup: setting an outstanding error on a mapping was open coded too many times. Factor it out in mapping_set_error(). Signed-off-by: NGuillaume Chazarain <guichaz@yahoo.fr> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 5月, 2007 1 次提交
-
-
由 Rafael J. Wysocki 提交于
Currently we can miss freeze_process()->signal_wake_up() in kswapd() if it happens between try_to_freeze() and prepare_to_wait(). To prevent this from happening we should check freezing(current) before calling schedule(). Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 3月, 2007 1 次提交
-
-
由 Andrew Morton 提交于
throttle_vm_writeout() is designed to wait for the dirty levels to subside. But if the caller holds IO or FS locks, we might be holding up that writeout. So change it to take a single nap to give other devices a chance to clean some memory, then return. Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 2月, 2007 1 次提交
-
-
由 Christoph Lameter 提交于
The determination of the dirty ratio to determine writeback behavior is currently based on the number of total pages on the system. However, not all pages in the system may be dirtied. Thus the ratio is always too low and can never reach 100%. The ratio may be particularly skewed if large hugepage allocations, slab allocations or device driver buffers make large sections of memory not available anymore. In that case we may get into a situation in which f.e. the background writeback ratio of 40% cannot be reached anymore which leads to undesired writeback behavior. This patchset fixes that issue by determining the ratio based on the actual pages that may potentially be dirty. These are the pages on the active and the inactive list plus free pages. The problem with those counts has so far been that it is expensive to calculate these because counts from multiple nodes and multiple zones will have to be summed up. This patchset makes these counters ZVC counters. This means that a current sum per zone, per node and for the whole system is always available via global variables and not expensive anymore to calculate. The patchset results in some other good side effects: - Removal of the various functions that sum up free, active and inactive page counts - Cleanup of the functions that display information via the proc filesystem. This patch: The use of a ZVC for nr_inactive and nr_active allows a simplification of some counter operations. More ZVC functionality is used for sums etc in the following patches. [akpm@osdl.org: UP build fix] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 1月, 2007 1 次提交
-
-
由 Andrew Morton 提交于
At the end of shrink_all_memory() we forget to recalculate lru_pages: it can be zero. Fix that up, and add a helper function for this operation too. Also, recalculate lru_pages each time around the inner loop to get the balancing correct. Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 31 12月, 2006 1 次提交
-
-
由 Shantanu Goel 提交于
Fix a rather obvious buglet. Noticed while instrumenting the VM using /proc/vmstat. Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 23 12月, 2006 1 次提交
-
-
由 Nigel Cunningham 提交于
The version of mm/vmscan.c in Linus' current tree has swapped parameters in the shrink_all_zones declaration and call, used by the various suspend-to-disk implementations. This doesn't seem to have any great adverse effect, but it's clearly wrong. Signed-off-by: NNigel Cunningham <nigel@suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 14 12月, 2006 1 次提交
-
-
由 Paul Jackson 提交于
Elaborate the API for calling cpuset_zone_allowed(), so that users have to explicitly choose between the two variants: cpuset_zone_allowed_hardwall() cpuset_zone_allowed_softwall() Until now, whether or not you got the hardwall flavor depended solely on whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask argument. If you didn't specify __GFP_HARDWALL, you implicitly got the softwall version. Unfortunately, this meant that users would end up with the softwall version without thinking about it. Since only the softwall version might sleep, this led to bugs with possible sleeping in interrupt context on more than one occassion. The hardwall version requires that the current tasks mems_allowed allows the node of the specified zone (or that you're in interrupt or that __GFP_THISNODE is set or that you're on a one cpuset system.) The softwall version, depending on the gfp_mask, might allow a node if it was allowed in the nearest enclusing cpuset marked mem_exclusive (which requires taking the cpuset lock 'callback_mutex' to evaluate.) This patch removes the cpuset_zone_allowed() call, and forces the caller to explicitly choose between the hardwall and the softwall case. If the caller wants the gfp_mask to determine this choice, they should (1) be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the cpuset_zone_allowed_softwall() routine. This adds another 100 or 200 bytes to the kernel text space, due to the few lines of nearly duplicate code at the top of both cpuset_zone_allowed_* routines. It should save a few instructions executed for the calls that turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to set (before the call) then check (within the call) the __GFP_HARDWALL flag. For the most critical call, from get_page_from_freelist(), the same instructions are executed as before -- the old cpuset_zone_allowed() routine it used to call is the same code as the cpuset_zone_allowed_softwall() routine that it calls now. Not a perfect win, but seems worth it, to reduce this chance of hitting a sleeping with irq off complaint again. Signed-off-by: NPaul Jackson <pj@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 08 12月, 2006 2 次提交
-
-
由 Ingo Molnar 提交于
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn, prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus generating compiler warnings of unused symbols, hence forcing people to add #ifdefs. the compiler can skip truly unused functions just fine: text data bss dec hex filename 1624412 728710 3674856 6027978 5bfaca vmlinux.before 1624412 728710 3674856 6027978 5bfaca vmlinux.after [akpm@osdl.org: topology.c fix] Signed-off-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Nigel Cunningham 提交于
Move process freezing functions from include/linux/sched.h to freezer.h, so that modifications to the freezer or the kernel configuration don't require recompiling just about everything. [akpm@osdl.org: fix ueagle driver] Signed-off-by: NNigel Cunningham <nigel@suspend2.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-