- 17 10月, 2007 5 次提交
-
-
由 Lee Schermerhorn 提交于
Here's a cut at fixing up uses of the online node map in generic code. mm/shmem.c:shmem_parse_mpol() Ensure nodelist is subset of nodes with memory. Use node_states[N_HIGH_MEMORY] as default for missing nodelist for interleave policy. mm/shmem.c:shmem_fill_super() initialize policy_nodes to node_states[N_HIGH_MEMORY] mm/page-writeback.c:highmem_dirtyable_memory() sum over nodes with memory mm/page_alloc.c:zlc_setup() allowednodes - use nodes with memory. mm/page_alloc.c:default_zonelist_order() average over nodes with memory. mm/page_alloc.c:find_next_best_node() skip nodes w/o memory. N_HIGH_MEMORY state mask may not be initialized at this time, unless we want to depend on early_calculate_totalpages() [see below]. Will ZONE_MOVABLE ever be configurable? mm/page_alloc.c:find_zone_movable_pfns_for_nodes() spread kernelcore over nodes with memory. This required calling early_calculate_totalpages() unconditionally, and populating N_HIGH_MEMORY node state therein from nodes in the early_node_map[]. If we can depend on this, we can eliminate the population of N_HIGH_MEMORY mask from __build_all_zonelists() and use the N_HIGH_MEMORY mask in find_next_best_node(). mm/mempolicy.c:mpol_check_policy() Ensure nodes specified for policy are subset of nodes with memory. [akpm@linux-foundation.org: fix warnings] Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NChristoph Lameter <clameter@sgi.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
Online nodes now may have no memory. The checks and initialization must therefore be changed to no longer use the online functions. This will correctly initialize the interleave on bootup to only target nodes with memory and will make sys_move_pages return an error when a page is to be moved to a memoryless node. Similarly we will get an error if MPOL_BIND and MPOL_INTERLEAVE is used on a memoryless node. These are somewhat new semantics. So far one could specify memoryless nodes and we would maybe do the right thing and just ignore the node (or we'd do something strange like with MPOL_INTERLEAVE). If we want to allow the specification of memoryless nodes via memory policies then we need to keep checking for online nodes. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NNishanth Aravamudan <nacc@us.ibm.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
MPOL_INTERLEAVE currently simply loops over all nodes. Allocations on memoryless nodes will be redirected to nodes with memory. This results in an imbalance because the neighboring nodes to memoryless nodes will get significantly more interleave hits that the rest of the nodes on the system. We can avoid this imbalance by clearing the nodes in the interleave node set that have no memory. If we use the node map of the memory nodes instead of the online nodes then we have only the nodes we want. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lee Schermerhorn 提交于
Allow an application to query the memories allowed by its context. Updated numa_memory_policy.txt to mention that applications can use this to obtain allowed memories for constructing valid policies. TODO: update out-of-tree libnuma wrapper[s], or maybe add a new wrapper--e.g., numa_get_mems_allowed() ? Also, update numa syscall man pages. Tested with memtoy V>=0.13. Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NChristoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jesper Juhl 提交于
This patch cleans up duplicate includes in mm/ Signed-off-by: NJesper Juhl <jesper.juhl@gmail.com> Acked-by: NPaul Mundt <lethal@linux-sh.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 9月, 2007 1 次提交
-
-
由 Lee Schermerhorn 提交于
This patch proposes fixes to the reference counting of memory policy in the page allocation paths and in show_numa_map(). Extracted from my "Memory Policy Cleanups and Enhancements" series as stand-alone. Shared policy lookup [shmem] has always added a reference to the policy, but this was never unrefed after page allocation or after formatting the numa map data. Default system policy should not require additional ref counting, nor should the current task's task policy. However, show_numa_map() calls get_vma_policy() to examine what may be [likely is] another task's policy. The latter case needs protection against freeing of the policy. This patch adds a reference count to a mempolicy returned by get_vma_policy() when the policy is a vma policy or another task's mempolicy. Again, shared policy is already reference counted on lookup. A matching "unref" [__mpol_free()] is performed in alloc_page_vma() for shared and vma policies, and in show_numa_map() for shared and another task's mempolicy. We can call __mpol_free() directly, saving an admittedly inexpensive inline NULL test, because we know we have a non-NULL policy. Handling policy ref counts for hugepages is a bit trickier. huge_zonelist() returns a zone list that might come from a shared or vma 'BIND policy. In this case, we should hold the reference until after the huge page allocation in dequeue_hugepage(). The patch modifies huge_zonelist() to return a pointer to the mempolicy if it needs to be unref'd after allocation. Kernel Build [16cpu, 32GB, ia64] - average of 10 runs: w/o patch w/ refcount patch Avg Std Devn Avg Std Devn Real: 100.59 0.38 100.63 0.43 User: 1209.60 0.37 1209.91 0.31 System: 81.52 0.42 81.64 0.34 Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NAndi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 8月, 2007 1 次提交
-
-
由 Christoph Lameter 提交于
Page migration currently does not check if the target of the move contains nodes that that are invalid (if root attempts to migrate pages) and may try to allocate from invalid nodes if these are specified leading to oopses. Return -EINVAL if an offline node is specified. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 8月, 2007 1 次提交
-
-
由 Mel Gorman 提交于
The NUMA layer only supports NUMA policies for the highest zone. When ZONE_MOVABLE is configured with kernelcore=, the the highest zone becomes ZONE_MOVABLE. The result is that policies are only applied to allocations like anonymous pages and page cache allocated from ZONE_MOVABLE when the zone is used. This patch applies policies to the two highest zones when the highest zone is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real" zone, it's always functionally equivalent. The patch has been tested on a variety of machines both NUMA and non-NUMA covering x86, x86_64 and ppc64. No abnormal results were seen in kernbench, tbench, dbench or hackbench. It passes regression tests from the numactl package with and without kernelcore= once numactl tests are patched to wait for vmstat counters to update. akpm: this is the nasty hack to fix NUMA mempolicies in the presence of ZONE_MOVABLE and kernelcore= in 2.6.23. Christoph says "For .24 either merge the mobility or get the other solution that Mel is working on. That solution would only use a single zonelist per node and filter on the fly. That may help performance and also help to make memory policies work better." Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NChristoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 7月, 2007 1 次提交
-
-
由 Paul Mundt 提交于
Slab destructors were no longer supported after Christoph's c59def9f change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
-
- 18 7月, 2007 2 次提交
-
-
由 Mel Gorman 提交于
Huge pages are not movable so are not allocated from ZONE_MOVABLE. However, as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it can be used to satisfy hugepage allocations even when the system has been running a long time. This allows an administrator to resize the hugepage pool at runtime depending on the size of ZONE_MOVABLE. This patch adds a new sysctl called hugepages_treat_as_movable. When a non-zero value is written to it, future allocations for the huge page pool will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. [akpm@linux-foundation.org: various fixes] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
It is often known at allocation time whether a page may be migrated or not. This patch adds a flag called __GFP_MOVABLE and a new mask called GFP_HIGH_MOVABLE. Allocations using the __GFP_MOVABLE can be either migrated using the page migration mechanism or reclaimed by syncing with backing storage and discarding. An API function very similar to alloc_zeroed_user_highpage() is added for __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The flags used by alloc_zeroed_user_highpage() are not changed because it would change the semantics of an existing API. After this patch is applied there are no in-kernel users of alloc_zeroed_user_highpage() so it probably should be marked deprecated if this patch is merged. Note that this patch includes a minor cleanup to the use of __GFP_ZERO in shmem.c to keep all flag modifications to inode->mapping in the shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of Hugh Dickens. Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the concept. Credit to Hugh Dickens for catching issues with shmem swap vector and ramfs allocations. [akpm@linux-foundation.org: build fix] [hugh@veritas.com: __GFP_ZERO cleanup] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 7月, 2007 2 次提交
-
-
由 Paul Mundt 提交于
Enabling debugging fails to build due to the nodemask variable in do_mbind() having changed names, and then oopses on boot due to the assumption that the nodemask can be dereferenced -- which doesn't work out so well when the policy is changed to MPOL_DEFAULT with a NULL nodemask by numa_default_policy(). This fixes it up, and switches from PDprintk() to pr_debug() while we're at it. Signed-off-by: NPaul Mundt <lethal@linux-sh.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Paul Mundt 提交于
This converts the default system init memory policy to use a dynamically created node map instead of defaulting to all online nodes. Nodes of a certain size (>= 16MB) are judged to be suitable for interleave, and are added to the map. If all nodes are smaller in size, the largest one is automatically selected. Without this, tiny nodes find themselves out of memory before we even make it to userspace. Systems with large nodes will notice no change. Only the system init policy is effected by this change, the regular MPOL_DEFAULT policy is still switched to later on in the boot process as normal. Signed-off-by: NPaul Mundt <lethal@linux-sh.org> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 3月, 2007 1 次提交
-
-
由 Christoph Lameter 提交于
Currently we do not check for vma flags if sys_move_pages is called to move individual pages. If sys_migrate_pages is called to move pages then we check for vm_flags that indicate a non migratable vma but that still includes VM_LOCKED and we can migrate mlocked pages. Extract the vma_migratable check from mm/mempolicy.c, fix it and put it into migrate.h so that is can be used from both locations. Problem was spotted by Lee Schermerhorn Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 2月, 2007 1 次提交
-
-
由 KAMEZAWA Hiroyuki 提交于
bind_zonelist() can create zero-length zonelist if there is a memory-less-node. This patch checks the length of zonelist. If length is 0, returns -EINVAL. tested on ia64/NUMA with memory-less-node. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NAndi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 2月, 2007 1 次提交
-
-
由 Christoph Lameter 提交于
This patchset follows up on the earlier work in Andrew's tree to reduce the number of zones. The patches allow to go to a minimum of 2 zones. This one allows also to make ZONE_DMA optional and therefore the number of zones can be reduced to one. ZONE_DMA is usually used for ISA DMA devices. There are a number of reasons why we would not want to have ZONE_DMA 1. Some arches do not need ZONE_DMA at all. 2. With the advent of IOMMUs DMA zones are no longer needed. The necessity of DMA zones may drastically be reduced in the future. This patchset allows a compilation of a kernel without that overhead. 3. Devices that require ISA DMA get rare these days. All my systems do not have any need for ISA DMA. 4. The presence of an additional zone unecessarily complicates VM operations because it must be scanned and balancing logic must operate on its. 5. With only ZONE_NORMAL one can reach the situation where we have only one zone. This will allow the unrolling of many loops in the VM and allows the optimization of varous code paths in the VM. 6. Having only a single zone in a NUMA system results in a 1-1 correspondence between nodes and zones. Various additional optimizations to critical VM paths become possible. Many systems today can operate just fine with a single zone. If you look at what is in ZONE_DMA then one usually sees that nothing uses it. The DMA slabs are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL will be empty instead). On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty. Why constantly look at an empty zone in /proc/zoneinfo and empty slab in /proc/slabinfo? Non i386 also frequently have no need for ZONE_DMA and zones stay empty. The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA). The RFC posted earlier (see http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots of #ifdefs in them. An effort has been made to minize the number of #ifdefs and make this as compact as possible. The job was made much easier by the ongoing efforts of others to extract common arch specific functionality. I have been running this for awhile now on my desktop and finally Linux is using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched: christoph@pentium940:~$ cat /proc/zoneinfo Node 0, zone Normal pages free 4435 min 1448 low 1810 high 2172 active 241786 inactive 210170 scanned 0 (a: 0 i: 0) spanned 524224 present 524224 nr_anon_pages 61680 nr_mapped 14271 nr_file_pages 390264 nr_slab_reclaimable 27564 nr_slab_unreclaimable 1793 nr_page_table_pages 449 nr_dirty 39 nr_writeback 0 nr_unstable 0 nr_bounce 0 cpu: 0 pcp: 0 count: 156 high: 186 batch: 31 cpu: 0 pcp: 1 count: 9 high: 62 batch: 15 vm stats threshold: 20 cpu: 1 pcp: 0 count: 177 high: 186 batch: 31 cpu: 1 pcp: 1 count: 12 high: 62 batch: 15 vm stats threshold: 20 all_unreclaimable: 0 prev_priority: 12 temp_priority: 12 start_pfn: 0 This patch: In two places in the VM we use ZONE_DMA to refer to the first zone. If ZONE_DMA is optional then other zones may be first. So simply replace ZONE_DMA with zone 0. This also fixes ZONETABLE_PGSHIFT. If we have only a single zone then ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone number related to a pgdat. However, we still need a zonetable to index all the zones for each node if this is a NUMA system. Therefore define ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags. [apw@shadowen.org: fix mismerge] Acked-by: NChristoph Hellwig <hch@infradead.org> Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: James Bottomley <James.Bottomley@steeleye.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 1月, 2007 1 次提交
-
-
由 Christoph Lameter 提交于
Currently one can specify an arbitrary node mask to mbind that includes nodes not allowed. If that is done with an interleave policy then we will go around all the nodes. Those outside of the currently allowed cpuset will be redirected to the border nodes. Interleave will then create imbalances at the borders of the cpuset. This patch restricts the nodes to the currently allowed cpuset. The RFC for this patch was discussed at http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 12月, 2006 1 次提交
-
-
由 Josef Sipek 提交于
Signed-off-by: NJosef Sipek <jsipek@fsl.cs.sunysb.edu> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 08 12月, 2006 4 次提交
-
-
由 Helge Deller 提交于
- move some file_operations structs into the .rodata section - move static strings from policy_types[] array into the .rodata section - fix generic seq_operations usages, so that those structs may be defined as "const" as well [akpm@osdl.org: couple of fixes] Signed-off-by: NHelge Deller <deller@gmx.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
SLAB_KERNEL is an alias of GFP_KERNEL. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Andy Whitcroft 提交于
NUMA node ids are passed as either int or unsigned int almost exclusivly page_to_nid and zone_to_nid both return unsigned long. This is a throw back to when page_to_nid was a #define and was thus exposing the real type of the page flags field. In addition to fixing up the definitions of page_to_nid and zone_to_nid I audited the users of these functions identifying the following incorrect uses: 1) mm/page_alloc.c show_node() -- printk dumping the node id, 2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison against numa_node_id() which returns an int from cpu_to_node(), and 3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which uses bit_set which in generic code takes an int. Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Paul Jackson 提交于
Optimize the critical zonelist scanning for free pages in the kernel memory allocator by caching the zones that were found to be full recently, and skipping them. Remembers the zones in a zonelist that were short of free memory in the last second. And it stashes a zone-to-node table in the zonelist struct, to optimize that conversion (minimize its cache footprint.) Recent changes: This differs in a significant way from a similar patch that I posted a week ago. Now, instead of having a nodemask_t of recently full nodes, I have a bitmask of recently full zones. This solves a problem that last weeks patch had, which on systems with multiple zones per node (such as DMA zone) would take seeing any of these zones full as meaning that all zones on that node were full. Also I changed names - from "zonelist faster" to "zonelist cache", as that seemed to better convey what we're doing here - caching some of the key zonelist state (for faster access.) See below for some performance benchmark results. After all that discussion with David on why I didn't need them, I went and got some ;). I wanted to verify that I had not hurt the normal case of memory allocation noticeably. At least for my one little microbenchmark, I found (1) the normal case wasn't affected, and (2) workloads that forced scanning across multiple nodes for memory improved up to 10% fewer System CPU cycles and lower elapsed clock time ('sys' and 'real'). Good. See details, below. I didn't have the logic in get_page_from_freelist() for various full nodes and zone reclaim failures correct. That should be fixed up now - notice the new goto labels zonelist_scan, this_zone_full, and try_next_zone, in get_page_from_freelist(). There are two reasons I persued this alternative, over some earlier proposals that would have focused on optimizing the fake numa emulation case by caching the last useful zone: 1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems) have seen real customer loads where the cost to scan the zonelist was a problem, due to many nodes being full of memory before we got to a node we could use. Or at least, I think we have. This was related to me by another engineer, based on experiences from some time past. So this is not guaranteed. Most likely, though. The following approach should help such real numa systems just as much as it helps fake numa systems, or any combination thereof. 2) The effort to distinguish fake from real numa, using node_distance, so that we could cache a fake numa node and optimize choosing it over equivalent distance fake nodes, while continuing to properly scan all real nodes in distance order, was going to require a nasty blob of zonelist and node distance munging. The following approach has no new dependency on node distances or zone sorting. See comment in the patch below for a description of what it actually does. Technical details of note (or controversy): - See the use of "zlc_active" and "did_zlc_setup" below, to delay adding any work for this new mechanism until we've looked at the first zone in zonelist. I figured the odds of the first zone having the memory we needed were high enough that we should just look there, first, then get fancy only if we need to keep looking. - Some odd hackery was needed to add items to struct zonelist, while not tripping up the custom zonelists built by the mm/mempolicy.c code for MPOL_BIND. My usual wordy comments below explain this. Search for "MPOL_BIND". - Some per-node data in the struct zonelist is now modified frequently, with no locking. Multiple CPU cores on a node could hit and mangle this data. The theory is that this is just performance hint data, and the memory allocator will work just fine despite any such mangling. The fields at risk are the struct 'zonelist_cache' fields 'fullzones' (a bitmask) and 'last_full_zap' (unsigned long jiffies). It should all be self correcting after at most a one second delay. - This still does a linear scan of the same lengths as before. All I've optimized is making the scan faster, not algorithmically shorter. It is now able to scan a compact array of 'unsigned short' in the case of many full nodes, so one cache line should cover quite a few nodes, rather than each node hitting another one or two new and distinct cache lines. - If both Andi and Nick don't find this too complicated, I will be (pleasantly) flabbergasted. - I removed the comment claiming we only use one cachline's worth of zonelist. We seem, at least in the fake numa case, to have put the lie to that claim. - I pay no attention to the various watermarks and such in this performance hint. A node could be marked full for one watermark, and then skipped over when searching for a page using a different watermark. I think that's actually quite ok, as it will tend to slightly increase the spreading of memory over other nodes, away from a memory stressed node. =============== Performance - some benchmark results and analysis: This benchmark runs a memory hog program that uses multiple threads to touch alot of memory as quickly as it can. Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of the total 96 GBytes on the system, and using 1, 19, 37, or 55 threads (on a 56 CPU system.) System, user and real (elapsed) timings were recorded for each run, shown in units of seconds, in the table below. Two kernels were tested - 2.6.18-mm3 and the same kernel with this zonelist caching patch added. The table also shows the percentage improvement the zonelist caching sys time is over (lower than) the stock *-mm kernel. number 2.6.18-mm3 zonelist-cache delta (< 0 good) percent GBs N ------------ -------------- ---------------- systime mem threads sys user real sys user real sys user real better 12 1 153 24 177 151 24 176 -2 0 -1 1% 12 19 99 22 8 99 22 8 0 0 0 0% 12 37 111 25 6 112 25 6 1 0 0 -0% 12 55 115 25 5 110 23 5 -5 -2 0 4% 38 1 502 74 576 497 73 570 -5 -1 -6 0% 38 19 426 78 48 373 76 39 -53 -2 -9 12% 38 37 544 83 36 547 82 36 3 -1 0 -0% 38 55 501 77 23 511 80 24 10 3 1 -1% 64 1 917 125 1042 890 124 1014 -27 -1 -28 2% 64 19 1118 138 119 965 141 103 -153 3 -16 13% 64 37 1202 151 94 1136 150 81 -66 -1 -13 5% 64 55 1118 141 61 1072 140 58 -46 -1 -3 4% 90 1 1342 177 1519 1275 174 1450 -67 -3 -69 4% 90 19 2392 199 192 2116 189 176 -276 -10 -16 11% 90 37 3313 238 175 2972 225 145 -341 -13 -30 10% 90 55 1948 210 104 1843 213 100 -105 3 -4 5% Notes: 1) This test ran a memory hog program that started a specified number N of threads, and had each thread allocate and touch 1/N'th of the total memory to be used in the test run in a single loop, writing a constant word to memory, one store every 4096 bytes. Watching this test during some earlier trial runs, I would see each of these threads sit down on one CPU and stay there, for the remainder of the pass, a different CPU for each thread. 2) The 'real' column is not comparable to the 'sys' or 'user' columns. The 'real' column is seconds wall clock time elapsed, from beginning to end of that test pass. The 'sys' and 'user' columns are total CPU seconds spent on that test pass. For a 19 thread test run, for example, the sum of 'sys' and 'user' could be up to 19 times the number of 'real' elapsed wall clock seconds. 3) Tests were run on a fresh, single-user boot, to minimize the amount of memory already in use at the start of the test, and to minimize the amount of background activity that might interfere. 4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM. 5) Notice that the 'real' time gets large for the single thread runs, even though the measured 'sys' and 'user' times are modest. I'm not sure what that means - probably something to do with it being slow for one thread to be accessing memory along ways away. Perhaps the fake numa system, running ostensibly the same workload, would not show this substantial degradation of 'real' time for one thread on many nodes -- lets hope not. 6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs) ran quite efficiently, as one might expect. Each pair of threads needed to allocate and touch the memory on the node the two threads shared, a pleasantly parallizable workload. 7) The intermediate thread count passes, when asking for alot of memory forcing them to go to a few neighboring nodes, improved the most with this zonelist caching patch. Conclusions: * This zonelist cache patch probably makes little difference one way or the other for most workloads on real numa hardware, if those workloads avoid heavy off node allocations. * For memory intensive workloads requiring substantial off-node allocations on real numa hardware, this patch improves both kernel and elapsed timings up to ten per-cent. * For fake numa systems, I'm optimistic, but will have to leave that up to Rohit Seth to actually test (once I get him a 2.6.18 backport.) Signed-off-by: NPaul Jackson <pj@sgi.com> Cc: Rohit Seth <rohitseth@google.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: David Rientjes <rientjes@cs.washington.edu> Cc: Paul Menage <menage@google.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 12 10月, 2006 1 次提交
-
-
由 Keith Owens 提交于
With CONFIG_MIGRATION=n mm/mempolicy.c: In function 'do_mbind': mm/mempolicy.c:796: warning: passing argument 2 of 'migrate_pages' from incompatible pointer type Signed-off-by: NKeith Owens <kaos@ocs.com.au> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 01 10月, 2006 1 次提交
-
-
由 Alexey Dobriyan 提交于
Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 27 9月, 2006 1 次提交
-
-
由 Christoph Lameter 提交于
This patch insures that the slab node lists in the NUMA case only contain slabs that belong to that specific node. All slab allocations use GFP_THISNODE when calling into the page allocator. If an allocation fails then we fall back in the slab allocator according to the zonelists appropriate for a certain context. This allows a replication of the behavior of alloc_pages and alloc_pages node in the slab layer. Currently allocations requested from the page allocator may be redirected via cpusets to other nodes. This results in remote pages on nodelists and that in turn results in interrupt latency issues during cache draining. Plus the slab is handing out memory as local when it is really remote. Fallback for slab memory allocations will occur within the slab allocator and not in the page allocator. This is necessary in order to be able to use the existing pools of objects on the nodes that we fall back to before adding more pages to a slab. The fallback function insures that the nodes we fall back to obey cpuset restrictions of the current context. We do not allocate objects from outside of the current cpuset context like before. Note that the implementation of locality constraints within the slab allocator requires importing logic from the page allocator. This is a mischmash that is not that great. Other allocators (uncached allocator, vmalloc, huge pages) face similar problems and have similar minimal reimplementations of the basic fallback logic of the page allocator. There is another way of implementing a slab by avoiding per node lists (see modular slab) but this wont work within the existing slab. V1->V2: - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another #ifdef [akpm@osdl.org: build fix] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 26 9月, 2006 5 次提交
-
-
由 Christoph Lameter 提交于
There are many places where we need to determine the node of a zone. Currently we use a difficult to read sequence of pointer dereferencing. Put that into an inline function and use throughout VM. Maybe we can find a way to optimize the lookup in the future. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
[PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore cpuset/memory policy restrictions Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This flag is essential if a kernel component requires memory to be located on a certain node. It will be needed for alloc_pages_node() to force allocation on the indicated node and for alloc_pages() to force allocation on the current node. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
I wonder why we need this bitmask indexing into zone->node_zonelists[]? We always start with the highest zone and then include all lower zones if we build zonelists. Are there really cases where we need allocation from ZONE_DMA or ZONE_HIGHMEM but not ZONE_NORMAL? It seems that the current implementation of highest_zone() makes that already impossible. If we go linear on the index then gfp_zone() == highest_zone() and a lot of definitions fall by the wayside. We can now revert back to the use of gfp_zone() in mempolicy.c ;-) Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
After we have done this we can now do some typing cleanup. The memory policy layer keeps a policy_zone that specifies the zone that gets memory policies applied. This variable can now be of type enum zone_type. The check_highest_zone function and the build_zonelists funnctionm must then also take a enum zone_type parameter. Plus there are a number of loops over zones that also should use zone_type. We run into some troubles at some points with functions that need a zone_type variable to become -1. Fix that up. [pj@sgi.com: fix set_mempolicy() crash] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NPaul Jackson <pj@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
There is a check in zonelist_policy that compares pieces of the bitmap obtained from a gfp mask via GFP_ZONETYPES with a zone number in function zonelist_policy(). The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM. The policy_zone is a zone number with the possible values of ZONE_DMA, ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains of values. For some reason seemed to work before the zone reduction patchset (It definitely works on SGI boxes since we just have one zone and the check cannot fail). With the zone reduction patchset this check definitely fails on systems with two zones if the system actually has memory in both zones. This is because ZONE_NORMAL is selected using no __GFP flag at all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA is set. __GFP_DMA is 0x01. So gfp_zone(gfpmask) == 1. policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are populated. For ZONE_NORMAL gfp_zone(<no _GFP_DMA>) yields 0 which is < policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory allocations! Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied to DMA allocations! What we realy want in that place is to establish the highest allowable zone for a given gfp_mask. If the highest zone is higher or equal to the policy_zone then memory policies need to be applied. We have such a highest_zone() function in page_alloc.c. So move the highest_zone() function from mm/page_alloc.c into include/linux/gfp.h. On the way we simplify the function and use the new zone_type that was also introduced with the zone reduction patchset plus we also specify the right type for the gfp flags parameter. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 02 9月, 2006 1 次提交
-
-
由 Nishanth Aravamudan 提交于
Since vma->vm_pgoff is in units of smallpages, VMAs for huge pages have the lower HPAGE_SHIFT - PAGE_SHIFT bits always cleared, which results in badd offsets to the interleave functions. Take this difference from small pages into account when calculating the offset. This does add a 0-bit shift into the small-page path (via alloc_page_vma()), but I think that is negligible. Also add a BUG_ON to prevent the offset from growing due to a negative right-shift, which probably shouldn't be allowed anyways. Tested on an 8-memory node ppc64 NUMA box and got the interleaving I expected. Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: NAdam Litke <agl@us.ibm.com> Cc: Andi Kleen <ak@muc.de> Acked-by: NChristoph Lameter <clameter@engr.sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 01 7月, 2006 1 次提交
-
-
由 Christoph Lameter 提交于
The numa statistics are really event counters. But they are per node and so we have had special treatment for these counters through additional fields on the pcp structure. We can now use the per zone nature of the zoned VM counters to realize these. This will shrink the size of the pcp structure on NUMA systems. We will have some room to add additional per zone counters that will all still fit in the same cacheline. Bits Prior pcp size Size after patch We can add ------------------------------------------------------------------ 64 128 bytes (16 words) 80 bytes (10 words) 48 32 76 bytes (19 words) 56 bytes (14 words) 8 (64 byte cacheline) 72 (128 byte) Remove the special statistics for numa and replace them with zoned vm counters. This has the side effect that global sums of these events now show up in /proc/vmstat. Also take the opportunity to move the zone_statistics() function from page_alloc.c into vmstat.c. Discussions: V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NAndi Kleen <ak@suse.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 27 6月, 2006 1 次提交
-
-
由 Eric W. Biederman 提交于
Every inode in /proc holds a reference to a struct task_struct. If a directory or file is opened and remains open after the the task exits this pinning continues. With 8K stacks on a 32bit machine the amount pinned per file descriptor is about 10K. Normally I would figure a reasonable per user process limit is about 100 processes. With 80 processes, with a 1000 file descriptors each I can trigger the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless data. This patch replaces the struct task_struct pointer with a pointer to a struct task_ref which has a struct task_struct pointer. The so the pinning of dead tasks does not happen. The code now has to contend with the fact that the task may now exit at any time. Which is a little but not muh more complicated. With this change it takes about 1000 processes each opening up 1000 file descriptors before I can trigger the OOM killer. Much better. [mlp@google.com: task_mmu small fixes] Signed-off-by: NEric W. Biederman <ebiederm@xmission.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Paul Jackson <pj@sgi.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Albert Cahalan <acahalan@gmail.com> Signed-off-by: NPrasanna Meda <mlp@google.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 26 6月, 2006 1 次提交
-
-
由 Christoph Lameter 提交于
Hooks for calling vma specific migration functions With this patch a vma may define a vma->vm_ops->migrate function. That function may perform page migration on its own (some vmas may not contain page structs and therefore cannot be handled by regular page migration. Pages in a vma may require special preparatory treatment before migration is possible etc) . Only mmap_sem is held when the migration function is called. The migrate() function gets passed two sets of nodemasks describing the source and the target of the migration. The flags parameter either contains MPOL_MF_MOVE which means that only pages used exclusively by the specified mm should be moved or MPOL_MF_MOVE_ALL which means that pages shared with other processes should also be moved. The migration function returns 0 on success or an error condition. An error condition will prevent regular page migration from occurring. On its own this patch cannot be included since there are no users for this functionality. But it seems that the uncached allocator will need this functionality at some point. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 23 6月, 2006 4 次提交
-
-
由 David Quigley 提交于
This patch inserts security_task_movememory hook calls into memory management code to enable security modules to mediate this operation between tasks. Since the last posting, the hook has been renamed following feedback from Christoph Lameter. Signed-off-by: NDavid Quigley <dpquigl@tycho.nsa.gov> Acked-by: NStephen Smalley <sds@tycho.nsa.gov> Signed-off-by: NJames Morris <jmorris@namei.org> Cc: Andi Kleen <ak@muc.de> Acked-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NChris Wright <chrisw@sous-sol.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
move_pages() is used to move individual pages of a process. The function can be used to determine the location of pages and to move them onto the desired node. move_pages() returns status information for each page. long move_pages(pid, number_of_pages_to_move, addresses_of_pages[], nodes[] or NULL, status[], flags); The addresses of pages is an array of void * pointing to the pages to be moved. The nodes array contains the node numbers that the pages should be moved to. If a NULL is passed instead of an array then no pages are moved but the status array is updated. The status request may be used to determine the page state before issuing another move_pages() to move pages. The status array will contain the state of all individual page migration attempts when the function terminates. The status array is only valid if move_pages() completed successfullly. Possible page states in status[]: 0..MAX_NUMNODES The page is now on the indicated node. -ENOENT Page is not present -EACCES Page is mapped by multiple processes and can only be moved if MPOL_MF_MOVE_ALL is specified. -EPERM The page has been mlocked by a process/driver and cannot be moved. -EBUSY Page is busy and cannot be moved. Try again later. -EFAULT Invalid address (no VMA or zero page). -ENOMEM Unable to allocate memory on target node. -EIO Unable to write back page. The page must be written back in order to move it since the page is dirty and the filesystem does not provide a migration function that would allow the moving of dirty pages. -EINVAL A dirty page cannot be moved. The filesystem does not provide a migration function and has no ability to write back pages. The flags parameter indicates what types of pages to move: MPOL_MF_MOVE Move pages that are only mapped by the process. MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes. Requires sufficient capabilities. Possible return codes from move_pages() -ENOENT No pages found that would require moving. All pages are either already on the target node, not present, had an invalid address or could not be moved because they were mapped by multiple processes. -EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt to migrate pages in a kernel thread. -EPERM MPOL_MF_MOVE_ALL specified without sufficient priviledges. or an attempt to move a process belonging to another user. -EACCES One of the target nodes is not allowed by the current cpuset. -ENODEV One of the target nodes is not online. -ESRCH Process does not exist. -E2BIG Too many pages to move. -ENOMEM Not enough memory to allocate control array. -EFAULT Parameters could not be accessed. A test program for move_pages() may be found with the patches on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3 From: Christoph Lameter <clameter@sgi.com> Detailed results for sys_move_pages() Pass a pointer to an integer to get_new_page() that may be used to indicate where the completion status of a migration operation should be placed. This allows sys_move_pags() to report back exactly what happened to each page. Wish there would be a better way to do this. Looks a bit hacky. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Jes Sorensen <jes@trained-monkey.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andi Kleen <ak@muc.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
Instead of passing a list of new pages, pass a function to allocate a new page. This allows the correct placement of MPOL_INTERLEAVE pages during page migration. It also further simplifies the callers of migrate pages. migrate_pages() becomes similar to migrate_pages_to() so drop migrate_pages_to(). The batching of new page allocations becomes unnecessary. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Jes Sorensen <jes@trained-monkey.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
Do not leave pages on the lists passed to migrate_pages(). Seems that we will not need any postprocessing of pages. This will simplify the handling of pages by the callers of migrate_pages(). Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Jes Sorensen <jes@trained-monkey.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 20 4月, 2006 1 次提交
-
-
由 Christoph Lameter 提交于
gather_stats() is called with a spinlock held from check_pte_range. We cannot reschedule with a lock held. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 29 3月, 2006 1 次提交
-
-
由 Alexey Dobriyan 提交于
Fix a lot of typos. Eyeballed by jmc@ in OpenBSD. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-