- 17 10月, 2007 17 次提交
-
-
由 Nick Piggin 提交于
Probing pages and radix_tree_tagged are lockless operations with the lockless radix-tree. Convert these users to RCU locking rather than using tree_lock. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
The commit b5810039 contains the note A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. And indeed this cacheline bouncing has shown up on large SGI systems. There was a situation where an Altix system was essentially livelocked tearing down ZERO_PAGE pagetables when an HPC app aborted during startup. This situation can be avoided in userspace, but it does highlight the potential scalability problem with refcounting ZERO_PAGE, and corner cases where it can really hurt (we don't want the system to livelock!). There are several broad ways to fix this problem: 1. add back some special casing to avoid refcounting ZERO_PAGE 2. per-node or per-cpu ZERO_PAGES 3. remove the ZERO_PAGE completely I will argue for 3. The others should also fix the problem, but they result in more complex code than does 3, with little or no real benefit that I can see. Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a false optimisation: if an application is performance critical, it would not be doing many read faults of new memory, or at least it could be expected to write to that memory soon afterwards. If cache or memory use is critical, it should not be working with a significant number of ZERO_PAGEs anyway (a more compact representation of zeroes should be used). As a sanity check -- mesuring on my desktop system, there are never many mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not increase much without it. When running a make -j4 kernel compile on my dual core system, there are about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000 ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second is torn down without being COWed). So removing ZERO_PAGE will save 1,000 page faults per second when running kbuild, while keeping it only saves less than 1 page clearing operation per second. 1 page clear is cheaper than a thousand faults, presumably, so there isn't an obvious loss. Neither the logical argument nor these basic tests give a guarantee of no regressions. However, this is a reasonable opportunity to try to remove the ZERO_PAGE from the pagefault path. If it is found to cause regressions, we can reintroduce it and just avoid refcounting it. The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see much use to them except on benchmarks. All other users of ZERO_PAGE are converted just to use ZERO_PAGE(0) for simplicity. We can look at replacing them all and maybe ripping out ZERO_PAGE completely when we are more satisfied with this solution. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus "snif" Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
This gets rid of all kmalloc caches larger than page size. A kmalloc request larger than PAGE_SIZE > 2 is going to be passed through to the page allocator. This works both inline where we will call __get_free_pages instead of kmem_cache_alloc and in __kmalloc. kfree is modified to check if the object is in a slab page. If not then the page is freed via the page allocator instead. Roughly similar to what SLOB does. Advantages: - Reduces memory overhead for kmalloc array - Large kmalloc operations are faster since they do not need to pass through the slab allocator to get to the page allocator. - Performance increase of 10%-20% on alloc and 50% on free for PAGE_SIZEd allocations. SLUB must call page allocator for each alloc anyways since the higher order pages which that allowed avoiding the page alloc calls are not available in a reliable way anymore. So we are basically removing useless slab allocator overhead. - Large kmallocs yields page aligned object which is what SLAB did. Bad things like using page sized kmalloc allocations to stand in for page allocate allocs can be transparently handled and are not distinguishable from page allocator uses. - Checking for too large objects can be removed since it is done by the page allocator. Drawbacks: - No accounting for large kmalloc slab allocations anymore - No debugging of large kmalloc slab allocations. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
Convert some 'unsigned long' to pgoff_t. Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
- remove unused local next_index in do_generic_mapping_read() - remove a redudant page_cache_read() declaration Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES. Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
The local copy of ra in do_generic_mapping_read() can now go away. It predates readanead(req_size). In a time when the readahead code was called on *every* single page. Hence a local has to be made to reduce the chance of the readahead state being overwritten by a concurrent reader. More details in: Linux: Random File I/O Regressions In 2.6 <http://kerneltrap.org/node/3039> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
This is a simplified version of the pagecache context based readahead. It handles the case of multiple threads reading on the same fd and invalidating each others' readahead state. It does the trick by scanning the pagecache and recovering the current read stream's readahead status. The algorithm works in a opportunistic way, in that it does not try to detect interleaved reads _actively_, which requires a probe into the page cache (which means a little more overhead for random reads). It only tries to handle a previously started sequential readahead whose state was overwritten by another concurrent stream, and it can do this job pretty well. Negative and positive examples(or what you can expect from it): 1) it cannot detect and serve perfect request-by-request interleaved reads right: time stream 1 stream 2 0 1 1 1001 2 2 3 1002 4 3 5 1003 6 4 7 1004 8 5 9 1005 Here no single readahead will be carried out. 2) However, if it's two concurrent reads by two threads, the chance of the initial sequential readahead be started is huge. Once the first sequential readahead is started for a stream, this patch will ensure that the readahead window continues to rampup and won't be disturbed by other streams. time stream 1 stream 2 0 1 1 2 2 1001 3 3 4 1002 5 1003 6 4 7 5 8 1004 9 6 10 1005 11 7 12 1006 13 1007 Here stream 1 will start a readahead at page 2, and stream 2 will start its first readahead at page 1003. From then on the two streams will be served right. Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
Combine the file_ra_state members unsigned long prev_index unsigned int prev_offset into loff_t prev_pos It is more consistent and better supports huge files. Thanks to Peter for the nice proposal! [akpm@linux-foundation.org: fix shift overflow] Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
Fold file_ra_state.mmap_hit into file_ra_state.mmap_miss and make it an int. Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
Use 'unsigned int' instead of 'unsigned long' for readahead sizes. This helps reduce memory consumption on 64bit CPU when a lot of files are opened. CC: Andi Kleen <andi@firstfloor.org> Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jesper Juhl 提交于
This patch cleans up duplicate includes in mm/ Signed-off-by: NJesper Juhl <jesper.juhl@gmail.com> Acked-by: NPaul Mundt <lethal@linux-sh.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Adrian Bunk 提交于
WARNING: mm/built-in.o(.text+0x24bd3): Section mismatch: reference to .init.text:early_kmem_cache_node_alloc (between 'init_kmem_cache_nodes' and 'calculate_sizes') ... Signed-off-by: NAdrian Bunk <bunk@stusta.de> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Whitcroft 提交于
Convert the common vmemmap population into initialisation helpers for use by architecture vmemmap populators. All architecture implementing the SPARSEMEM_VMEMMAP variant supply an architecture specific vmemmap_populate() initialiser, which may make use of the helpers. This allows us to clean up and remove the initialisation Kconfig entries. With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to indicate use of that variant. Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all the arches. It would be great if it could be the default so that we can get rid of various forms of DISCONTIG and other variations on memory maps. So far what has hindered this are the additional lookups that SPARSEMEM introduces for virt_to_page and page_address. This goes so far that the code to do this has to be kept in a separate function and cannot be used inline. This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap is mapped into a virtually contigious area, only the active sections are physically backed. This allows virt_to_page page_address and cohorts become simple shift/add operations. No page flag fields, no table lookups, nothing involving memory is required. The two key operations pfn_to_page and page_to_page become: #define __pfn_to_page(pfn) (vmemmap + (pfn)) #define __page_to_pfn(page) ((page) - vmemmap) By having a virtual mapping for the memmap we allow simple access without wasting physical memory. As kernel memory is typically already mapped 1:1 this introduces no additional overhead. The virtual mapping must be big enough to allow a struct page to be allocated and mapped for all valid physical pages. This vill make a virtual memmap difficult to use on 32 bit platforms that support 36 address bits. However, if there is enough virtual space available and the arch already maps its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM. FLATMEM needs to read the contents of the mem_map variable to get the start of the memmap and then add the offset to the required entry. vmemmap is a constant to which we can simply add the offset. This patch has the potential to allow us to make SPARSMEM the default (and even the only) option for most systems. It should be optimal on UP, SMP and NUMA on most platforms. Then we may even be able to remove the other memory models: FLATMEM, DISCONTIG etc. [apw@shadowen.org: config cleanups, resplit code etc] [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init] [apw@shadowen.org: vmemmap: remove excess debugging] [apw@shadowen.org: simplify initialisation code and reduce duplication] [apw@shadowen.org: pull out the vmemmap code into its own file] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Whitcroft 提交于
We have flags to indicate whether a section actually has a valid mem_map associated with it. This is never set and we rely solely on the present bit to indicate a section is valid. By definition a section is not valid if it has no mem_map and there is a window during init where the present bit is set but there is no mem_map, during which pfn_valid() will return true incorrectly. Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid mem_map. Switch valid_section{,_nr} and pfn_valid() to this bit. Add a new present_section{,_nr} and pfn_present() interfaces for those users who care to know that a section is going to be valid. [akpm@linux-foundation.org: coding-syle fixes] Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Whitcroft 提交于
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all the arches. It would be great if it could be the default so that we can get rid of various forms of DISCONTIG and other variations on memory maps. So far what has hindered this are the additional lookups that SPARSEMEM introduces for virt_to_page and page_address. This goes so far that the code to do this has to be kept in a separate function and cannot be used inline. This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap is mapped into a virtually contigious area, only the active sections are physically backed. This allows virt_to_page page_address and cohorts become simple shift/add operations. No page flag fields, no table lookups, nothing involving memory is required. The two key operations pfn_to_page and page_to_page become: #define __pfn_to_page(pfn) (vmemmap + (pfn)) #define __page_to_pfn(page) ((page) - vmemmap) By having a virtual mapping for the memmap we allow simple access without wasting physical memory. As kernel memory is typically already mapped 1:1 this introduces no additional overhead. The virtual mapping must be big enough to allow a struct page to be allocated and mapped for all valid physical pages. This vill make a virtual memmap difficult to use on 32 bit platforms that support 36 address bits. However, if there is enough virtual space available and the arch already maps its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM. FLATMEM needs to read the contents of the mem_map variable to get the start of the memmap and then add the offset to the required entry. vmemmap is a constant to which we can simply add the offset. This patch has the potential to allow us to make SPARSMEM the default (and even the only) option for most systems. It should be optimal on UP, SMP and NUMA on most platforms. Then we may even be able to remove the other memory models: FLATMEM, DISCONTIG etc. The current aim is to bring a common virtually mapped mem_map to all architectures. This should facilitate the removal of the bespoke implementations from the architectures. This also brings performance improvements for most architecture making sparsmem vmemmap the more desirable memory model. The ultimate aim of this work is to expand sparsemem support to encompass all the features of the other memory models. This could allow us to drop support for and remove the other models in the longer term. Below are some comparitive kernbench numbers for various architectures, comparing default memory model against SPARSEMEM VMEMMAP. All but ia64 show marginal improvement; we expect the ia64 figures to be sorted out when the larger mapping support returns. x86-64 non-NUMA Base VMEMAP % change (-ve good) User 85.07 84.84 -0.26 System 34.32 33.84 -1.39 Total 119.38 118.68 -0.59 ia64 Base VMEMAP % change (-ve good) User 1016.41 1016.93 0.05 System 50.83 51.02 0.36 Total 1067.25 1067.95 0.07 x86-64 NUMA Base VMEMAP % change (-ve good) User 30.77 431.73 0.22 System 45.39 43.98 -3.11 Total 476.17 475.71 -0.10 ppc64 Base VMEMAP % change (-ve good) User 488.77 488.35 -0.09 System 56.92 56.37 -0.97 Total 545.69 544.72 -0.18 Below are some AIM bencharks on IA64 and x86-64 (thank Bob). The seems pretty much flat as you would expect. ia64 results 2 cpu non-numa 4Gb SCSI disk Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" extreme Jun 1 07:17:24 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 98.9 100 58.9 1.3 1.6482 101 5547.1 95 106.0 79.4 0.9154 201 6377.7 95 183.4 158.3 0.5288 301 6932.2 95 252.7 237.3 0.3838 401 7075.8 93 329.8 316.7 0.2941 501 7235.6 94 403.0 396.2 0.2407 600 7387.5 94 472.7 475.0 0.2052 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" vmemmap Jun 1 09:59:04 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 99.1 100 58.8 1.2 1.6509 101 5480.9 95 107.2 79.2 0.9044 201 6490.3 95 180.2 157.8 0.5382 301 6886.6 94 254.4 236.8 0.3813 401 7078.2 94 329.7 316.0 0.2942 501 7250.3 95 402.2 395.4 0.2412 600 7399.1 94 471.9 473.9 0.2055 open power 710 2 cpu, 4 Gb, SCSI and configured physically Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" extreme May 29 15:42:53 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 25.7 100 226.3 4.3 0.4286 101 1096.0 97 536.4 199.8 0.1809 201 1236.4 96 946.1 389.1 0.1025 301 1280.5 96 1368.0 582.3 0.0709 401 1270.2 95 1837.4 771.0 0.0528 501 1251.4 96 2330.1 955.9 0.0416 601 1252.6 96 2792.4 1139.2 0.0347 701 1245.2 96 3276.5 1334.6 0.0296 918 1229.5 96 4345.4 1728.7 0.0223 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" vmemmap May 30 07:28:26 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 25.6 100 226.9 4.3 0.4275 101 1049.3 97 560.2 198.1 0.1731 201 1199.1 97 975.6 390.7 0.0994 301 1261.7 96 1388.5 591.5 0.0699 401 1256.1 96 1858.1 771.9 0.0522 501 1220.1 96 2389.7 955.3 0.0406 601 1224.6 96 2856.3 1133.4 0.0340 701 1252.0 96 3258.7 1314.1 0.0298 915 1232.8 96 4319.7 1704.0 0.0225 amd64 2 2-core, 4Gb and SATA Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" extreme Jun 2 03:59:48 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 13.0 100 446.4 2.1 0.2173 101 533.4 97 1102.0 110.2 0.0880 201 578.3 97 2022.8 220.8 0.0480 301 583.8 97 3000.6 332.3 0.0323 401 580.5 97 4020.1 442.2 0.0241 501 574.8 98 5072.8 558.8 0.0191 600 566.5 98 6163.8 671.0 0.0157 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" vmemmap Jun 3 04:19:31 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 13.0 100 447.8 2.0 0.2166 101 536.5 97 1095.6 109.7 0.0885 201 567.7 97 2060.5 219.3 0.0471 301 582.1 96 3009.4 330.2 0.0322 401 578.2 96 4036.4 442.4 0.0240 501 585.1 98 4983.2 555.1 0.0195 600 565.5 98 6175.2 660.6 0.0157 This patch: Fix some spelling errors. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 10月, 2007 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 10月, 2007 2 次提交
-
-
由 NeilBrown 提交于
As bi_end_io is only called once when the reqeust is complete, the 'size' argument is now redundant. Remove it. Now there is no need for bio_endio to subtract the size completed from bi_size. So don't do that either. While we are at it, change bi_end_io to return void. Signed-off-by: NNeil Brown <neilb@suse.de> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Jens Axboe 提交于
Hide everything in blkdev.h with CONFIG_BLOCK isn't set, and fixup the (few) files that fail to build because they were relying on blkdev.h pulling in extra includes for them. Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 09 10月, 2007 3 次提交
-
-
由 Yan Zheng 提交于
find_lock_page increases page's usage count, we should decrease it before return VM_FAULT_SIGBUS Signed-off-by: Yan Zheng<yanzheng@21cn.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yan Zheng 提交于
The test for VM_CAN_NONLINEAR always fails Signed-off-by: Yan Zheng<yanzheng@21cn.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Zijlstra 提交于
All the current page_mkwrite() implementations also set the page dirty. Which results in the set_page_dirty_balance() call to _not_ call balance, because the page is already found dirty. This allows us to dirty a _lot_ of pages without ever hitting balance_dirty_pages(). Not good (tm). Force a balance call if ->page_mkwrite() was successful. Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 10月, 2007 1 次提交
-
-
由 Jeremy Fitzhardinge 提交于
When pinning and unpinning pagetables, we must protect them against being used by other CPUs, lest they see the pagetable in an intermediate read-only-but-not-pinned state. When using split pte locks, doing this properly would require taking all the pte locks for the pagetable while pinning, but this may overflow the PREEMPT_BITS part of the preempt counter if the process has mapped more than about 512M of memory. However, failing to take the pte locks causes write-protect faults when the pageout code is trying to clear the Access bit on a pte which is part of a freshy created and still being pinned process after fork. This is a short-term fix until the problem is solved properly. Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NHugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andi Kleen <ak@suse.de> Cc: Keir Fraser <keir@xensource.com> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 10月, 2007 1 次提交
-
-
由 Hugh Dickins 提交于
Gurudas Pai reports kernel BUG at arch/i386/mm/highmem.c:15! below sys_remap_file_pages, while running Oracle database test on x86 in 6GB RAM: kunmap thinks we're in_interrupt because the preempt count has wrapped. That's because __do_fault expected to unmap page_table, but one of its two callers do_nonlinear_fault already unmapped it: let do_linear_fault unmap it first too, and then there's no need to pass the page_table arg down. Why have we been so slow to notice this? Probably through forgetting that the mapping_cap_account_dirty test means that sys_remap_file_pages nowadays only goes the full nonlinear vma route on a few memory-backed filesystems like ramfs, tmpfs and hugetlbfs. [ It also depends on CONFIG_HIGHPTE, so it becomes even harder to trigger in practice. Many who have need of large memory have probably migrated to x86-64.. Problem introduced by commit d0217ac0 ("mm: fault feedback #1") -- Linus ] Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: gurudas pai <gurudas.pai@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 10月, 2007 1 次提交
-
-
由 Ralf Baechle 提交于
The virtual address space argument of clear_user_highpage is supposed to be the virtual address where the page being cleared will eventually be mapped. This allows architectures with virtually indexed caches a few clever tricks. That sort of trick falls over in painful ways if the virtual address argument is wrong. Signed-off-by: NRalf Baechle <ralf@linux-mips.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 9月, 2007 1 次提交
-
-
由 Lee Schermerhorn 提交于
This patch proposes fixes to the reference counting of memory policy in the page allocation paths and in show_numa_map(). Extracted from my "Memory Policy Cleanups and Enhancements" series as stand-alone. Shared policy lookup [shmem] has always added a reference to the policy, but this was never unrefed after page allocation or after formatting the numa map data. Default system policy should not require additional ref counting, nor should the current task's task policy. However, show_numa_map() calls get_vma_policy() to examine what may be [likely is] another task's policy. The latter case needs protection against freeing of the policy. This patch adds a reference count to a mempolicy returned by get_vma_policy() when the policy is a vma policy or another task's mempolicy. Again, shared policy is already reference counted on lookup. A matching "unref" [__mpol_free()] is performed in alloc_page_vma() for shared and vma policies, and in show_numa_map() for shared and another task's mempolicy. We can call __mpol_free() directly, saving an admittedly inexpensive inline NULL test, because we know we have a non-NULL policy. Handling policy ref counts for hugepages is a bit trickier. huge_zonelist() returns a zone list that might come from a shared or vma 'BIND policy. In this case, we should hold the reference until after the huge page allocation in dequeue_hugepage(). The patch modifies huge_zonelist() to return a pointer to the mempolicy if it needs to be unref'd after allocation. Kernel Build [16cpu, 32GB, ia64] - average of 10 runs: w/o patch w/ refcount patch Avg Std Devn Avg Std Devn Real: 100.59 0.38 100.63 0.43 User: 1209.60 0.37 1209.91 0.31 System: 81.52 0.42 81.64 0.34 Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NAndi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 9月, 2007 1 次提交
-
-
由 Christoph Lameter 提交于
This was posted on Aug 28 and fixes an issue that could cause troubles when slab caches >=128k are created. http://marc.info/?l=linux-mm&m=118798149918424&w=2 Currently we simply add the debug flags unconditional when checking for a matching slab. This creates issues for sysfs processing when slabs exist that are exempt from debugging due to their huge size or because only a subset of slabs was selected for debugging. We need to only add the flags if kmem_cache_open() would also add them. Create a function to calculate the flags that would be set if the cache would be opened and use that function to determine the flags before looking for a compatible slab. [akpm@linux-foundation.org: fixlets] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 8月, 2007 4 次提交
-
-
由 Christoph Lameter 提交于
Page migration currently does not check if the target of the move contains nodes that that are invalid (if root attempts to migrate pages) and may try to allocate from invalid nodes if these are specified leading to oopses. Return -EINVAL if an offline node is specified. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
Do not BUG() if we cannot register a slab with sysfs. Just print an error. The only consequence of not registering is that the slab cache is not visible via /sys/slab. A BUG() may not be visible that early during boot and we have had multiple issues here already. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
In migration fallback path, write_page() or lock_page() will be called. This causes sleep with holding rcu_read_lock(). For avoding that, just do rcu_lock if the page is Anon.(this is enough.) Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
Don't try to free memory which we didn't allocate. Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 8月, 2007 8 次提交
-
-
由 Mel Gorman 提交于
The NUMA layer only supports NUMA policies for the highest zone. When ZONE_MOVABLE is configured with kernelcore=, the the highest zone becomes ZONE_MOVABLE. The result is that policies are only applied to allocations like anonymous pages and page cache allocated from ZONE_MOVABLE when the zone is used. This patch applies policies to the two highest zones when the highest zone is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real" zone, it's always functionally equivalent. The patch has been tested on a variety of machines both NUMA and non-NUMA covering x86, x86_64 and ppc64. No abnormal results were seen in kernbench, tbench, dbench or hackbench. It passes regression tests from the numactl package with and without kernelcore= once numactl tests are patched to wait for vmstat counters to update. akpm: this is the nasty hack to fix NUMA mempolicies in the presence of ZONE_MOVABLE and kernelcore= in 2.6.23. Christoph says "For .24 either merge the mobility or get the other solution that Mel is working on. That solution would only use a single zonelist per node and filter on the fly. That may help performance and also help to make memory policies work better." Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NChristoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
Print a big fat warning and do what is necessary to continue if a node is marked as up (meaning either node is online (upstream) or node has memory (Andrew's tree)) but allocations from the node do not succeed. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
SLUB is using atomic_read() for variables declared atomic_long_t. Switch to atomic_long_read(). Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Adam Litke 提交于
It seems a simple mistake was made when converting follow_hugetlb_page() over to the VM_FAULT flags bitmasks (in "mm: fault feedback #2", commit 83c54070). By using the wrong bitmask, hugetlb_fault() failures are not being recognized. This results in an infinite loop whenever follow_hugetlb_page is involved in a failed fault. Signed-off-by: NAdam Litke <agl@us.ibm.com> Acked-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Siddha, Suresh B 提交于
Skip calling cache_free_alien() when the platform is not numa capable. This will avoid cache misses that happen while accessing slabp (which is per page memory reference) to get nodeid. Instead use a global variable to skip the call, which is mostly likely to be present in the cache. This gives a 0.8% performance boost with the database oltp workload on a quad-core SMP platform and by any means the number is not small :) Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alan Cox 提交于
The new exec code inserts an accounted vma into an mm struct which is not current->mm. The existing memory check code has a hard coded assumption that this does not happen as does the security code. As the correct mm is known we pass the mm to the security method and the helper function. A new security test is added for the case where we need to pass the mm and the existing one is modified to pass current->mm to avoid the need to change large amounts of code. (Thanks to Tobias for fixing rejects and testing) Signed-off-by: NAlan Cox <alan@redhat.com> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Cc: James Morris <jmorris@redhat.com> Cc: Tobias Diedrich <ranma+kernel@tdiedrich.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Whitcroft 提交于
Lumpy reclaim works by selecting a lead page from the LRU list and then selecting pages for reclaim from the order-aligned area of pages. In the situation were all pages in that region are inactive and not referenced by any process over time, it works well. In the situation where there is even light load on the system, the pages may not free quickly. Out of a area of 1024 pages, maybe only 950 of them are freed when the allocation attempt occurs because lumpy reclaim returned early. This patch alters the behaviour of direct reclaim for large contiguous blocks. The first attempt to call shrink_page_list() is asynchronous but if it fails, the pages are submitted a second time and the calling process waits for the IO to complete. This may stall allocators waiting for contiguous memory but that should be expected behaviour for high-order users. It is preferable behaviour to potentially queueing unnecessary areas for IO. Note that kswapd will not stall in this fashion. [apw@shadowen.org: update to version 2] [apw@shadowen.org: update to version 3] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Whitcroft 提交于
As pointed out by Mel when reclaim is applied at higher orders a significant amount of IO may be started. As this takes finite time to drain reclaim will consider more areas than ultimatly needed to satisfy the request. This leads to more reclaim than strictly required and reduced success rates. I was able to confirm Mel's test results on systems locally. These show that even under light load the success rates drop off far more than expected. Testing with a modified version of his patch (which follows) I was able to allocate almost all of ZONE_MOVABLE with a near idle system. I ran 5 test passes sequentially following system boot (the system has 29 hugepages in ZONE_MOVABLE): 2.6.23-rc1 11 8 6 7 7 sync_lumpy 28 28 29 29 26 These show that although hugely better than the near 0% success normally expected we can only allocate about a 1/4 of the zone. Using synchronous reclaim for these allocations we get close to 100% as expected. I have also run our standard high order tests and these show no regressions in allocation success rates at rest, and some significant improvements under load. This patch: We are transitioning pages from active to inactive in clear_active_flags, those need counting as PGDEACTIVATE vm events. Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-