- 20 7月, 2007 5 次提交
-
-
由 Fengguang Wu 提交于
This is a minimal readahead algorithm that aims to replace the current one. It is more flexible and reliable, while maintaining almost the same behavior and performance. Also it is full integrated with adaptive readahead. It is designed to be called on demand: - on a missing page, to do synchronous readahead - on a lookahead page, to do asynchronous readahead In this way it eliminated the awkward workarounds for cache hit/miss, readahead thrashing, retried read, and unaligned read. It also adopts the data structure introduced by adaptive readahead, parameterizes readahead pipelining with `lookahead_index', and reduces the current/ahead windows to one single window. HEURISTICS The logic deals with four cases: - sequential-next found a consistent readahead window, so push it forward - random standalone small read, so read as is - sequential-first create a new readahead window for a sequential/oversize request - lookahead-clueless hit a lookahead page not associated with the readahead window, so create a new readahead window and ramp it up In each case, three parameters are determined: - readahead index: where the next readahead begins - readahead size: how much to readahead - lookahead size: when to do the next readahead (for pipelining) BEHAVIORS The old behaviors are maximally preserved for trivial sequential/random reads. Notable changes are: - It no longer imposes strict sequential checks. It might help some interleaved cases, and clustered random reads. It does introduce risks of a random lookahead hit triggering an unexpected readahead. But in general it is more likely to do good than to do evil. - Interleaved reads are supported in a minimal way. Their chances of being detected and proper handled are still low. - Readahead thrashings are better handled. The current readahead leads to tiny average I/O sizes, because it never turn back for the thrashed pages. They have to be fault in by do_generic_mapping_read() one by one. Whereas the on-demand readahead will redo readahead for them. OVERHEADS The new code reduced the overheads of - excessively calling the readahead routine on small sized reads (the current readahead code insists on seeing all requests) - doing a lot of pointless page-cache lookups for small cached files (the current readahead only turns itself off after 256 cache hits, unfortunately most files are < 1MB, so never see that chance) That accounts for speedup of - 0.3% on 1-page sequential reads on sparse file - 1.2% on 1-page cache hot sequential reads - 3.2% on 256-page cache hot sequential reads - 1.3% on cache hot `tar /lib` However, it does introduce one extra page-cache lookup per cache miss, which impacts random reads slightly. That's 1% overheads for 1-page random reads on sparse file. PERFORMANCE The basic benchmark setup is - 2.6.20 kernel with on-demand readahead - 1MB max readahead size - 2.9GHz Intel Core 2 CPU - 2GB memory - 160G/8M Hitachi SATA II 7200 RPM disk The benchmarks show that - it maintains the same performance for trivial sequential/random reads - sysbench/OLTP performance on MySQL gains up to 8% - performance on readahead thrashing gains up to 3 times iozone throughput (KB/s): roughly the same ========================================== iozone -c -t1 -s 4096m -r 64k 2.6.20 on-demand gain first run " Initial write " 61437.27 64521.53 +5.0% " Rewrite " 47893.02 48335.20 +0.9% " Read " 62111.84 62141.49 +0.0% " Re-read " 62242.66 62193.17 -0.1% " Reverse Read " 50031.46 49989.79 -0.1% " Stride read " 8657.61 8652.81 -0.1% " Random read " 13914.28 13898.23 -0.1% " Mixed workload " 19069.27 19033.32 -0.2% " Random write " 14849.80 14104.38 -5.0% " Pwrite " 62955.30 65701.57 +4.4% " Pread " 62209.99 62256.26 +0.1% second run " Initial write " 60810.31 66258.69 +9.0% " Rewrite " 49373.89 57833.66 +17.1% " Read " 62059.39 62251.28 +0.3% " Re-read " 62264.32 62256.82 -0.0% " Reverse Read " 49970.96 50565.72 +1.2% " Stride read " 8654.81 8638.45 -0.2% " Random read " 13901.44 13949.91 +0.3% " Mixed workload " 19041.32 19092.04 +0.3% " Random write " 14019.99 14161.72 +1.0% " Pwrite " 64121.67 68224.17 +6.4% " Pread " 62225.08 62274.28 +0.1% In summary, writes are unstable, reads are pretty close on average: access pattern 2.6.20 on-demand gain Read 62085.61 62196.38 +0.2% Re-read 62253.49 62224.99 -0.0% Reverse Read 50001.21 50277.75 +0.6% Stride read 8656.21 8645.63 -0.1% Random read 13907.86 13924.07 +0.1% Mixed workload 19055.29 19062.68 +0.0% Pread 62217.53 62265.27 +0.1% aio-stress: roughly the same ============================ aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso 2.6.20 on-demand delta sequential 92.57s 92.54s -0.0% random 311.87s 312.15s +0.1% sysbench fileio: roughly the same ================================= sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \ --file-total-size=4G --file-block-size=64K \ --num-threads=001 --max-requests=10000 --max-time=900 run threads 2.6.20 on-demand delta first run 1 59.1974s 59.2262s +0.0% 2 58.0575s 58.2269s +0.3% 4 48.0545s 47.1164s -2.0% 8 41.0684s 41.2229s +0.4% 16 35.8817s 36.4448s +1.6% 32 32.6614s 32.8240s +0.5% 64 23.7601s 24.1481s +1.6% 128 24.3719s 23.8225s -2.3% 256 23.2366s 22.0488s -5.1% second run 1 59.6720s 59.5671s -0.2% 8 41.5158s 41.9541s +1.1% 64 25.0200s 23.9634s -4.2% 256 22.5491s 20.9486s -7.1% Note that the numbers are not very stable because of the writes. The overall performance is close when we sum all seconds up: sum all up 495.046s 491.514s -0.7% sysbench oltp (trans/sec): up to 8% gain ======================================== sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \ --mysql-socket=/var/run/mysqld/mysqld.sock \ --mysql-user=root --mysql-password=readahead \ --num-threads=064 --max-requests=10000 --max-time=900 run 10000-transactions run threads 2.6.20 on-demand gain 1 62.81 64.56 +2.8% 2 67.97 70.93 +4.4% 4 81.81 85.87 +5.0% 8 94.60 97.89 +3.5% 16 99.07 104.68 +5.7% 32 95.93 104.28 +8.7% 64 96.48 103.68 +7.5% 5000-transactions run 1 48.21 48.65 +0.9% 8 68.60 70.19 +2.3% 64 70.57 74.72 +5.9% 2000-transactions run 1 37.57 38.04 +1.3% 2 38.43 38.99 +1.5% 4 45.39 46.45 +2.3% 8 51.64 52.36 +1.4% 16 54.39 55.18 +1.5% 32 52.13 54.49 +4.5% 64 54.13 54.61 +0.9% That's interesting results. Some investigations show that - MySQL is accessing the db file non-uniformly: some parts are more hot than others - It is mostly doing 4-page random reads, and sometimes doing two reads in a row, the latter one triggers a 16-page readahead. - The on-demand readahead leaves many lookahead pages (flagged PG_readahead) there. Many of them will be hit, and trigger more readahead pages. Which might save more seeks. - Naturally, the readahead windows tend to lie in hot areas, and the lookahead pages in hot areas is more likely to be hit. - The more overall read density, the more possible gain. That also explains the adaptive readahead tricks for clustered random reads. readahead thrashing: 3 times better =================================== We boot kernel with "mem=128m single", and start a 100KB/s stream on every second, until reaching 200 streams. max throughput min avg I/O size 2.6.20: 5MB/s 16KB on-demand: 15MB/s 140KB Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
This patch completes Linus's wish that the fault return codes be made into bit flags, which I agree makes everything nicer. This requires requires all handle_mm_fault callers to be modified (possibly the modifications should go further and do things like fault accounting in handle_mm_fault -- however that would be for another patch). [akpm@linux-foundation.org: fix alpha build] [akpm@linux-foundation.org: fix s390 build] [akpm@linux-foundation.org: fix sparc build] [akpm@linux-foundation.org: fix sparc64 build] [akpm@linux-foundation.org: fix ia64 build] Signed-off-by: NNick Piggin <npiggin@suse.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Bryan Wu <bryan.wu@analog.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Cc: Matthew Wilcox <willy@debian.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Chris Zankel <chris@zankel.net> Acked-by: NKyle McMartin <kyle@mcmartin.ca> Acked-by: NHaavard Skinnemoen <hskinnemoen@atmel.com> Acked-by: NRalf Baechle <ralf@linux-mips.org> Acked-by: NAndi Kleen <ak@muc.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> [ Still apparently needs some ARM and PPC loving - Linus ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Change ->fault prototype. We now return an int, which contains VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte. FAULT_RET_ code tells the VM whether a page was found, whether it has been locked, and potentially other things. This is not quite the way he wanted it yet, but that's changed in the next patch (which requires changes to arch code). This means we no longer set VM_CAN_INVALIDATE in the vma in order to say that a page is locked which requires filemap_nopage to go away (because we can no longer remain backward compatible without that flag), but we were going to do that anyway. struct fault_data is renamed to struct vm_fault as Linus asked. address is now a void __user * that we should firmly encourage drivers not to use without really good reason. The page is now returned via a page pointer in the vm_fault struct. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes the virtual address -> file offset differently from linear mappings. ->populate is a layering violation because the filesystem/pagecache code should need to know anything about the virtual memory mapping. The hitch here is that the ->nopage handler didn't pass down enough information (ie. pgoff). But it is more logical to pass pgoff rather than have the ->nopage function calculate it itself anyway (because that's a similar layering violation). Having the populate handler install the pte itself is likewise a nasty thing to be doing. This patch introduces a new fault handler that replaces ->nopage and ->populate and (later) ->nopfn. Most of the old mechanism is still in place so there is a lot of duplication and nice cleanups that can be removed if everyone switches over. The rationale for doing this in the first place is that nonlinear mappings are subject to the pagefault vs invalidate/truncate race too, and it seemed stupid to duplicate the synchronisation logic rather than just consolidate the two. After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in pagecache. Seems like a fringe functionality anyway. NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no users have hit mainline yet. [akpm@linux-foundation.org: cleanup] [randy.dunlap@oracle.com: doc. fixes for readahead] [akpm@linux-foundation.org: build fix] Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Fix the race between invalidate_inode_pages and do_no_page. Andrea Arcangeli identified a subtle race between invalidation of pages from pagecache with userspace mappings, and do_no_page. The issue is that invalidation has to shoot down all mappings to the page, before it can be discarded from the pagecache. Between shooting down ptes to a particular page, and actually dropping the struct page from the pagecache, do_no_page from any process might fault on that page and establish a new mapping to the page just before it gets discarded from the pagecache. The most common case where such invalidation is used is in file truncation. This case was catered for by doing a sort of open-coded seqlock between the file's i_size, and its truncate_count. Truncation will decrease i_size, then increment truncate_count before unmapping userspace pages; do_no_page will read truncate_count, then find the page if it is within i_size, and then check truncate_count under the page table lock and back out and retry if it had subsequently been changed (ptl will serialise against unmapping, and ensure a potentially updated truncate_count is actually visible). Complexity and documentation issues aside, the locking protocol fails in the case where we would like to invalidate pagecache inside i_size. do_no_page can come in anytime and filemap_nopage is not aware of the invalidation in progress (as it is when it is outside i_size). The end result is that dangling (->mapping == NULL) pages that appear to be from a particular file may be mapped into userspace with nonsense data. Valid mappings to the same place will see a different page. Andrea implemented two working fixes, one using a real seqlock, another using a page->flags bit. He also proposed using the page lock in do_no_page, but that was initially considered too heavyweight. However, it is not a global or per-file lock, and the page cacheline is modified in do_no_page to increment _count and _mapcount anyway, so a further modification should not be a large performance hit. Scalability is not an issue. This patch implements this latter approach. ->nopage implementations return with the page locked if it is possible for their underlying file to be invalidated (in that case, they must set a special vm_flags bit to indicate so). do_no_page only unlocks the page after setting up the mapping completely. invalidation is excluded because it holds the page lock during invalidation of each page (and ensures that the page is not mapped while holding the lock). This also allows significant simplifications in do_no_page, because we have the page locked in the right place in the pagecache from the start. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 7月, 2007 4 次提交
-
-
由 Christoph Lameter 提交于
Detect slab objects being passed to the page oriented functions of the VM. It is not sufficient to simply return NULL because the functions calling page_mapping may depend on other items of the page_struct also to be setup properly. Moreover slab object may not be properly aligned. The page oriented functions of the VM expect to operate on page aligned, page sized objects. Operations on object straddling page boundaries may only affect the objects partially which may lead to surprising results. It is better to detect eventually remaining uses and eliminate them. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rusty Russell 提交于
I can never remember what the function to register to receive VM pressure is called. I have to trace down from __alloc_pages() to find it. It's called "set_shrinker()", and it needs Your Help. 1) Don't hide struct shrinker. It contains no magic. 2) Don't allocate "struct shrinker". It's not helpful. 3) Call them "register_shrinker" and "unregister_shrinker". 4) Call the function "shrink" not "shrinker". 5) Reduce the 17 lines of waffly comments to 13, but document it properly. Signed-off-by: NRusty Russell <rusty@rustcorp.com.au> Cc: David Chinner <dgc@sgi.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
This patch adds the kernelcore= parameter for x86. Once all patches are applied, a new command-line parameter exist and a new sysctl. This patch adds the necessary documentation. From: Yasunori Goto <y-goto@jp.fujitsu.com> When "kernelcore" boot option is specified, kernel can't boot up on ia64 because of an infinite loop. In addition, the parsing code can be handled in an architecture-independent manner. This patch uses common code to handle the kernelcore= parameter. It is only available to architectures that support arch-independent zone-sizing (i.e. define CONFIG_ARCH_POPULATES_NODE_MAP). Other architectures will ignore the boot parameter. [bunk@stusta.de: make cmdline_parse_kernelcore() static] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com> Acked-by: NAndy Whitcroft <apw@shadowen.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE that is only usable by allocations that specify both __GFP_HIGHMEM and __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a single memory partition while allowing movable allocations to be satisfied from either partition. The patches may be applied with the list-based anti-fragmentation patches that groups pages together based on mobility. The size of the zone is determined by a kernelcore= parameter specified at boot-time. This specifies how much memory is usable by non-movable allocations and the remainder is used for ZONE_MOVABLE. Any range of pages within ZONE_MOVABLE can be released by migrating the pages or by reclaiming. When selecting a zone to take pages from for ZONE_MOVABLE, there are two things to consider. First, only memory from the highest populated zone is used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second, the amount of memory usable by the kernel will be spread evenly throughout NUMA nodes where possible. If the nodes are not of equal size, the amount of memory usable by the kernel on some nodes may be greater than others. By default, the zone is not as useful for hugetlb allocations because they are pinned and non-migratable (currently at least). A sysctl is provided that allows huge pages to be allocated from that zone. This means that the huge page pool can be resized to the size of ZONE_MOVABLE during the lifetime of the system assuming that pages are not mlocked. Despite huge pages being non-movable, we do not introduce additional external fragmentation of note as huge pages are always the largest contiguous block we care about. Credit goes to Andy Whitcroft for catching a large variety of problems during review of the patches. This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added memory continues to be placed in their existing destination as there is no mechanism to redirect them to a specific zone. [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code] [akpm@linux-foundation.org: various fixes] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 7月, 2007 3 次提交
-
-
由 Alexey Dobriyan 提交于
I forgot to remove capability.h from mm.h while removing sched.h! This patch remedies that, because the only inline function which was using CAP_something was made out of line. Cross-compile tested without regressions on: all powerpc defconfigs all mips defconfigs all m68k defconfigs all arm defconfigs all ia64 defconfigs alpha alpha-allnoconfig alpha-defconfig alpha-up arm i386 i386-allnoconfig i386-defconfig i386-up ia64 ia64-allnoconfig ia64-defconfig ia64-up m68k mips parisc parisc-allnoconfig parisc-defconfig parisc-up powerpc powerpc-up s390 s390-allnoconfig s390-defconfig s390-up sparc sparc-allnoconfig sparc-defconfig sparc-up sparc64 sparc64-allnoconfig sparc64-defconfig sparc64-up um-x86_64 x86_64 x86_64-allnoconfig x86_64-defconfig x86_64-up as well as my two usual configs. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
This is a straightforward split of do_mmap_pgoff() into two functions: - do_mmap_pgoff() checks the parameters, and calculates the vma flags. Then it calls - mmap_region(), which does the actual mapping Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Beulich 提交于
This symbol got orphaned quite a while ago. Signed-off-by: NJan Beulich <jbeulich@novell.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 6月, 2007 1 次提交
-
-
由 Hugh Dickins 提交于
Nicolas Ferre reports oops from flush_dcache_page() on ARM when using SLUB: which reuses page->mapping as page->slab. The page_mapping() function, used by ARM and PA-RISC flush_dcache_page() implementations, must not confuse SLUB pages with those which have page->mapping set. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: Nicolas Ferre <nicolas.ferre@rfo.atmel.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 5月, 2007 1 次提交
-
-
由 Alexey Dobriyan 提交于
First thing mm.h does is including sched.h solely for can_do_mlock() inline function which has "current" dereference inside. By dealing with can_do_mlock() mm.h can be detached from sched.h which is good. See below, why. This patch a) removes unconditional inclusion of sched.h from mm.h b) makes can_do_mlock() normal function in mm/mlock.c c) exports can_do_mlock() to not break compilation d) adds sched.h inclusions back to files that were getting it indirectly. e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were getting them indirectly Net result is: a) mm.h users would get less code to open, read, preprocess, parse, ... if they don't need sched.h b) sched.h stops being dependency for significant number of files: on x86_64 allmodconfig touching sched.h results in recompile of 4083 files, after patch it's only 3744 (-8.3%). Cross-compile tested on all arm defconfigs, all mips defconfigs, all powerpc defconfigs, alpha alpha-up arm i386 i386-up i386-defconfig i386-allnoconfig ia64 ia64-up m68k mips parisc parisc-up powerpc powerpc-up s390 s390-up sparc sparc-up sparc64 sparc64-up um-x86_64 x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig as well as my two usual configs. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 5月, 2007 5 次提交
-
-
由 Christoph Lameter 提交于
Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
The patch adds PageTail(page) and PageHead(page) to check if a page is the head or the tail of a compound page. This is done by masking the two bits describing the state of a compound page and then comparing them. So one comparision and a branch instead of two bit checks and two branches. Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
If we add a new flag so that we can distinguish between the first page and the tail pages then we can avoid to use page->private in the first page. page->private == page for the first page, so there is no real information in there. Freeing up page->private makes the use of compound pages more transparent. They become more usable like real pages. Right now we have to be careful f.e. if we are going beyond PAGE_SIZE allocations in the slab on i386 because we can then no longer use the private field. This is one of the issues that cause us not to support debugging for page size slabs in SLAB. Having page->private available for SLUB would allow more meta information in the page struct. I can probably avoid the 16 bit ints that I have in there right now. Also if page->private is available then a compound page may be equipped with buffer heads. This may free up the way for filesystems to support larger blocks than page size. We add PageTail as an alias of PageReclaim. Compound pages cannot currently be reclaimed. Because of the alias one needs to check PageCompound first. The RFC for the this approach was discussed at http://marc.info/?t=117574302800001&r=1&w=2 [nacc@us.ibm.com: fix hugetlbfs] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Minimum gcc version is 3.2 now. However, with likely profiling, even modern gcc versions cannot always eliminate the call. Replace the placeholder functions with the more conventional empty static inlines, which should be optimal for everyone. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jeremy Fitzhardinge 提交于
Add a new mm function apply_to_page_range() which applies a given function to every pte in a given virtual address range in a given mm structure. This is a generic alternative to cut-and-pasting the Linux idiomatic pagetable walking code in every place that a sequence of PTEs must be accessed. Although this interface is intended to be useful in a wide range of situations, it is currently used specifically by several Xen subsystems, for example: to ensure that pagetables have been allocated for a virtual address range, and to construct batched special pagetable update requests to map I/O memory (in ioremap()). [akpm@linux-foundation.org: fix warning, unpleasantly] Signed-off-by: NIan Pratt <ian.pratt@xensource.com> Signed-off-by: NChristian Limpach <Christian.Limpach@cl.cam.ac.uk> Signed-off-by: NChris Wright <chrisw@sous-sol.org> Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Matt Mackall <mpm@waste.org> Acked-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 3月, 2007 1 次提交
-
-
由 Adrian Bunk 提交于
shmem_{nopage,mmap} are no longer used in ipc/shm.c Signed-off-by: NAdrian Bunk <bunk@stusta.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 2月, 2007 2 次提交
-
-
由 Benjamin Herrenschmidt 提交于
Add a NOPFN_REFAULT return code for vm_ops->nopfn() equivalent to NOPAGE_REFAULT for vmops->nopage() indicating that the handler requests a re-execution of the faulting instruction Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Arnd Bergmann <arnd.bergmann@de.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Add a vm_insert_pfn helper, so that ->fault handlers can have nopfn functionality by installing their own pte and returning NULL. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Arnd Bergmann <arnd.bergmann@de.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 2月, 2007 2 次提交
-
-
由 Ken Chen 提交于
shmem backed file does not have page writeback, nor it participates in backing device's dirty or writeback accounting. So using generic __set_page_dirty_nobuffers() for its .set_page_dirty aops method is a bit overkill. It unnecessarily prolongs shm unmap latency. For example, on a densely populated large shm segment (sevearl GBs), the unmapping operation becomes painfully long. Because at unmap, kernel transfers dirty bit in PTE into page struct and to the radix tree tag. The operation of tagging the radix tree is particularly expensive because it has to traverse the tree from the root to the leaf node on every dirty page. What's bothering is that radix tree tag is used for page write back. However, shmem is memory backed and there is no page write back for such file system. And in the end, we spend all that time tagging radix tree and none of that fancy tagging will be used. So let's simplify it by introduce a new aops __set_page_dirty_no_writeback and this will speed up shm unmap. Signed-off-by: NKen Chen <kenchen@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andy Whitcroft 提交于
Currently if we have a non-zero ZONES_SHIFT we assume we are able to rely on that as the bottom edge of the ZONEID, if not then we use the NODES_PGOFF as the right end of either NODES _or_ SECTION. This latter is more luck than judgement and would be incorrect if we reordered the SECTION,NODE,ZONE options in the fields space. Really what we want is the lower of the right hand end of the two fields we are using (either NODE,ZONE or SECTION,ZONE). Codify that explicitly. As always allow for there being no bits in either of the fields, such as might be valid in a non-numa machine with only a zone NORMAL. I have checked that the compiler is still able to constant fold all of this away correctly. Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 2月, 2007 1 次提交
-
-
由 Roland McGrath 提交于
This patch adds a utility function install_special_mapping, for creating a special vma using a fixed set of preallocated pages as backing, such as for a vDSO. This consolidates some nearly identical code used for vDSO mapping reimplemented for different architectures. Signed-off-by: NRoland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 1月, 2007 1 次提交
-
-
由 Roland McGrath 提交于
This patch adds the VM_ALWAYSDUMP flag for vm_flags in vm_area_struct. This provides a clean explicit way to have a vma always included in core dumps, as is needed for vDSO's. Signed-off-by: NRoland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 1月, 2007 1 次提交
-
-
由 Dave Hansen 提交于
Fix an oops experienced on the Cell architecture when init-time functions, early_*(), are called at runtime. It alters the call paths to make sure that the callers explicitly say whether the call is being made on behalf of a hotplug even, or happening at boot-time. It has been compile tested on ppc64, ia64, s390, i386 and x86_64. Acked-by: NArnd Bergmann <arndb@de.ibm.com> Signed-off-by: NDave Hansen <haveblue@us.ibm.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: NAndy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 08 12月, 2006 4 次提交
-
-
由 Andy Whitcroft 提交于
Currently we we use the lru head link of the second page of a compound page to hold its destructor. This was ok when it was purely an internal implmentation detail. However, hugetlbfs overrides this destructor violating the layering. Abstract this out as explicit calls, also introduce a type for the callback function allowing them to be type checked. For each callback we pre-declare the function, causing a type error on definition rather than on use elsewhere. [akpm@osdl.org: cleanups] Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Andy Whitcroft 提交于
NUMA node ids are passed as either int or unsigned int almost exclusivly page_to_nid and zone_to_nid both return unsigned long. This is a throw back to when page_to_nid was a #define and was thus exposing the real type of the page flags field. In addition to fixing up the definitions of page_to_nid and zone_to_nid I audited the users of these functions identifying the following incorrect uses: 1) mm/page_alloc.c show_node() -- printk dumping the node id, 2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison against numa_node_id() which returns an int from cpu_to_node(), and 3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which uses bit_set which in generic code takes an int. Signed-off-by: NAndy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
vm_area_cachep is used to store vm_area_structs. So move to mm.h. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
The zone table is mostly not needed. If we have a node in the page flags then we can get to the zone via NODE_DATA() which is much more likely to be already in the cpu cache. In case of SMP and UP NODE_DATA() is a constant pointer which allows us to access an exact replica of zonetable in the node_zones field. In all of the above cases there will be no need at all for the zone table. The only remaining case is if in a NUMA system the node numbers do not fit into the page flags. In that case we make sparse generate a table that maps sections to nodes and use that table to to figure out the node number. This table is sized to fit in a single cache line for the known 32 bit NUMA platform which makes it very likely that the information can be obtained without a cache miss. For sparsemem the zone table seems to be have been fairly large based on the maximum possible number of sections and the number of zones per node. There is some memory saving by removing zone_table. The main benefit is to reduce the cache foootprint of the VM from the frequent lookups of zones. Plus it simplifies the page allocator. [akpm@osdl.org: build fix] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 21 10月, 2006 1 次提交
-
-
由 Alexey Dobriyan 提交于
Despite mm.h is not being exported header, it does contain one thing which is part of userspace ABI -- value disabling OOM killer for given process. So, a) create and export include/linux/oom.h b) move OOM_DISABLE define there. c) turn bounding values of /proc/$PID/oom_adj into defines and export them too. Note: mass __KERNEL__ removal will be done later. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 12 10月, 2006 1 次提交
-
-
由 Nick Piggin 提交于
Move the lock debug checks below the page reserved checks. Also, having debug_check_no_locks_freed in kernel_map_pages is wrong. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 06 10月, 2006 1 次提交
-
-
由 Benjamin Herrenschmidt 提交于
Add a way for a no_page() handler to request a retry of the faulting instruction. It goes back to userland on page faults and just tries again in get_user_pages(). I added a cond_resched() in the loop in that later case. The problem I have with signal and spufs is an actual bug affecting apps and I don't see other ways of fixing it. In addition, we are having issues with infiniband and 64k pages (related to the way the hypervisor deals with some HV cards) that will require us to muck around with the MMU from within the IB driver's no_page() (it's a pSeries specific driver) and return to the caller the same way using NOPAGE_REFAULT. And to add to this, the graphics folks have been following a new approach of memory management that involves transparently swapping objects between video ram and main meory. To do that, they need installing PTEs from a no_page() handler as well and that also requires returning with NOPAGE_REFAULT. (For the later, they are currently using io_remap_pfn_range to install one PTE from no_page() which is a bit racy, we need to add a check for the PTE having already been installed afer taking the lock, but that's ok, they are only at the proof-of-concept stage. I'll send a patch adding a "clean" function to do that, we can use that from spufs too and get rid of the sparsemem hacks we do to create struct page for SPEs. Basically, that provides a generic solution for being able to have no_page() map hardware devices, which is something that I think sound driver folks have been asking for some time too). All of these things depend on having the NOPAGE_REFAULT exit path from no_page() handlers. Signed-off-by: NBenjamin Herrenchmidt <benh@kernel.crashing.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 01 10月, 2006 2 次提交
-
-
由 Keith Mannthey 提交于
Fix up externs in memory_hotplug.c. Cleanup. Signed-off-by: NKeith Mannthey <kmannth@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 David Howells 提交于
Move some functions out of the buffering code that aren't strictly buffering specific. This is a precursor to being able to disable the block layer. (*) Moved some stuff out of fs/buffer.c: (*) The file sync and general sync stuff moved to fs/sync.c. (*) The superblock sync stuff moved to fs/super.c. (*) do_invalidatepage() moved to mm/truncate.c. (*) try_to_release_page() moved to mm/filemap.c. (*) Moved some related declarations between header files: (*) declarations for do_invalidatepage() and try_to_release_page() moved to linux/mm.h. (*) __set_page_dirty_buffers() moved to linux/buffer_head.h. Signed-Off-By: NDavid Howells <dhowells@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 9月, 2006 4 次提交
-
-
由 David Howells 提交于
Move the fallback arch_vma_name() to a sensible place (kernel/signal.c). Currently it's in fs/proc/task_mmu.c, a file that is dependent on both CONFIG_PROC_FS and CONFIG_MMU being enabled, but it's used from kernel/signal.c from where it is called unconditionally. [akpm@osdl.org: build fix] Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Jes Sorensen 提交于
Implement do_no_pfn() for handling mapping of memory without a struct page backing it. This avoids creating fake page table entries for regions which are not backed by real memory. This feature is used by the MSPEC driver and other users, where it is highly undesirable to have a struct page sitting behind the page (for instance if the page is accessed in cached mode via the struct page in parallel to the the driver accessing it uncached, which can result in data corruption on some architectures, such as ia64). This version uses specific NOPFN_{SIGBUS,OOM} return values, rather than expect all negative pfn values would be an error. It also bugs on cow mappings as this would not work with the VM. [akpm@osdl.org: micro-optimise] Signed-off-by: NJes Sorensen <jes@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
Add the node in order to optimize zone_to_nid. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NPaul Jackson <pj@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Heiko Carstens 提交于
This moves the definition of struct page from mm.h to its own header file page-struct.h. This is a prereq to fix SetPageUptodate which is broken on s390: #define SetPageUptodate(_page) do { struct page *__page = (_page); if (!test_and_set_bit(PG_uptodate, &__page->flags)) page_test_and_clear_dirty(_page); } while (0) _page gets used twice in this macro which can cause subtle bugs. Using __page for the page_test_and_clear_dirty call doesn't work since it causes yet another problem with the page_test_and_clear_dirty macro as well. In order to avoid all these problems caused by macros it seems to be a good idea to get rid of them and convert them to static inline functions. Because of header file include order it's necessary to have a seperate header file for the struct page definition. Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-