1. 31 7月, 2007 1 次提交
  2. 30 7月, 2007 3 次提交
    • A
      Remove fs.h from mm.h · 4e950f6f
      Alexey Dobriyan 提交于
      Remove fs.h from mm.h. For this,
       1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
       2) Add back fs.h or less bloated headers (err.h) to files that need it.
      
      As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files
      rebuilt down to 3444 (-12.3%).
      
      Cross-compile tested without regressions on my two usual configs and (sigh):
      
      alpha              arm-mx1ads        mips-bigsur          powerpc-ebony
      alpha-allnoconfig  arm-neponset      mips-capcella        powerpc-g5
      alpha-defconfig    arm-netwinder     mips-cobalt          powerpc-holly
      alpha-up           arm-netx          mips-db1000          powerpc-iseries
      arm                arm-ns9xxx        mips-db1100          powerpc-linkstation
      arm-assabet        arm-omap_h2_1610  mips-db1200          powerpc-lite5200
      arm-at91rm9200dk   arm-onearm        mips-db1500          powerpc-maple
      arm-at91rm9200ek   arm-picotux200    mips-db1550          powerpc-mpc7448_hpc2
      arm-at91sam9260ek  arm-pleb          mips-ddb5477         powerpc-mpc8272_ads
      arm-at91sam9261ek  arm-pnx4008       mips-decstation      powerpc-mpc8313_rdb
      arm-at91sam9263ek  arm-pxa255-idp    mips-e55             powerpc-mpc832x_mds
      arm-at91sam9rlek   arm-realview      mips-emma2rh         powerpc-mpc832x_rdb
      arm-ateb9200       arm-realview-smp  mips-excite          powerpc-mpc834x_itx
      arm-badge4         arm-rpc           mips-fulong          powerpc-mpc834x_itxgp
      arm-carmeva        arm-s3c2410       mips-ip22            powerpc-mpc834x_mds
      arm-cerfcube       arm-shannon       mips-ip27            powerpc-mpc836x_mds
      arm-clps7500       arm-shark         mips-ip32            powerpc-mpc8540_ads
      arm-collie         arm-simpad        mips-jazz            powerpc-mpc8544_ds
      arm-corgi          arm-spitz         mips-jmr3927         powerpc-mpc8560_ads
      arm-csb337         arm-trizeps4      mips-malta           powerpc-mpc8568mds
      arm-csb637         arm-versatile     mips-mipssim         powerpc-mpc85xx_cds
      arm-ebsa110        i386              mips-mpc30x          powerpc-mpc8641_hpcn
      arm-edb7211        i386-allnoconfig  mips-msp71xx         powerpc-mpc866_ads
      arm-em_x270        i386-defconfig    mips-ocelot          powerpc-mpc885_ads
      arm-ep93xx         i386-up           mips-pb1100          powerpc-pasemi
      arm-footbridge     ia64              mips-pb1500          powerpc-pmac32
      arm-fortunet       ia64-allnoconfig  mips-pb1550          powerpc-ppc64
      arm-h3600          ia64-bigsur       mips-pnx8550-jbs     powerpc-prpmc2800
      arm-h7201          ia64-defconfig    mips-pnx8550-stb810  powerpc-ps3
      arm-h7202          ia64-gensparse    mips-qemu            powerpc-pseries
      arm-hackkit        ia64-sim          mips-rbhma4200       powerpc-up
      arm-integrator     ia64-sn2          mips-rbhma4500       s390
      arm-iop13xx        ia64-tiger        mips-rm200           s390-allnoconfig
      arm-iop32x         ia64-up           mips-sb1250-swarm    s390-defconfig
      arm-iop33x         ia64-zx1          mips-sead            s390-up
      arm-ixp2000        m68k              mips-tb0219          sparc
      arm-ixp23xx        m68k-amiga        mips-tb0226          sparc-allnoconfig
      arm-ixp4xx         m68k-apollo       mips-tb0287          sparc-defconfig
      arm-jornada720     m68k-atari        mips-workpad         sparc-up
      arm-kafa           m68k-bvme6000     mips-wrppmc          sparc64
      arm-kb9202         m68k-hp300        mips-yosemite        sparc64-allnoconfig
      arm-ks8695         m68k-mac          parisc               sparc64-defconfig
      arm-lart           m68k-mvme147      parisc-allnoconfig   sparc64-up
      arm-lpd270         m68k-mvme16x      parisc-defconfig     um-x86_64
      arm-lpd7a400       m68k-q40          parisc-up            x86_64
      arm-lpd7a404       m68k-sun3         powerpc              x86_64-allnoconfig
      arm-lubbock        m68k-sun3x        powerpc-cell         x86_64-defconfig
      arm-lusl7200       mips              powerpc-celleb       x86_64-up
      arm-mainstone      mips-atlas        powerpc-chrp32
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e950f6f
    • R
      Introduce CONFIG_SUSPEND for suspend-to-Ram and standby · 296699de
      Rafael J. Wysocki 提交于
      Introduce CONFIG_SUSPEND representing the ability to enter system sleep
      states, such as the ACPI S3 state, and allow the user to choose SUSPEND
      and HIBERNATION independently of each other.
      
      Make HOTPLUG_CPU be selected automatically if SUSPEND or HIBERNATION has
      been chosen and the kernel is intended for SMP systems.
      
      Also, introduce CONFIG_PM_SLEEP which is automatically selected if
      CONFIG_SUSPEND or CONFIG_HIBERNATION is set and use it to select the
      code needed for both suspend and hibernation.
      
      The top-level power management headers and the ACPI code related to
      suspend and hibernation are modified to use the new definitions (the
      changes in drivers/acpi/sleep/main.c are, mostly, moving code to reduce
      the number of ifdefs).
      
      There are many other files in which CONFIG_PM can be replaced with
      CONFIG_PM_SLEEP or even with CONFIG_SUSPEND, but they can be updated in
      the future.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      296699de
    • R
      Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION · b0cb1a19
      Rafael J. Wysocki 提交于
      Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION to avoid
      confusion (among other things, with CONFIG_SUSPEND introduced in the
      next patch).
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b0cb1a19
  3. 27 7月, 2007 3 次提交
  4. 25 7月, 2007 2 次提交
  5. 24 7月, 2007 1 次提交
  6. 23 7月, 2007 1 次提交
  7. 22 7月, 2007 3 次提交
  8. 20 7月, 2007 26 次提交
    • P
      mm: fix memory hotplug oops from ZONE_MOVABLE changes. · e228929b
      Paul Mundt 提交于
      zone_movable_pfn is presently marked as __initdata and referenced from
      adjust_zone_range_for_zone_movable(), which in turn is referenced by
      zone_spanned_pages_in_node().  Both of these are __meminit annotated.  When
      memory hotplug is enabled, this will oops on a hot-add, due to
      zone_movable_pfn having been freed.
      
      __meminitdata annotation gives the desired behaviour.
      
      This will only impact platforms that enable both memory hotplug
      and ARCH_POPULATES_NODE_MAP.
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e228929b
    • P
      mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Paul Mundt 提交于
      Slab destructors were no longer supported after Christoph's
      c59def9f change. They've been
      BUGs for both slab and slub, and slob never supported them
      either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      20c2df83
    • L
      slub: fix ksize() for zero-sized pointers · 9550b105
      Linus Torvalds 提交于
      The slab and slob allocators already did this right, but slub would call
      "get_object_page()" on the magic ZERO_SIZE_PTR, with all kinds of nasty
      end results.
      
      Noted by Ingo Molnar.
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9550b105
    • L
      Fix up non-NUMA SLAB configuration for zero-sized allocations · a5c96d8a
      Linus Torvalds 提交于
      I suspect Christoph tested his code only in the NUMA configuration, for
      the combination of SLAB+non-NUMA the zero-sized kmalloc's would not work.
      
      Of course, this would only trigger in configurations where those zero-
      sized allocations happen (not very common), so that may explain why it
      wasn't more widely noticed.
      
      Seen by by Andi Kleen under qemu, and there seems to be a report by
      Michael Tsirkin on it too.
      
      Cc: Andi Kleen <ak@suse.de>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Michael S. Tsirkin <mst@dev.mellanox.co.il>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5c96d8a
    • R
      lguest: export symbols for lguest as a module · 5992b6da
      Rusty Russell 提交于
      lguest does some fairly lowlevel things to support a host, which
      normal modules don't need:
      
      math_state_restore:
      	When the guest triggers a Device Not Available fault, we need
      	to be able to restore the FPU
      
      __put_task_struct:
      	We need to hold a reference to another task for inter-guest
      	I/O, and put_task_struct() is an inline function which calls
      	__put_task_struct.
      
      access_process_vm:
      	We need to access another task for inter-guest I/O.
      
      map_vm_area & __get_vm_area:
      	We need to map the switcher shim (ie. monitor) at 0xFFC01000.
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5992b6da
    • A
      move page writeback acounting out of macros · d688abf5
      Andrew Morton 提交于
      page-writeback accounting is presently performed in the page-flags macros.
      This is inconsistent and a bit ugly and makes it awkward to implement
      per-backing_dev under-writeback page accounting.
      
      So move this accounting down to the callsite(s).
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d688abf5
    • A
      hugetlb: use set_compound_page_dtor · f8af0bb8
      Akinobu Mita 提交于
      Use appropriate accessor function to set compound page destructor
      function.
      
      Cc:  William Irwin <wli@holomorphy.com>
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8af0bb8
    • H
      Remove nid_lock from alloc_fresh_huge_page · 7ed5cb2b
      Hugh Dickins 提交于
      The fix to that race in alloc_fresh_huge_page() which could give an illegal
      node ID did not need nid_lock at all: the fix was to replace static int nid
      by static int prev_nid and do the work on local int nid.  nid_lock did make
      sure that racers strictly roundrobin the nodes, but that's not something we
      need to enforce strictly.  Kill nid_lock.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ed5cb2b
    • B
      vmalloc_32 should use GFP_KERNEL · 7ac674f5
      Benjamin Herrenschmidt 提交于
      I've noticed lots of failures of vmalloc_32 on machines where it
      shouldn't have failed unless it was doing an atomic operation.
      
      Looking closely, I noticed that:
      
      #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
      #define GFP_VMALLOC32 GFP_DMA32
      #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
      #define GFP_VMALLOC32 GFP_DMA
      #else
      #define GFP_VMALLOC32 GFP_KERNEL
      #endif
      
      Which seems to be incorrect, it should always -or- in the DMA flags
      on top of GFP_KERNEL, thus this patch.
      
      This fixes frequent errors launchin X with the nouveau DRM for example.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Dave Airlie <airlied@linux.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ac674f5
    • D
      FRV: work around a possible compiler bug · ea02e3dd
      David Howells 提交于
      Work around a possible bug in the FRV compiler.
      
      What appears to be happening is that gcc resolves the
      __builtin_constant_p() in kmalloc() to true, but then fails to reduce the
      therefore constant conditions in the if-statements it guards to constant
      results.
      
      When compiling with -O2 or -Os, one single spurious error crops up in
      cpuup_callback() in mm/slab.c.  This can be avoided by making the memsize
      variable const.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea02e3dd
    • A
      dequeue_huge_page() warning fix · 3abf7afd
      Andrew Morton 提交于
      mm/hugetlb.c: In function `dequeue_huge_page':
      mm/hugetlb.c:72: warning: 'nid' might be used uninitialized in this function
      
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3abf7afd
    • O
      mm: variable length argument support · b6a2fea3
      Ollie Wild 提交于
      Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
      the old mm into the new mm.
      
      We create the new mm before the binfmt code runs, and place the new stack at
      the very top of the address space.  Once the binfmt code runs and figures out
      where the stack should be, we move it downwards.
      
      It is a bit peculiar in that we have one task with two mm's, one of which is
      inactive.
      
      [a.p.zijlstra@chello.nl: limit stack size]
      Signed-off-by: NOllie Wild <aaw@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      [bunk@stusta.de: unexport bprm_mm_init]
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6a2fea3
    • F
      readahead: sanify file_ra_state names · f9acc8c7
      Fengguang Wu 提交于
      Rename some file_ra_state variables and remove some accessors.
      
      It results in much simpler code.
      Kudos to Rusty!
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9acc8c7
    • R
      readahead: split ondemand readahead interface into two functions · cf914a7d
      Rusty Russell 提交于
      Split ondemand readahead interface into two functions.  I think this makes it
      a little clearer for non-readahead experts (like Rusty).
      
      Internally they both call ondemand_readahead(), but the page argument is
      changed to an obvious boolean flag.
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf914a7d
    • F
      mm: share PG_readahead and PG_reclaim · fe3cba17
      Fengguang Wu 提交于
      Share the same page flag bit for PG_readahead and PG_reclaim.
      
      One is used only on file reads, another is only for emergency writes.  One
      is used mostly for fresh/young pages, another is for old pages.
      
      Combinations of possible interactions are:
      
      a) clear PG_reclaim => implicit clear of PG_readahead
      	it will delay an asynchronous readahead into a synchronous one
      	it actually does _good_ for readahead:
      		the pages will be reclaimed soon, it's readahead thrashing!
      		in this case, synchronous readahead makes more sense.
      
      b) clear PG_readahead => implicit clear of PG_reclaim
      	one(and only one) page will not be reclaimed in time
      	it can be avoided by checking PageWriteback(page) in readahead first
      
      c) set PG_reclaim => implicit set of PG_readahead
      	will confuse readahead and make it restart the size rampup process
      	it's a trivial problem, and can mostly be avoided by checking
      	PageWriteback(page) first in readahead
      
      d) set PG_readahead => implicit set of PG_reclaim
      	PG_readahead will never be set on already cached pages.
      	PG_reclaim will always be cleared on dirtying a page.
      	so not a problem.
      
      In summary,
      	a)   we get better behavior
      	b,d) possible interactions can be avoided
      	c)   racy condition exists that might affect readahead, but the chance
      	     is _really_ low, and the hurt on readahead is trivial.
      
      Compound pages also use PG_reclaim, but for now they do not interact with
      reclaim/readahead code.
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe3cba17
    • F
      readahead: remove the old algorithm · c743d96b
      Fengguang Wu 提交于
      Remove the old readahead algorithm.
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c743d96b
    • F
      readahead: convert filemap invocations · 3ea89ee8
      Fengguang Wu 提交于
      Convert filemap reads to use on-demand readahead.
      
      The new call scheme is to
      - call readahead on non-cached page
      - call readahead on look-ahead page
      - update prev_index when finished with the read request
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ea89ee8
    • F
      readahead: on-demand readahead logic · 122a21d1
      Fengguang Wu 提交于
      This is a minimal readahead algorithm that aims to replace the current one.
      It is more flexible and reliable, while maintaining almost the same behavior
      and performance.  Also it is full integrated with adaptive readahead.
      
      It is designed to be called on demand:
      	- on a missing page, to do synchronous readahead
      	- on a lookahead page, to do asynchronous readahead
      
      In this way it eliminated the awkward workarounds for cache hit/miss,
      readahead thrashing, retried read, and unaligned read.  It also adopts the
      data structure introduced by adaptive readahead, parameterizes readahead
      pipelining with `lookahead_index', and reduces the current/ahead windows to
      one single window.
      
      HEURISTICS
      
      The logic deals with four cases:
      
      	- sequential-next
      		found a consistent readahead window, so push it forward
      
      	- random
      		standalone small read, so read as is
      
      	- sequential-first
      		create a new readahead window for a sequential/oversize request
      
      	- lookahead-clueless
      		hit a lookahead page not associated with the readahead window,
      		so create a new readahead window and ramp it up
      
      In each case, three parameters are determined:
      
      	- readahead index: where the next readahead begins
      	- readahead size:  how much to readahead
      	- lookahead size:  when to do the next readahead (for pipelining)
      
      BEHAVIORS
      
      The old behaviors are maximally preserved for trivial sequential/random reads.
      Notable changes are:
      
      	- It no longer imposes strict sequential checks.
      	  It might help some interleaved cases, and clustered random reads.
      	  It does introduce risks of a random lookahead hit triggering an
      	  unexpected readahead. But in general it is more likely to do good
      	  than to do evil.
      
      	- Interleaved reads are supported in a minimal way.
      	  Their chances of being detected and proper handled are still low.
      
      	- Readahead thrashings are better handled.
      	  The current readahead leads to tiny average I/O sizes, because it
      	  never turn back for the thrashed pages.  They have to be fault in
      	  by do_generic_mapping_read() one by one.  Whereas the on-demand
      	  readahead will redo readahead for them.
      
      OVERHEADS
      
      The new code reduced the overheads of
      
      	- excessively calling the readahead routine on small sized reads
      	  (the current readahead code insists on seeing all requests)
      
      	- doing a lot of pointless page-cache lookups for small cached files
      	  (the current readahead only turns itself off after 256 cache hits,
      	  unfortunately most files are < 1MB, so never see that chance)
      
      That accounts for speedup of
      	- 0.3% on 1-page sequential reads on sparse file
      	- 1.2% on 1-page cache hot sequential reads
      	- 3.2% on 256-page cache hot sequential reads
      	- 1.3% on cache hot `tar /lib`
      
      However, it does introduce one extra page-cache lookup per cache miss, which
      impacts random reads slightly. That's 1% overheads for 1-page random reads on
      sparse file.
      
      PERFORMANCE
      
      The basic benchmark setup is
      	- 2.6.20 kernel with on-demand readahead
      	- 1MB max readahead size
      	- 2.9GHz Intel Core 2 CPU
      	- 2GB memory
      	- 160G/8M Hitachi SATA II 7200 RPM disk
      
      The benchmarks show that
      	- it maintains the same performance for trivial sequential/random reads
      	- sysbench/OLTP performance on MySQL gains up to 8%
      	- performance on readahead thrashing gains up to 3 times
      
      iozone throughput (KB/s): roughly the same
      ==========================================
      iozone -c -t1 -s 4096m -r 64k
      
      			       2.6.20          on-demand      gain
      first run
      	  "  Initial write "   61437.27        64521.53      +5.0%
      	  "        Rewrite "   47893.02        48335.20      +0.9%
      	  "           Read "   62111.84        62141.49      +0.0%
      	  "        Re-read "   62242.66        62193.17      -0.1%
      	  "   Reverse Read "   50031.46        49989.79      -0.1%
      	  "    Stride read "    8657.61         8652.81      -0.1%
      	  "    Random read "   13914.28        13898.23      -0.1%
      	  " Mixed workload "   19069.27        19033.32      -0.2%
      	  "   Random write "   14849.80        14104.38      -5.0%
      	  "         Pwrite "   62955.30        65701.57      +4.4%
      	  "          Pread "   62209.99        62256.26      +0.1%
      
      second run
      	  "  Initial write "   60810.31        66258.69      +9.0%
      	  "        Rewrite "   49373.89        57833.66     +17.1%
      	  "           Read "   62059.39        62251.28      +0.3%
      	  "        Re-read "   62264.32        62256.82      -0.0%
      	  "   Reverse Read "   49970.96        50565.72      +1.2%
      	  "    Stride read "    8654.81         8638.45      -0.2%
      	  "    Random read "   13901.44        13949.91      +0.3%
      	  " Mixed workload "   19041.32        19092.04      +0.3%
      	  "   Random write "   14019.99        14161.72      +1.0%
      	  "         Pwrite "   64121.67        68224.17      +6.4%
      	  "          Pread "   62225.08        62274.28      +0.1%
      
      In summary, writes are unstable, reads are pretty close on average:
      
      			  access pattern  2.6.20  on-demand   gain
      				   Read  62085.61  62196.38  +0.2%
      				Re-read  62253.49  62224.99  -0.0%
      			   Reverse Read  50001.21  50277.75  +0.6%
      			    Stride read   8656.21   8645.63  -0.1%
      			    Random read  13907.86  13924.07  +0.1%
      	 		 Mixed workload  19055.29  19062.68  +0.0%
      				  Pread  62217.53  62265.27  +0.1%
      
      aio-stress: roughly the same
      ============================
      aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
      aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
      
      					2.6.20      on-demand  delta
      			sequential	 92.57s      92.54s    -0.0%
      			random		311.87s     312.15s    +0.1%
      
      sysbench fileio: roughly the same
      =================================
      sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
      	 --file-total-size=4G --file-block-size=64K \
      	 --num-threads=001 --max-requests=10000 --max-time=900 run
      
      				threads    2.6.20   on-demand    delta
      		first run
      				      1   59.1974s    59.2262s  +0.0%
      				      2   58.0575s    58.2269s  +0.3%
      				      4   48.0545s    47.1164s  -2.0%
      				      8   41.0684s    41.2229s  +0.4%
      				     16   35.8817s    36.4448s  +1.6%
      				     32   32.6614s    32.8240s  +0.5%
      				     64   23.7601s    24.1481s  +1.6%
      				    128   24.3719s    23.8225s  -2.3%
      				    256   23.2366s    22.0488s  -5.1%
      
      		second run
      				      1   59.6720s    59.5671s  -0.2%
      				      8   41.5158s    41.9541s  +1.1%
      				     64   25.0200s    23.9634s  -4.2%
      				    256   22.5491s    20.9486s  -7.1%
      
      Note that the numbers are not very stable because of the writes.
      The overall performance is close when we sum all seconds up:
      
                      sum all up               495.046s    491.514s   -0.7%
      
      sysbench oltp (trans/sec): up to 8% gain
      ========================================
      sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
      	 --mysql-socket=/var/run/mysqld/mysqld.sock \
      	 --mysql-user=root --mysql-password=readahead \
      	 --num-threads=064 --max-requests=10000 --max-time=900 run
      
      	10000-transactions run
      				threads    2.6.20   on-demand    gain
      				      1     62.81       64.56   +2.8%
      				      2     67.97       70.93   +4.4%
      				      4     81.81       85.87   +5.0%
      				      8     94.60       97.89   +3.5%
      				     16     99.07      104.68   +5.7%
      				     32     95.93      104.28   +8.7%
      				     64     96.48      103.68   +7.5%
      	5000-transactions run
      				      1     48.21       48.65   +0.9%
      				      8     68.60       70.19   +2.3%
      				     64     70.57       74.72   +5.9%
      	2000-transactions run
      				      1     37.57       38.04   +1.3%
      				      2     38.43       38.99   +1.5%
      				      4     45.39       46.45   +2.3%
      				      8     51.64       52.36   +1.4%
      				     16     54.39       55.18   +1.5%
      				     32     52.13       54.49   +4.5%
      				     64     54.13       54.61   +0.9%
      
      That's interesting results. Some investigations show that
      	- MySQL is accessing the db file non-uniformly: some parts are
      	  more hot than others
      	- It is mostly doing 4-page random reads, and sometimes doing two
      	  reads in a row, the latter one triggers a 16-page readahead.
      	- The on-demand readahead leaves many lookahead pages (flagged
      	  PG_readahead) there. Many of them will be hit, and trigger
      	  more readahead pages. Which might save more seeks.
      	- Naturally, the readahead windows tend to lie in hot areas,
      	  and the lookahead pages in hot areas is more likely to be hit.
      	- The more overall read density, the more possible gain.
      
      That also explains the adaptive readahead tricks for clustered random reads.
      
      readahead thrashing: 3 times better
      ===================================
      We boot kernel with "mem=128m single", and start a 100KB/s stream on every
      second, until reaching 200 streams.
      
      			      max throughput     min avg I/O size
      		2.6.20:            5MB/s               16KB
      		on-demand:        15MB/s              140KB
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      122a21d1
    • F
      readahead: data structure and routines · 5ce1110b
      Fengguang Wu 提交于
      Extend struct file_ra_state to support the on-demand readahead logic.  Also
      define some helpers for it.
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ce1110b
    • F
      readahead: MIN_RA_PAGES/MAX_RA_PAGES macros · f615bfca
      Fengguang Wu 提交于
      Define two convenient macros for read-ahead:
      	- MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
      	- MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD
      
      Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_
      page sizes like 64k.
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f615bfca
    • F
      readahead: add look-ahead support to __do_page_cache_readahead() · 46fc3e7b
      Fengguang Wu 提交于
      Add look-ahead support to __do_page_cache_readahead().
      
      It works by
      	- mark the Nth backwards page with PG_readahead,
      	(which instructs the page's first reader to invoke readahead)
      	- and only do the marking for newly allocated pages.
      	(to prevent blindly doing readahead on already cached pages)
      
      Look-ahead is a technique to achieve I/O pipelining:
      
      While the application is working through a chunk of cached pages, the kernel
      reads-ahead the next chunk of pages _before_ time of need.  It effectively
      hides low level I/O latencies to high level applications.
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46fc3e7b
    • F
      readahead: introduce PG_readahead · d77c2d7c
      Fengguang Wu 提交于
      Introduce a new page flag: PG_readahead.
      
      It acts as a look-ahead mark, which tells the page reader: Hey, it's time to
      invoke the read-ahead logic.  For the sake of I/O pipelining, don't wait until
      it runs out of cached pages!
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d77c2d7c
    • M
      only allow nonlinear vmas for ram backed filesystems · 3ee6dafc
      Miklos Szeredi 提交于
      page_mkclean() doesn't re-protect ptes for non-linear mappings, so a later
      re-dirty through such a mapping will not generate a fault, PG_dirty will
      not reflect the dirty state and the dirty count will be skewed.  This
      implies that msync() is also currently broken for nonlinear mappings.
      
      The easiest solution is to emulate remap_file_pages on non-linear mappings
      with simple mmap() for non ram-backed filesystems.  Applications continue
      to work (albeit slower), as long as the number of remappings remain below
      the maximum vma count.
      
      However all currently known real uses of non-linear mappings are for ram
      backed filesystems, which this patch doesn't affect.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ee6dafc
    • N
      mm: fix clear_page_dirty_for_io vs fault race · 79352894
      Nick Piggin 提交于
      Fix msync data loss and (less importantly) dirty page accounting
      inaccuracies due to the race remaining in clear_page_dirty_for_io().
      
      The deleted comment explains what the race was, and the added comments
      explain how it is fixed.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79352894
    • N
      mm: fault feedback #2 · 83c54070
      Nick Piggin 提交于
      This patch completes Linus's wish that the fault return codes be made into
      bit flags, which I agree makes everything nicer.  This requires requires
      all handle_mm_fault callers to be modified (possibly the modifications
      should go further and do things like fault accounting in handle_mm_fault --
      however that would be for another patch).
      
      [akpm@linux-foundation.org: fix alpha build]
      [akpm@linux-foundation.org: fix s390 build]
      [akpm@linux-foundation.org: fix sparc build]
      [akpm@linux-foundation.org: fix sparc64 build]
      [akpm@linux-foundation.org: fix ia64 build]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Bryan Wu <bryan.wu@analog.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: NKyle McMartin <kyle@mcmartin.ca>
      Acked-by: NHaavard Skinnemoen <hskinnemoen@atmel.com>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      Acked-by: NAndi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ Still apparently needs some ARM and PPC loving - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83c54070
    • N
      mm: fault feedback #1 · d0217ac0
      Nick Piggin 提交于
      Change ->fault prototype.  We now return an int, which contains
      VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
       FAULT_RET_ code tells the VM whether a page was found, whether it has been
      locked, and potentially other things.  This is not quite the way he wanted
      it yet, but that's changed in the next patch (which requires changes to
      arch code).
      
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      
      The page is now returned via a page pointer in the vm_fault struct.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0217ac0