1. 17 Jun, 2009 7 commits
  2. 03 Apr, 2009 2 commits
  3. 26 Mar, 2009 1 commit
  4. 20 Oct, 2008 1 commit
    • vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Committed by Rik van Riel
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.
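
As a rough illustration of the split described above, the following sketch picks a list for a page by first choosing the anon or file set and then the active or inactive list. All names here (`toy_lru_list`, `toy_page`, `toy_page_lru`) are hypothetical and are not taken from the patch; it is a toy model under those assumptions, not the kernel code.

```c
#include <assert.h>

/* Toy model only: these names are hypothetical, not the kernel's definitions. */
enum toy_lru_list {
    TOY_LRU_INACTIVE_ANON,
    TOY_LRU_ACTIVE_ANON,
    TOY_LRU_INACTIVE_FILE,
    TOY_LRU_ACTIVE_FILE,
    TOY_NR_LRU_LISTS
};

struct toy_page {
    int swap_backed; /* 1: backed by memory and swap (includes tmpfs); 0: backed by a real file system */
    int active;      /* 1: belongs on an active list; 0: on an inactive list */
};

/* Pick the list for a page: first the anon/file set, then active/inactive. */
static enum toy_lru_list toy_page_lru(const struct toy_page *page)
{
    int base = page->swap_backed ? TOY_LRU_INACTIVE_ANON : TOY_LRU_INACTIVE_FILE;
    int lru = base + (page->active ? 1 : 0);

    assert(lru >= 0 && lru < TOY_NR_LRU_LISTS);
    return (enum toy_lru_list)lru;
}
```

With separate sets, a scan that looks for page cache pages to evict can walk only the two file lists and skip the anon lists entirely.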
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 17 Oct, 2008 1 commit
  6. 27 Jul, 2008 1 commit
  7. 30 Apr, 2008 1 commit
  8. 20 Mar, 2008 1 commit
  9. 17 Oct, 2007 7 commits
  10. 10 Oct, 2007 1 commit
  11. 20 Jul, 2007 8 commits
    • readahead: sanify file_ra_state names · f9acc8c7
      Committed by Fengguang Wu
      Rename some file_ra_state variables and remove some accessors.
      
      It results in much simpler code.
      Kudos to Rusty!
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: split ondemand readahead interface into two functions · cf914a7d
      Committed by Rusty Russell
      Split ondemand readahead interface into two functions.  I think this makes it
      a little clearer for non-readahead experts (like Rusty).
      
      Internally they both call ondemand_readahead(), but the page argument is
      changed to an obvious boolean flag.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: share PG_readahead and PG_reclaim · fe3cba17
      Committed by Fengguang Wu
      Share the same page flag bit for PG_readahead and PG_reclaim.
      
      One is used only on file reads, the other only for emergency writes.  One
      is used mostly for fresh/young pages, the other for old pages.
      
      Combinations of possible interactions are:
      
      a) clear PG_reclaim => implicit clear of PG_readahead
      	it will delay an asynchronous readahead into a synchronous one
      	it actually does _good_ for readahead:
      		the pages will be reclaimed soon, it's readahead thrashing!
      		in this case, synchronous readahead makes more sense.
      
      b) clear PG_readahead => implicit clear of PG_reclaim
      	one (and only one) page will not be reclaimed in time
      	it can be avoided by checking PageWriteback(page) in readahead first
      
      c) set PG_reclaim => implicit set of PG_readahead
      	will confuse readahead and make it restart the size rampup process
      	it's a trivial problem, and can mostly be avoided by checking
      	PageWriteback(page) first in readahead
      
      d) set PG_readahead => implicit set of PG_reclaim
      	PG_readahead will never be set on already cached pages.
      	PG_reclaim will always be cleared on dirtying a page.
      	so not a problem.
      
      In summary,
      	a)   we get better behavior
      	b,d) possible interactions can be avoided
      	c)   a race condition exists that might affect readahead, but the chance
      	     is _really_ low, and the harm to readahead is trivial.
      
      Compound pages also use PG_reclaim, but for now they do not interact with
      reclaim/readahead code.
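
The implicit set/clear interactions listed in a) to d) follow directly from both names aliasing one bit. The sketch below models that with a plain flags word; the bit position and all `toy_` names are assumptions made for illustration, not the kernel's page-flag helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit position; the point is only that both names alias one bit. */
#define TOY_PG_RECLAIM   (1UL << 0)
#define TOY_PG_READAHEAD TOY_PG_RECLAIM

struct toy_page { unsigned long flags; };

static void toy_set_reclaim(struct toy_page *p)     { p->flags |= TOY_PG_RECLAIM; }
static void toy_clear_reclaim(struct toy_page *p)   { p->flags &= ~TOY_PG_RECLAIM; }
static void toy_set_readahead(struct toy_page *p)   { p->flags |= TOY_PG_READAHEAD; }
static void toy_clear_readahead(struct toy_page *p) { p->flags &= ~TOY_PG_READAHEAD; }

static bool toy_test_reclaim(const struct toy_page *p)
{
    return (p->flags & TOY_PG_RECLAIM) != 0;
}

static bool toy_test_readahead(const struct toy_page *p)
{
    return (p->flags & TOY_PG_READAHEAD) != 0;
}

/* Walk through the four interactions; each assert checks one implicit effect. */
static void toy_demo_aliasing(void)
{
    struct toy_page p = { 0 };

    toy_set_readahead(&p);   assert(toy_test_reclaim(&p));    /* d) */
    toy_clear_reclaim(&p);   assert(!toy_test_readahead(&p)); /* a) */
    toy_set_reclaim(&p);     assert(toy_test_readahead(&p));  /* c) */
    toy_clear_readahead(&p); assert(!toy_test_reclaim(&p));   /* b) */
}
```

The model shows only the aliasing itself; whether each interaction is harmless depends on the usage patterns argued in the message above.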
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: remove the old algorithm · c743d96b
      Committed by Fengguang Wu
      Remove the old readahead algorithm.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: on-demand readahead logic · 122a21d1
      Committed by Fengguang Wu
      This is a minimal readahead algorithm that aims to replace the current one.
      It is more flexible and reliable, while maintaining almost the same behavior
      and performance.  It is also fully integrated with adaptive readahead.
      
      It is designed to be called on demand:
      	- on a missing page, to do synchronous readahead
      	- on a lookahead page, to do asynchronous readahead
      
      In this way it eliminates the awkward workarounds for cache hit/miss,
      readahead thrashing, retried read, and unaligned read.  It also adopts the
      data structure introduced by adaptive readahead, parameterizes readahead
      pipelining with `lookahead_index', and reduces the current/ahead windows to
      one single window.
      
      HEURISTICS
      
      The logic deals with four cases:
      
      	- sequential-next
      		found a consistent readahead window, so push it forward
      
      	- random
      		standalone small read, so read as is
      
      	- sequential-first
      		create a new readahead window for a sequential/oversize request
      
      	- lookahead-clueless
      		hit a lookahead page not associated with the readahead window,
      		so create a new readahead window and ramp it up
      
      In each case, three parameters are determined:
      
      	- readahead index: where the next readahead begins
      	- readahead size:  how much to readahead
      	- lookahead size:  when to do the next readahead (for pipelining)
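
The four cases above can be sketched as a small classifier. This is a simplified model, not the patch's logic: the structures, the `toy_` names, the exact tests, and the order in which they are checked are all assumptions chosen to match the one-line descriptions in the list.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_ra_case {
    TOY_SEQUENTIAL_NEXT,
    TOY_RANDOM,
    TOY_SEQUENTIAL_FIRST,
    TOY_LOOKAHEAD_CLUELESS
};

/* Hypothetical single readahead window covering pages [start, start + size). */
struct toy_ra_window {
    unsigned long start;
    unsigned long size;
};

struct toy_request {
    unsigned long offset;      /* first page index requested */
    unsigned long nr_pages;    /* request length in pages */
    unsigned long prev_offset; /* last page index of the previous read */
    bool hit_lookahead;        /* the read landed on a page flagged as lookahead */
};

static enum toy_ra_case toy_classify(const struct toy_ra_window *ra,
                                     const struct toy_request *rq,
                                     unsigned long max_pages)
{
    assert(rq->nr_pages > 0);

    /* The read continues exactly where the window ends: push the window forward. */
    if (ra->size > 0 && rq->offset == ra->start + ra->size)
        return TOY_SEQUENTIAL_NEXT;

    /* A lookahead page was hit, but it does not belong to this window. */
    if (rq->hit_lookahead)
        return TOY_LOOKAHEAD_CLUELESS;

    /* Start of file, a read that follows the previous one, or an oversize request. */
    if (rq->offset == 0 || rq->offset == rq->prev_offset + 1 ||
        rq->nr_pages > max_pages)
        return TOY_SEQUENTIAL_FIRST;

    /* Standalone small read: read as is. */
    return TOY_RANDOM;
}
```

A fuller model would go on to derive the readahead index, readahead size, and lookahead size from the chosen case; the sketch stops at classification.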
      
      BEHAVIORS
      
      The old behaviors are maximally preserved for trivial sequential/random reads.
      Notable changes are:
      
      	- It no longer imposes strict sequential checks.
      	  It might help some interleaved cases, and clustered random reads.
      	  It does introduce risks of a random lookahead hit triggering an
      	  unexpected readahead. But in general it is more likely to do good
      	  than to do evil.
      
      	- Interleaved reads are supported in a minimal way.
      	  Their chances of being detected and properly handled are still low.

      	- Readahead thrashing is handled better.
      	  The current readahead leads to tiny average I/O sizes, because it
      	  never turns back for the thrashed pages.  They have to be faulted in
      	  by do_generic_mapping_read() one by one.  Whereas the on-demand
      	  readahead will redo readahead for them.
      
      OVERHEADS
      
      The new code reduces the overhead of
      
      	- excessively calling the readahead routine on small sized reads
      	  (the current readahead code insists on seeing all requests)
      
      	- doing a lot of pointless page-cache lookups for small cached files
      	  (the current readahead only turns itself off after 256 cache hits,
      	  unfortunately most files are < 1MB, so never see that chance)
      
      That accounts for speedups of
      	- 0.3% on 1-page sequential reads on sparse file
      	- 1.2% on 1-page cache hot sequential reads
      	- 3.2% on 256-page cache hot sequential reads
      	- 1.3% on cache hot `tar /lib`
      
      However, it does introduce one extra page-cache lookup per cache miss, which
      impacts random reads slightly. That is a 1% overhead for 1-page random reads
      on a sparse file.
      
      PERFORMANCE
      
      The basic benchmark setup is
      	- 2.6.20 kernel with on-demand readahead
      	- 1MB max readahead size
      	- 2.9GHz Intel Core 2 CPU
      	- 2GB memory
      	- 160G/8M Hitachi SATA II 7200 RPM disk
      
      The benchmarks show that
      	- it maintains the same performance for trivial sequential/random reads
      	- sysbench/OLTP performance on MySQL gains up to 8%
      	- performance on readahead thrashing gains up to 3 times
      
      iozone throughput (KB/s): roughly the same
      ==========================================
      iozone -c -t1 -s 4096m -r 64k
      
      			       2.6.20          on-demand      gain
      first run
      	  "  Initial write "   61437.27        64521.53      +5.0%
      	  "        Rewrite "   47893.02        48335.20      +0.9%
      	  "           Read "   62111.84        62141.49      +0.0%
      	  "        Re-read "   62242.66        62193.17      -0.1%
      	  "   Reverse Read "   50031.46        49989.79      -0.1%
      	  "    Stride read "    8657.61         8652.81      -0.1%
      	  "    Random read "   13914.28        13898.23      -0.1%
      	  " Mixed workload "   19069.27        19033.32      -0.2%
      	  "   Random write "   14849.80        14104.38      -5.0%
      	  "         Pwrite "   62955.30        65701.57      +4.4%
      	  "          Pread "   62209.99        62256.26      +0.1%
      
      second run
      	  "  Initial write "   60810.31        66258.69      +9.0%
      	  "        Rewrite "   49373.89        57833.66     +17.1%
      	  "           Read "   62059.39        62251.28      +0.3%
      	  "        Re-read "   62264.32        62256.82      -0.0%
      	  "   Reverse Read "   49970.96        50565.72      +1.2%
      	  "    Stride read "    8654.81         8638.45      -0.2%
      	  "    Random read "   13901.44        13949.91      +0.3%
      	  " Mixed workload "   19041.32        19092.04      +0.3%
      	  "   Random write "   14019.99        14161.72      +1.0%
      	  "         Pwrite "   64121.67        68224.17      +6.4%
      	  "          Pread "   62225.08        62274.28      +0.1%
      
      In summary, writes are unstable, reads are pretty close on average:
      
      			  access pattern  2.6.20  on-demand   gain
      				   Read  62085.61  62196.38  +0.2%
      				Re-read  62253.49  62224.99  -0.0%
      			   Reverse Read  50001.21  50277.75  +0.6%
      			    Stride read   8656.21   8645.63  -0.1%
      			    Random read  13907.86  13924.07  +0.1%
      	 		 Mixed workload  19055.29  19062.68  +0.0%
      				  Pread  62217.53  62265.27  +0.1%
      
      aio-stress: roughly the same
      ============================
      aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
      aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
      
      					2.6.20      on-demand  delta
      			sequential	 92.57s      92.54s    -0.0%
      			random		311.87s     312.15s    +0.1%
      
      sysbench fileio: roughly the same
      =================================
      sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
      	 --file-total-size=4G --file-block-size=64K \
      	 --num-threads=001 --max-requests=10000 --max-time=900 run
      
      				threads    2.6.20   on-demand    delta
      		first run
      				      1   59.1974s    59.2262s  +0.0%
      				      2   58.0575s    58.2269s  +0.3%
      				      4   48.0545s    47.1164s  -2.0%
      				      8   41.0684s    41.2229s  +0.4%
      				     16   35.8817s    36.4448s  +1.6%
      				     32   32.6614s    32.8240s  +0.5%
      				     64   23.7601s    24.1481s  +1.6%
      				    128   24.3719s    23.8225s  -2.3%
      				    256   23.2366s    22.0488s  -5.1%
      
      		second run
      				      1   59.6720s    59.5671s  -0.2%
      				      8   41.5158s    41.9541s  +1.1%
      				     64   25.0200s    23.9634s  -4.2%
      				    256   22.5491s    20.9486s  -7.1%
      
      Note that the numbers are not very stable because of the writes.
      The overall performance is close when we sum all seconds up:
      
                      sum all up               495.046s    491.514s   -0.7%
      
      sysbench oltp (trans/sec): up to 8% gain
      ========================================
      sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
      	 --mysql-socket=/var/run/mysqld/mysqld.sock \
      	 --mysql-user=root --mysql-password=readahead \
      	 --num-threads=064 --max-requests=10000 --max-time=900 run
      
      	10000-transactions run
      				threads    2.6.20   on-demand    gain
      				      1     62.81       64.56   +2.8%
      				      2     67.97       70.93   +4.4%
      				      4     81.81       85.87   +5.0%
      				      8     94.60       97.89   +3.5%
      				     16     99.07      104.68   +5.7%
      				     32     95.93      104.28   +8.7%
      				     64     96.48      103.68   +7.5%
      	5000-transactions run
      				      1     48.21       48.65   +0.9%
      				      8     68.60       70.19   +2.3%
      				     64     70.57       74.72   +5.9%
      	2000-transactions run
      				      1     37.57       38.04   +1.3%
      				      2     38.43       38.99   +1.5%
      				      4     45.39       46.45   +2.3%
      				      8     51.64       52.36   +1.4%
      				     16     54.39       55.18   +1.5%
      				     32     52.13       54.49   +4.5%
      				     64     54.13       54.61   +0.9%
      
      These are interesting results. Some investigation shows that
      	- MySQL is accessing the db file non-uniformly: some parts are
      	  hotter than others
      	- It is mostly doing 4-page random reads, and sometimes doing two
      	  reads in a row, the latter of which triggers a 16-page readahead.
      	- The on-demand readahead leaves many lookahead pages (flagged
      	  PG_readahead) there. Many of them will be hit, and trigger
      	  more readahead pages, which might save more seeks.
      	- Naturally, the readahead windows tend to lie in hot areas,
      	  and the lookahead pages in hot areas are more likely to be hit.
      	- The higher the overall read density, the larger the possible gain.
      
      That also explains the adaptive readahead tricks for clustered random reads.
      
      readahead thrashing: 3 times better
      ===================================
      We boot the kernel with "mem=128m single", and start a new 100KB/s stream
      every second, until reaching 200 streams.
      
      			      max throughput     min avg I/O size
      		2.6.20:            5MB/s               16KB
      		on-demand:        15MB/s              140KB
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: data structure and routines · 5ce1110b
      Committed by Fengguang Wu
      Extend struct file_ra_state to support the on-demand readahead logic.  Also
      define some helpers for it.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: MIN_RA_PAGES/MAX_RA_PAGES macros · f615bfca
      Committed by Fengguang Wu
      Define two convenient macros for read-ahead:
      	- MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
      	- MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD
      
      Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_
      page sizes like 64k.
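
To see why the minimum is rounded up, consider the arithmetic with a 64k page size. The sketch below assumes limits of 128 KB (max) and 16 KB (min) purely for illustration; those values and the `toy_` names are assumptions, not definitions quoted from the patch.

```c
#include <assert.h>

/* Assumed limits in kilobytes, chosen for illustration only. */
#define TOY_VM_MAX_READAHEAD_KB 128UL
#define TOY_VM_MIN_READAHEAD_KB 16UL

/* Round down: never read ahead more than the byte limit allows. */
static unsigned long toy_max_ra_pages(unsigned long page_size)
{
    assert(page_size > 0);
    return TOY_VM_MAX_READAHEAD_KB * 1024UL / page_size;
}

/* Round up: the minimum must be at least one page, even for large pages. */
static unsigned long toy_min_ra_pages(unsigned long page_size)
{
    assert(page_size > 0);
    return (TOY_VM_MIN_READAHEAD_KB * 1024UL + page_size - 1) / page_size;
}
```

Under these assumed limits, 4k pages give a minimum of 4 pages and a maximum of 32, while 64k pages give a minimum of 1 and a maximum of 2. Rounding the minimum down instead would yield 0 pages for 64k pages.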
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: add look-ahead support to __do_page_cache_readahead() · 46fc3e7b
      Committed by Fengguang Wu
      Add look-ahead support to __do_page_cache_readahead().
      
      It works by
      	- marking the Nth backwards page with PG_readahead,
      	(which instructs the page's first reader to invoke readahead)
      	- and only doing the marking for newly allocated pages.
      	(to prevent blindly doing readahead on already cached pages)
      
      Look-ahead is a technique to achieve I/O pipelining:
      
      While the application is working through a chunk of cached pages, the kernel
      reads ahead the next chunk of pages _before_ they are needed.  It effectively
      hides low-level I/O latencies from high-level applications.
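
The marking rule described above can be modelled over a plain array of page slots. In this sketch the arrays, the `toy_mark_lookahead` name, and the choice of index `nr_to_read - lookahead_size` as "the Nth backwards page" are assumptions for illustration, not the function's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Flag the page that sits lookahead_size pages before the end of the chunk,
 * but only if it is newly allocated (not already cached).
 * Returns the marked index, or -1 if nothing was marked.
 */
static long toy_mark_lookahead(const bool *already_cached, bool *pg_readahead,
                               size_t nr_to_read, size_t lookahead_size)
{
    long marked = -1;
    size_t i;

    assert(lookahead_size <= nr_to_read);
    for (i = 0; i < nr_to_read; i++) {
        pg_readahead[i] = false;
        if (already_cached[i])
            continue; /* not newly allocated: never mark it */
        if (i == nr_to_read - lookahead_size) {
            pg_readahead[i] = true;
            marked = (long)i;
        }
    }
    return marked;
}
```

For an 8-page chunk with a lookahead size of 3, the page at index 5 gets the flag, so its first reader would start the next readahead while 3 cached pages still remain to be consumed.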
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 08 May, 2007 2 commits
  13. 12 Feb, 2007 1 commit
  14. 11 Dec, 2006 1 commit
  15. 09 Dec, 2006 1 commit
  16. 08 Dec, 2006 1 commit
  17. 04 Nov, 2006 1 commit
  18. 27 Jun, 2006 1 commit
  19. 26 Jun, 2006 1 commit