1. 17 6月, 2009 33 次提交
    • M
      page allocator: calculate the preferred zone for allocation only once · 5117f45d
      Mel Gorman 提交于
      get_page_from_freelist() can be called multiple times for an allocation.
      Part of this calculates the preferred_zone which is the first usable zone
      in the zonelist but the zone depends on the GFP flags specified at the
      beginning of the allocation call.  This patch calculates preferred_zone
      once.  It's safe to do this because if preferred_zone is NULL at the start
      of the call, no amount of direct reclaim or other actions will change the
      fact the allocation will fail.
      
      [akpm@linux-foundation.org: remove (void) casts]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5117f45d
    • M
      page allocator: move check for disabled anti-fragmentation out of fastpath · 49255c61
      Mel Gorman 提交于
      On low-memory systems, anti-fragmentation gets disabled as there is
      nothing it can do and it would just incur overhead shuffling pages between
      lists constantly.  Currently the check is made in the free page fast path
      for every page.  This patch moves it to a slow path.  On machines with low
      memory, there will be small amount of additional overhead as pages get
      shuffled between lists but it should quickly settle.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49255c61
    • M
      page allocator: break up the allocator entry point into fast and slow paths · 11e33f6a
      Mel Gorman 提交于
      The core of the page allocator is one giant function which allocates
      memory on the stack and makes calculations that may not be needed for
      every allocation.  This patch breaks up the allocator path into fast and
      slow paths for clarity.  Note the slow paths are still inlined but the
      entry is marked unlikely.  If they were not inlined, it actally increases
      text size to generate the as there is only one call site.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11e33f6a
    • M
      page allocator: check only once if the zonelist is suitable for the allocation · 7f82af97
      Mel Gorman 提交于
      It is possible with __GFP_THISNODE that no zones are suitable.  This patch
      makes sure the check is only made once.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f82af97
    • M
      page allocator: do not check NUMA node ID when the caller knows the node is valid · 6484eb3e
      Mel Gorman 提交于
      Callers of alloc_pages_node() can optionally specify -1 as a node to mean
      "allocate from the current node".  However, a number of the callers in
      fast paths know for a fact their node is valid.  To avoid a comparison and
      branch, this patch adds alloc_pages_exact_node() that only checks the nid
      with VM_BUG_ON().  Callers that know their node is valid are then
      converted.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Paul Mundt <lethal@linux-sh.org>	[for the SLOB NUMA bits]
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6484eb3e
    • M
      page allocator: do not sanity check order in the fast path · b3c466ce
      Mel Gorman 提交于
      No user of the allocator API should be passing in an order >= MAX_ORDER
      but we check for it on each and every allocation.  Delete this check and
      make it a VM_BUG_ON check further down the call path.
      
      [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3c466ce
    • M
      page allocator: replace __alloc_pages_internal() with __alloc_pages_nodemask() · d239171e
      Mel Gorman 提交于
      The start of a large patch series to clean up and optimise the page
      allocator.
      
      The performance improvements are in a wide range depending on the exact
      machine but the results I've seen so fair are approximately;
      
      kernbench:	0	to	 0.12% (elapsed time)
      		0.49%	to	 3.20% (sys time)
      aim9:		-4%	to	30% (for page_test and brk_test)
      tbench:		-1%	to	 4%
      hackbench:	-2.5%	to	 3.45% (mostly within the noise though)
      netperf-udp	-1.34%  to	 4.06% (varies between machines a bit)
      netperf-tcp	-0.44%  to	 5.22% (varies between machines a bit)
      
      I haven't sysbench figures at hand, but previously they were within the
      -0.5% to 2% range.
      
      On netperf, the client and server were bound to opposite number CPUs to
      maximise the problems with cache line bouncing of the struct pages so I
      expect different people to report different results for netperf depending
      on their exact machine and how they ran the test (different machines, same
      cpus client/server, shared cache but two threads client/server, different
      socket client/server etc).
      
      I also measured the vmlinux sizes for a single x86-based config with
      CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM.  The core of the
      .config is based on the Debian Lenny kernel config so I expect it to be
      reasonably typical.
      
      This patch:
      
      __alloc_pages_internal is the core page allocator function but essentially
      it is an alias of __alloc_pages_nodemask.  Naming a publicly available and
      exported function "internal" is also a big ugly.  This patch renames
      __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old
      nodemask function.
      
      Warning - This patch renames an exported symbol.  No kernel driver is
      affected by external drivers calling __alloc_pages_internal() should
      change the call to __alloc_pages_nodemask() without any alteration of
      parameters.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d239171e
    • H
      mm: alloc_large_system_hash check order · 6c0db466
      Hugh Dickins 提交于
      On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(),
      to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on
      order >= MAX_ORDER - it's hoping for order 11.  alloc_large_system_hash()
      had better make its own check on the order.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: David Miller <davem@davemloft.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c0db466
    • M
      cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Miao Xie 提交于
      Fix allocating page cache/slab object on the unallowed node when memory
      spread is set by updating tasks' mems_allowed after its cpuset's mems is
      changed.
      
      In order to update tasks' mems_allowed in time, we must modify the code of
      memory policy.  Because the memory policy is applied in the process's
      context originally.  After applying this patch, one task directly
      manipulates anothers mems_allowed, and we use alloc_lock in the
      task_struct to protect mems_allowed and memory policy of the task.
      
      But in the fast path, we didn't use lock to protect them, because adding a
      lock may lead to performance regression.  But if we don't add a lock,the
      task might see no nodes when changing cpuset's mems_allowed to some
      non-overlapping set.  In order to avoid it, we set all new allowed nodes,
      then clear newly disallowed ones.
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mpolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58568d2a
    • M
      cpusets: update tasks' page/slab spread flags in time · 950592f7
      Miao Xie 提交于
      Fix the bug that the kernel didn't spread page cache/slab object evenly
      over all the allowed nodes when spread flags were set by updating tasks'
      page/slab spread flags in time.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      950592f7
    • M
      cpusets: restructure the function cpuset_update_task_memory_state() · f3b39d47
      Miao Xie 提交于
      The kernel still allocates the page caches on old node after modifying its
      cpuset's mems when 'memory_spread_page' was set, or it didn't spread the
      page cache evenly over all the nodes that faulting task is allowed to usr
      after memory_spread_page was set.  it is caused by the old mem_allowed and
      flags of the task, the current kernel doesn't updates them unless some
      function invokes cpuset_update_task_memory_state(), it is too late
      sometimes.We must update the mem_allowed and the flags of the tasks in
      time.
      
      Slab has the same problem.
      
      The following patches fix this bug by updating tasks' mem_allowed and
      spread flag after its cpuset's mems or spread flag is changed.
      
      This patch:
      
      Extract a function from cpuset_update_task_memory_state().  It will be
      used later for update tasks' page/slab spread flags after its cpuset's
      flag is set
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3b39d47
    • H
      mm/page-writeback.c: dirty limit type should be unsigned long · dcf975d5
      H Hartley Sweeten 提交于
      get_dirty_limits() calls clip_bdi_dirty_limit() and task_dirty_limit()
      with variable pbdi_dirty as one of the arguments.  This variable is an
      unsigned long * but both functions expect it to be a long *.  This causes
      the following sparse warnings:
      
        warning: incorrect type in argument 3 (different signedness)
           expected long *pbdi_dirty
           got unsigned long *pbdi_dirty
        warning: incorrect type in argument 2 (different signedness)
           expected long *pdirty
           got unsigned long *pbdi_dirty
      
      Fix the warnings by changing the long * to unsigned long * in both
      functions.
      Signed-off-by: NH Hartley Sweeten <hsweeten@visionengravers.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dcf975d5
    • K
      vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC · 78dc583d
      KOSAKI Motohiro 提交于
      Commit 33c120ed ("more aggressively use
      lumpy reclaim") increased how aggressive lumpy reclaim was by isolating
      both active and inactive pages for asynchronous lumpy reclaim on
      costly-high-order pages and for cheap-high-order when memory pressure is
      high.  However, if the system is under heavy pressure and there are dirty
      pages, asynchronous IO may not be sufficient to reclaim a suitable page in
      time.
      
      This patch causes the caller to enter synchronous lumpy reclaim for
      costly-high-order pages and for cheap-high-order pages when under memory
      pressure.
      
      Minchan.kim@gmail.com said:
      
      Andy added synchronous lumpy reclaim with
      c661b078.  At that time, lumpy reclaim is
      not agressive.  His intension is just for high-order users.(above
      PAGE_ALLOC_COSTLY_ORDER).
      
      After some time, Rik added aggressive lumpy reclaim with
      33c120ed.  His intention was to do lumpy
      reclaim when high-order users and trouble getting a small set of
      contiguous pages.
      
      So we also have to add synchronous pageout for small set of contiguous
      pages.
      
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <Minchan.kim@gmail.com>
      Reviewed-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78dc583d
    • N
      mm: clean up get_user_pages_fast() documentation · d2bf6be8
      Nick Piggin 提交于
      Move more documentation for get_user_pages_fast into the new kerneldoc comment.
      Add some comments for get_user_pages as well.
      
      Also, move get_user_pages_fast declaration up to get_user_pages. It wasn't
      there initially because it was once a static inline function.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2bf6be8
    • W
      readahead: enforce full sync mmap readahead size · 7ffc59b4
      Wu Fengguang 提交于
      Now that we do readahead for sequential mmap reads, here is a simple
      evaluation of the impacts, and one further optimization.
      
      It's an NFS-root debian desktop system, readahead size = 60 pages.
      The numbers are grabbed after a fresh boot into console.
      
      approach        pgmajfault      RA miss ratio   mmap IO count   avg IO size(pages)
         A            383             31.6%           383             11
         B            225             32.4%           390             11
         C            224             32.6%           307             13
      
      case A: mmap sync/async readahead disabled
      case B: mmap sync/async readahead enabled, with enforced full async readahead size
      case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
      or:
      A = vanilla 2.6.30-rc1
      B = A plus mmap readahead
      C = B plus this patch
      
      The numbers show that
      - there are good possibilities for random mmap reads to trigger readahead
      - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
      - case C can further reduce IO count by 1/4
      - readahead miss ratios are not quite affected
      
      The theory is
      - readahead is _good_ for clustered random reads, and can perform
        _better_ than readaround because they could be _async_.
      - async readahead size is guaranteed to be larger than readaround
        size, and they are _async_, hence will mostly behave better
      However for B
      - sync readahead size could be smaller than readaround size, hence may
        make things worse by produce more smaller IOs
      which will be fixed by this patch.
      
      Final conclusion:
      - mmap readahead reduced major faults by 1/3 and no obvious overheads;
      - mmap io can be further reduced by 1/4 with this patch.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ffc59b4
    • W
    • W
      readahead: introduce context readahead algorithm · 10be0b37
      Wu Fengguang 提交于
      Introduce page cache context based readahead algorithm.
      This is to better support concurrent read streams in general.
      
      RATIONALE
      ---------
      The current readahead algorithm detects interleaved reads in a _passive_ way.
      Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...
      By checking for (offset == prev_offset + 1), it will discover the sequentialness
      between 3,4 and between 1004,1005, and start doing sequential readahead for the
      individual streams since page 4 and page 1005.
      
      The context readahead algorithm guarantees to discover the sequentialness no
      matter how the streams are interleaved. For the above example, it will start
      sequential readahead since page 2 and 1002.
      
      The trick is to poke for page @offset-1 in the page cache when it has no other
      clues on the sequentialness of request @offset: if the current requenst belongs
      to a sequential stream, that stream must have accessed page @offset-1 recently,
      and the page will still be cached now. So if page @offset-1 is there, we can
      take request @offset as a sequential access.
      
      BENEFICIARIES
      -------------
      - strictly interleaved reads  i.e. 1,1001,2,1002,3,1003,...
        the current readahead will take them as silly random reads;
        the context readahead will take them as two sequential streams.
      
      - cooperative IO processes   i.e. NFS and SCST
        They create a thread pool, farming off (sequential) IO requests to different
        threads which will be performing interleaved IO.
      
        It was not easy(or possible) to reliably tell from file->f_ra all those
        cooperative processes working on the same sequential stream, since they will
        have different file->f_ra instances. And NFSD's file->f_ra is particularly
        unusable, since their file objects are dynamically created for each request.
        The nfsd does have code trying to restore the f_ra bits, but not satisfactory.
      
        The new scheme is to detect the sequential pattern via looking up the page
        cache, which provides one single and consistent view of the pages recently
        accessed. That makes sequential detection for cooperative processes possible.
      
      USER REPORT
      -----------
      Vladislav recommends the addition of context readahead as a result of his SCST
      benchmarks. It leads to 6%~40% performance gains in various cases and achieves
      equal performance in others.                http://lkml.org/lkml/2009/3/19/239
      
      OVERHEADS
      ---------
      In theory, it introduces one extra page cache lookup per random read.  However
      the below benchmark shows context readahead to be slightly faster, wondering..
      
      Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
      each block size. The average throughputs are:
      
                             	original ra	context ra	gain
       4K random reads:	 65.561MB/s	 65.648MB/s	+0.1%
      16K random reads:	124.767MB/s	124.951MB/s	+0.1%
      64K random reads: 	162.123MB/s	162.278MB/s	+0.1%
      
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Tested-by: NVladislav Bolkhovitin <vst@vlnb.net>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10be0b37
    • W
      readahead: move the random read case to bottom · 045a2529
      Wu Fengguang 提交于
      Split all readahead cases, and move the random one to bottom.
      
      No behavior changes.
      
      This is to prepare for the introduction of context readahead, and make it
      easy for inserting accounting/tracing points for each case.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      045a2529
    • W
      radix-tree: add radix_tree_prev_hole() · dc566127
      Wu Fengguang 提交于
      The counterpart of radix_tree_next_hole(). To be used by context readahead.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc566127
    • W
      readahead: record mmap read-around states in file_ra_state · d30a1100
      Wu Fengguang 提交于
      Mmap read-around now shares the same code style and data structure with
      readahead code.
      
      This also removes do_page_cache_readahead().  Its last user, mmap
      read-around, has been changed to call ra_submit().
      
      The no-readahead-if-congested logic is dumped by the way.  Users will be
      pretty sensitive about the slow loading of executables.  So it's
      unfavorable to disabled mmap read-around on a congested queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d30a1100
    • W
      readahead: enforce full readahead size on async mmap readahead · 2fad6f5d
      Wu Fengguang 提交于
      We need this in one particular case and two more general ones.
      
      Now we do async readahead for sequential mmap reads, and do it with the
      help of PG_readahead.  For normal reads, PG_readahead is the sufficient
      condition to do a sequential readahead.  But unfortunately, for mmap
      reads, there is a tiny nuisance:
      
      [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
      [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
      [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
      [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                 ~~~~~~~~~~~~~
      
      An unfavorably small readahead.  The original dumb read-around size could
      be more efficient.
      
      That happened because ld-linux.so does a read(832) in L1 before mmap(),
      which triggers a 4-page readahead, with the second page tagged
      PG_readahead.
      
      L0: open("/lib/libc.so.6", O_RDONLY)        = 3
      L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
      L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
      L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
      L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
      L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
      L7: close(3)                                = 0
      
      In general, the PG_readahead flag will also be hit in cases
      
      - sequential reads
      
      - clustered random reads
      
      A full readahead size is desirable in both cases.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2fad6f5d
    • W
      readahead: sequential mmap readahead · 70ac23cf
      Wu Fengguang 提交于
      Auto-detect sequential mmap reads and do readahead for them.
      
      The sequential mmap readahead will be triggered when
      - sync readahead: it's a major fault and (prev_offset == offset-1);
      - async readahead: minor fault on PG_readahead page with valid readahead state.
      
      The benefits of doing readahead instead of read-around:
      - less I/O wait thanks to async readahead
      - double real I/O size and no more cache hits
      
      The single stream case is improved a little.
      For 100,000 sequential mmap reads:
      
                                          user       system    cpu        total
      (1-1)  plain -mm, 128KB readaround: 3.224      2.554     48.40%     11.838
      (1-2)  plain -mm, 256KB readaround: 3.170      2.392     46.20%     11.976
      (2)  patched -mm, 128KB readahead:  3.117      2.448     47.33%     11.607
      
      The patched (2) has smallest total time, since it has no cache hit overheads
      and less I/O block time(thanks to async readahead). Here the I/O size
      makes no much difference, since there's only one single stream.
      
      Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
      since the half of the read-around pages will be readahead cache hits.
      
      This is going to make _real_ differences for _concurrent_ IO streams.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ac23cf
    • L
      readahead: clean up and simplify the code for filemap page fault readahead · ef00e08e
      Linus Torvalds 提交于
      This shouldn't really change behavior all that much, but the single rather
      complex function with read-ahead inside a loop etc is broken up into more
      manageable pieces.
      
      The behaviour is also less subtle, with the read-ahead being done up-front
      rather than inside some subtle loop and thus avoiding the now unnecessary
      extra state variables (ie "did_readaround" is gone).
      
      Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
      the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
      which case the original code will directly jump to no_cached_page reading.
      
      Cc: Pavel Levshin <lpk@581.spb.su>
      Cc: <wli@movementarian.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef00e08e
    • W
      readahead: remove sync/async readahead call dependency · 51daa88e
      Wu Fengguang 提交于
      The readahead call scheme is error-prone in that it expects the call sites
      to check for async readahead after doing a sync one.  I.e.
      
      			if (!page)
      				page_cache_sync_readahead();
      			page = find_get_page();
      			if (page && PageReadahead(page))
      				page_cache_async_readahead();
      
      This is because PG_readahead could be set by a sync readahead for the
      _current_ newly faulted in page, and the readahead code simply expects one
      more callback on the same page to start the async readahead.  If the
      caller fails to do so, it will miss the PG_readahead bits and never able
      to start an async readahead.
      
      Eliminate this insane constraint by piggy-backing the async part into the
      current readahead window.
      
      Now if an async readahead should be started immediately after a sync one,
      the readahead logic itself will do it.  So the following code becomes
      valid: (the 'else' in particular)
      
      			if (!page)
      				page_cache_sync_readahead();
      			else if (PageReadahead(page))
      				page_cache_async_readahead();
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51daa88e
    • W
      readahead: increase interleaved readahead size · 160334a0
      Wu Fengguang 提交于
      Make sure interleaved readahead size is larger than request size.  This
      also makes the readahead window grow up more quickly.
      Reported-by: NXu Chenfeng <xcf@ustc.edu.cn>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      160334a0
    • W
      readahead: remove one unnecessary radix tree lookup · caca7cb7
      Wu Fengguang 提交于
      (hit_readahead_marker != 0) means the page at @offset is present, so we
      can search for non-present page starting from @offset+1.
      Reported-by: NXu Chenfeng <xcf@ustc.edu.cn>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      caca7cb7
    • W
      readahead: apply max_sane_readahead() limit in ondemand_readahead() · fc31d16a
      Wu Fengguang 提交于
      Just in case someone aggressively sets a huge readahead size.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc31d16a
    • W
      readahead: move max_sane_readahead() calls into force_page_cache_readahead() · f7e839dd
      Wu Fengguang 提交于
      Impact: code simplification.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7e839dd
    • W
      readahead: make mmap_miss an unsigned int · 1ebf26a9
      Wu Fengguang 提交于
      This makes the performance impact of possible mmap_miss wrap around to be
      temporary and tolerable: i.e.  MMAP_LOTSAMISS=100 extra readarounds.
      
      Otherwise if ever mmap_miss wraps around to negative, it takes INT_MAX
      cache misses to bring it back to normal state.  During the time mmap
      readaround will be _enabled_ for whatever wild random workload.  That's
      almost permanent performance impact.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1ebf26a9
    • A
      mm: consolidate init_mm definition · bb1f17b0
      Alexey Dobriyan 提交于
      * create mm/init-mm.c, move init_mm there
      * remove INIT_MM, initialize init_mm with C99 initializer
      * unexport init_mm on all arches:
      
        init_mm is already unexported on x86.
      
        One strange place is some OMAP driver (drivers/video/omap/) which
        won't build modular, but it's already wants get_vm_area() export.
        Somebody should look there.
      
      [akpm@linux-foundation.org: add missing #includes]
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Frysinger <vapier.adi@gmail.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb1f17b0
    • Y
      firmware_map: fix hang with x86/32bit · 3b0fde0f
      Yinghai Lu 提交于
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13484
      
      Peer reported:
      | The bug is introduced from kernel 2.6.27, if E820 table reserve the memory
      | above 4G in 32bit OS(BIOS-e820: 00000000fff80000 - 0000000120000000
      | (reserved)), system will report Int 6 error and hang up. The bug is caused by
      | the following code in drivers/firmware/memmap.c, the resource_size_t is 32bit
      | variable in 32bit OS, the BUG_ON() will be invoked to result in the Int 6
      | error. I try the latest 32bit Ubuntu and Fedora distributions, all hit this
      | bug.
      |======
      |static int firmware_map_add_entry(resource_size_t start, resource_size_t end,
      |                  const char *type,
      |                  struct firmware_map_entry *entry)
      
      and it only happen with CONFIG_PHYS_ADDR_T_64BIT is not set.
      
      it turns out we need to pass u64 instead of resource_size_t for that.
      
      [akpm@linux-foundation.org: add comment]
      Reported-and-tested-by: NPeer Chen <pchen@nvidia.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b0fde0f
    • R
      spi: takes size of a pointer to determine the size of the pointed-to type · 02141546
      Roel Kluin 提交于
      Do not take the size of a pointer to determine the size of the pointed-to
      type.
      Signed-off-by: NRoel Kluin <roel.kluin@gmail.com>
      Acked-by: NAnton Vorontsov <avorontsov@ru.mvista.com>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Kumar Gala <galak@gate.crashing.org>
      Acked-by: NGrant Likely <grant.likely@secretlab.ca>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02141546
    • A
      time: move PIT_TICK_RATE to linux/timex.h · 08604bd9
      Arnd Bergmann 提交于
      PIT_TICK_RATE is currently defined in four architectures, but in three
      different places.  While linux/timex.h is not the perfect place for it, it
      is still a reasonable replacement for those drivers that traditionally use
      asm/timex.h to get CLOCK_TICK_RATE and expect it to be the PIT frequency.
      
      Note that for Alpha, the actual value changed from 1193182UL to 1193180UL.
       This is unlikely to make a difference, and probably can only improve
      accuracy.  There was a discussion on the correct value of CLOCK_TICK_RATE
      a few years ago, after which every existing instance was getting changed
      to 1193182.  According to the specification, it should be
      1193181.818181...
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Dmitry Torokhov <dtor@mail.ru>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08604bd9
  2. 16 6月, 2009 7 次提交