1. 17 6月, 2009 19 次提交
    • W
      readahead: enforce full sync mmap readahead size · 7ffc59b4
      Wu Fengguang 提交于
      Now that we do readahead for sequential mmap reads, here is a simple
      evaluation of the impacts, and one further optimization.
      
      It's an NFS-root debian desktop system, readahead size = 60 pages.
      The numbers are grabbed after a fresh boot into console.
      
      approach        pgmajfault      RA miss ratio   mmap IO count   avg IO size(pages)
         A            383             31.6%           383             11
         B            225             32.4%           390             11
         C            224             32.6%           307             13
      
      case A: mmap sync/async readahead disabled
      case B: mmap sync/async readahead enabled, with enforced full async readahead size
      case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
      or:
      A = vanilla 2.6.30-rc1
      B = A plus mmap readahead
      C = B plus this patch
      
      The numbers show that
      - there are good possibilities for random mmap reads to trigger readahead
      - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
      - case C can further reduce IO count by 1/4
      - readahead miss ratios are not quite affected
      
      The theory is
      - readahead is _good_ for clustered random reads, and can perform
        _better_ than readaround because they could be _async_.
      - async readahead size is guaranteed to be larger than readaround
        size, and they are _async_, hence will mostly behave better
      However for B
      - sync readahead size could be smaller than readaround size, hence may
        make things worse by produce more smaller IOs
      which will be fixed by this patch.
      
      Final conclusion:
      - mmap readahead reduced major faults by 1/3 and no obvious overheads;
      - mmap io can be further reduced by 1/4 with this patch.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ffc59b4
    • W
    • W
      readahead: introduce context readahead algorithm · 10be0b37
      Wu Fengguang 提交于
      Introduce page cache context based readahead algorithm.
      This is to better support concurrent read streams in general.
      
      RATIONALE
      ---------
      The current readahead algorithm detects interleaved reads in a _passive_ way.
      Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...
      By checking for (offset == prev_offset + 1), it will discover the sequentialness
      between 3,4 and between 1004,1005, and start doing sequential readahead for the
      individual streams since page 4 and page 1005.
      
      The context readahead algorithm guarantees to discover the sequentialness no
      matter how the streams are interleaved. For the above example, it will start
      sequential readahead since page 2 and 1002.
      
      The trick is to poke for page @offset-1 in the page cache when it has no other
      clues on the sequentialness of request @offset: if the current requenst belongs
      to a sequential stream, that stream must have accessed page @offset-1 recently,
      and the page will still be cached now. So if page @offset-1 is there, we can
      take request @offset as a sequential access.
      
      BENEFICIARIES
      -------------
      - strictly interleaved reads  i.e. 1,1001,2,1002,3,1003,...
        the current readahead will take them as silly random reads;
        the context readahead will take them as two sequential streams.
      
      - cooperative IO processes   i.e. NFS and SCST
        They create a thread pool, farming off (sequential) IO requests to different
        threads which will be performing interleaved IO.
      
        It was not easy(or possible) to reliably tell from file->f_ra all those
        cooperative processes working on the same sequential stream, since they will
        have different file->f_ra instances. And NFSD's file->f_ra is particularly
        unusable, since their file objects are dynamically created for each request.
        The nfsd does have code trying to restore the f_ra bits, but not satisfactory.
      
        The new scheme is to detect the sequential pattern via looking up the page
        cache, which provides one single and consistent view of the pages recently
        accessed. That makes sequential detection for cooperative processes possible.
      
      USER REPORT
      -----------
      Vladislav recommends the addition of context readahead as a result of his SCST
      benchmarks. It leads to 6%~40% performance gains in various cases and achieves
      equal performance in others.                http://lkml.org/lkml/2009/3/19/239
      
      OVERHEADS
      ---------
      In theory, it introduces one extra page cache lookup per random read.  However
      the below benchmark shows context readahead to be slightly faster, wondering..
      
      Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
      each block size. The average throughputs are:
      
                             	original ra	context ra	gain
       4K random reads:	 65.561MB/s	 65.648MB/s	+0.1%
      16K random reads:	124.767MB/s	124.951MB/s	+0.1%
      64K random reads: 	162.123MB/s	162.278MB/s	+0.1%
      
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Tested-by: NVladislav Bolkhovitin <vst@vlnb.net>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10be0b37
    • W
      readahead: move the random read case to bottom · 045a2529
      Wu Fengguang 提交于
      Split all readahead cases, and move the random one to bottom.
      
      No behavior changes.
      
      This is to prepare for the introduction of context readahead, and make it
      easy for inserting accounting/tracing points for each case.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      045a2529
    • W
      radix-tree: add radix_tree_prev_hole() · dc566127
      Wu Fengguang 提交于
      The counterpart of radix_tree_next_hole(). To be used by context readahead.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc566127
    • W
      readahead: record mmap read-around states in file_ra_state · d30a1100
      Wu Fengguang 提交于
      Mmap read-around now shares the same code style and data structure with
      readahead code.
      
      This also removes do_page_cache_readahead().  Its last user, mmap
      read-around, has been changed to call ra_submit().
      
      The no-readahead-if-congested logic is dumped by the way.  Users will be
      pretty sensitive about the slow loading of executables.  So it's
      unfavorable to disabled mmap read-around on a congested queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d30a1100
    • W
      readahead: enforce full readahead size on async mmap readahead · 2fad6f5d
      Wu Fengguang 提交于
      We need this in one particular case and two more general ones.
      
      Now we do async readahead for sequential mmap reads, and do it with the
      help of PG_readahead.  For normal reads, PG_readahead is the sufficient
      condition to do a sequential readahead.  But unfortunately, for mmap
      reads, there is a tiny nuisance:
      
      [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
      [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
      [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
      [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                 ~~~~~~~~~~~~~
      
      An unfavorably small readahead.  The original dumb read-around size could
      be more efficient.
      
      That happened because ld-linux.so does a read(832) in L1 before mmap(),
      which triggers a 4-page readahead, with the second page tagged
      PG_readahead.
      
      L0: open("/lib/libc.so.6", O_RDONLY)        = 3
      L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
      L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
      L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
      L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
      L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
      L7: close(3)                                = 0
      
      In general, the PG_readahead flag will also be hit in cases
      
      - sequential reads
      
      - clustered random reads
      
      A full readahead size is desirable in both cases.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2fad6f5d
    • W
      readahead: sequential mmap readahead · 70ac23cf
      Wu Fengguang 提交于
      Auto-detect sequential mmap reads and do readahead for them.
      
      The sequential mmap readahead will be triggered when
      - sync readahead: it's a major fault and (prev_offset == offset-1);
      - async readahead: minor fault on PG_readahead page with valid readahead state.
      
      The benefits of doing readahead instead of read-around:
      - less I/O wait thanks to async readahead
      - double real I/O size and no more cache hits
      
      The single stream case is improved a little.
      For 100,000 sequential mmap reads:
      
                                          user       system    cpu        total
      (1-1)  plain -mm, 128KB readaround: 3.224      2.554     48.40%     11.838
      (1-2)  plain -mm, 256KB readaround: 3.170      2.392     46.20%     11.976
      (2)  patched -mm, 128KB readahead:  3.117      2.448     47.33%     11.607
      
      The patched (2) has smallest total time, since it has no cache hit overheads
      and less I/O block time(thanks to async readahead). Here the I/O size
      makes no much difference, since there's only one single stream.
      
      Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
      since the half of the read-around pages will be readahead cache hits.
      
      This is going to make _real_ differences for _concurrent_ IO streams.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ac23cf
    • L
      readahead: clean up and simplify the code for filemap page fault readahead · ef00e08e
      Linus Torvalds 提交于
      This shouldn't really change behavior all that much, but the single rather
      complex function with read-ahead inside a loop etc is broken up into more
      manageable pieces.
      
      The behaviour is also less subtle, with the read-ahead being done up-front
      rather than inside some subtle loop and thus avoiding the now unnecessary
      extra state variables (ie "did_readaround" is gone).
      
      Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
      the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
      which case the original code will directly jump to no_cached_page reading.
      
      Cc: Pavel Levshin <lpk@581.spb.su>
      Cc: <wli@movementarian.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef00e08e
    • W
      readahead: remove sync/async readahead call dependency · 51daa88e
      Wu Fengguang 提交于
      The readahead call scheme is error-prone in that it expects the call sites
      to check for async readahead after doing a sync one.  I.e.
      
      			if (!page)
      				page_cache_sync_readahead();
      			page = find_get_page();
      			if (page && PageReadahead(page))
      				page_cache_async_readahead();
      
      This is because PG_readahead could be set by a sync readahead for the
      _current_ newly faulted in page, and the readahead code simply expects one
      more callback on the same page to start the async readahead.  If the
      caller fails to do so, it will miss the PG_readahead bits and never able
      to start an async readahead.
      
      Eliminate this insane constraint by piggy-backing the async part into the
      current readahead window.
      
      Now if an async readahead should be started immediately after a sync one,
      the readahead logic itself will do it.  So the following code becomes
      valid: (the 'else' in particular)
      
      			if (!page)
      				page_cache_sync_readahead();
      			else if (PageReadahead(page))
      				page_cache_async_readahead();
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51daa88e
    • W
      readahead: increase interleaved readahead size · 160334a0
      Wu Fengguang 提交于
      Make sure interleaved readahead size is larger than request size.  This
      also makes the readahead window grow up more quickly.
      Reported-by: NXu Chenfeng <xcf@ustc.edu.cn>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      160334a0
    • W
      readahead: remove one unnecessary radix tree lookup · caca7cb7
      Wu Fengguang 提交于
      (hit_readahead_marker != 0) means the page at @offset is present, so we
      can search for non-present page starting from @offset+1.
      Reported-by: NXu Chenfeng <xcf@ustc.edu.cn>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      caca7cb7
    • W
      readahead: apply max_sane_readahead() limit in ondemand_readahead() · fc31d16a
      Wu Fengguang 提交于
      Just in case someone aggressively sets a huge readahead size.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc31d16a
    • W
      readahead: move max_sane_readahead() calls into force_page_cache_readahead() · f7e839dd
      Wu Fengguang 提交于
      Impact: code simplification.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7e839dd
    • W
      readahead: make mmap_miss an unsigned int · 1ebf26a9
      Wu Fengguang 提交于
      This makes the performance impact of possible mmap_miss wrap around to be
      temporary and tolerable: i.e.  MMAP_LOTSAMISS=100 extra readarounds.
      
      Otherwise if ever mmap_miss wraps around to negative, it takes INT_MAX
      cache misses to bring it back to normal state.  During the time mmap
      readaround will be _enabled_ for whatever wild random workload.  That's
      almost permanent performance impact.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1ebf26a9
    • A
      mm: consolidate init_mm definition · bb1f17b0
      Alexey Dobriyan 提交于
      * create mm/init-mm.c, move init_mm there
      * remove INIT_MM, initialize init_mm with C99 initializer
      * unexport init_mm on all arches:
      
        init_mm is already unexported on x86.
      
        One strange place is some OMAP driver (drivers/video/omap/) which
        won't build modular, but it's already wants get_vm_area() export.
        Somebody should look there.
      
      [akpm@linux-foundation.org: add missing #includes]
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Frysinger <vapier.adi@gmail.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb1f17b0
    • Y
      firmware_map: fix hang with x86/32bit · 3b0fde0f
      Yinghai Lu 提交于
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13484
      
      Peer reported:
      | The bug is introduced from kernel 2.6.27, if E820 table reserve the memory
      | above 4G in 32bit OS(BIOS-e820: 00000000fff80000 - 0000000120000000
      | (reserved)), system will report Int 6 error and hang up. The bug is caused by
      | the following code in drivers/firmware/memmap.c, the resource_size_t is 32bit
      | variable in 32bit OS, the BUG_ON() will be invoked to result in the Int 6
      | error. I try the latest 32bit Ubuntu and Fedora distributions, all hit this
      | bug.
      |======
      |static int firmware_map_add_entry(resource_size_t start, resource_size_t end,
      |                  const char *type,
      |                  struct firmware_map_entry *entry)
      
      and it only happen with CONFIG_PHYS_ADDR_T_64BIT is not set.
      
      it turns out we need to pass u64 instead of resource_size_t for that.
      
      [akpm@linux-foundation.org: add comment]
      Reported-and-tested-by: NPeer Chen <pchen@nvidia.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b0fde0f
    • R
      spi: takes size of a pointer to determine the size of the pointed-to type · 02141546
      Roel Kluin 提交于
      Do not take the size of a pointer to determine the size of the pointed-to
      type.
      Signed-off-by: NRoel Kluin <roel.kluin@gmail.com>
      Acked-by: NAnton Vorontsov <avorontsov@ru.mvista.com>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Kumar Gala <galak@gate.crashing.org>
      Acked-by: NGrant Likely <grant.likely@secretlab.ca>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02141546
    • A
      time: move PIT_TICK_RATE to linux/timex.h · 08604bd9
      Arnd Bergmann 提交于
      PIT_TICK_RATE is currently defined in four architectures, but in three
      different places.  While linux/timex.h is not the perfect place for it, it
      is still a reasonable replacement for those drivers that traditionally use
      asm/timex.h to get CLOCK_TICK_RATE and expect it to be the PIT frequency.
      
      Note that for Alpha, the actual value changed from 1193182UL to 1193180UL.
       This is unlikely to make a difference, and probably can only improve
      accuracy.  There was a discussion on the correct value of CLOCK_TICK_RATE
      a few years ago, after which every existing instance was getting changed
      to 1193182.  According to the specification, it should be
      1193181.818181...
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Dmitry Torokhov <dtor@mail.ru>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08604bd9
  2. 16 6月, 2009 10 次提交
  3. 15 6月, 2009 11 次提交