1. 17 Jun 2009 (16 commits)
• vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC · 78dc583d
  Committed by KOSAKI Motohiro
      Commit 33c120ed ("more aggressively use
      lumpy reclaim") increased how aggressive lumpy reclaim was by isolating
      both active and inactive pages for asynchronous lumpy reclaim on
costly-high-order pages and for cheap-high-order pages when memory pressure is
      high.  However, if the system is under heavy pressure and there are dirty
      pages, asynchronous IO may not be sufficient to reclaim a suitable page in
      time.
      
      This patch causes the caller to enter synchronous lumpy reclaim for
      costly-high-order pages and for cheap-high-order pages when under memory
      pressure.
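
For reference, a hedged sketch of the resulting decision in
shrink_inactive_list() (paraphrased from mm/vmscan.c of that era; names
and placement are approximate, not the exact diff):

	int lumpy_reclaim = 0;

	/* Lumpy reclaim (and with it the synchronous PAGEOUT_IO_SYNC
	 * retry) is used for costly orders unconditionally, and for
	 * cheaper orders only once reclaim priority signals pressure. */
	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
		lumpy_reclaim = 1;
	else if (sc->order && priority < DEF_PRIORITY - 2)
		lumpy_reclaim = 1;

	/* ... after the asynchronous pass freed too few pages ... */
	if (nr_freed < nr_taken && !current_is_kswapd() && lumpy_reclaim) {
		congestion_wait(WRITE, HZ / 10);
		/* retry the writeout synchronously */
		nr_freed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
	}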
      
      Minchan.kim@gmail.com said:
      
Andy added synchronous lumpy reclaim with
c661b078.  At that time, lumpy reclaim was
not aggressive.  His intention was just for high-order users (above
PAGE_ALLOC_COSTLY_ORDER).
      
After some time, Rik added aggressive lumpy reclaim with
33c120ed.  His intention was to do lumpy
reclaim when high-order users have trouble getting a small set of
contiguous pages.
      
So we also have to add synchronous pageout for small sets of contiguous
pages.
      
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <Minchan.kim@gmail.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: clean up get_user_pages_fast() documentation · d2bf6be8
  Committed by Nick Piggin
      Move more documentation for get_user_pages_fast into the new kerneldoc comment.
      Add some comments for get_user_pages as well.
      
Also, move the get_user_pages_fast declaration up next to get_user_pages.  It
wasn't there initially because it was once a static inline function.
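
The kerneldoc in question reads roughly as follows (abridged; the
signature matches the declaration of that era in include/linux/mm.h):

	/**
	 * get_user_pages_fast() - pin user pages in memory
	 * @start:	starting user address
	 * @nr_pages:	number of pages from start to pin
	 * @write:	whether pages will be written to
	 * @pages:	array that receives pointers to the pages pinned,
	 *		should be at least nr_pages long
	 *
	 * Returns the number of pages pinned, which may be fewer than
	 * requested; if no pages were pinned, returns -errno.
	 */
	int get_user_pages_fast(unsigned long start, int nr_pages, int write,
				struct page **pages);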
      
      [akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Andy Grover <andy.grover@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: enforce full sync mmap readahead size · 7ffc59b4
  Committed by Wu Fengguang
      Now that we do readahead for sequential mmap reads, here is a simple
      evaluation of the impacts, and one further optimization.
      
      It's an NFS-root debian desktop system, readahead size = 60 pages.
      The numbers are grabbed after a fresh boot into console.
      
      approach        pgmajfault      RA miss ratio   mmap IO count   avg IO size(pages)
         A            383             31.6%           383             11
         B            225             32.4%           390             11
         C            224             32.6%           307             13
      
      case A: mmap sync/async readahead disabled
      case B: mmap sync/async readahead enabled, with enforced full async readahead size
      case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
      or:
      A = vanilla 2.6.30-rc1
      B = A plus mmap readahead
      C = B plus this patch
      
      The numbers show that
      - there are good possibilities for random mmap reads to trigger readahead
      - 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
      - case C can further reduce IO count by 1/4
      - readahead miss ratios are not quite affected
      
      The theory is
      - readahead is _good_ for clustered random reads, and can perform
        _better_ than readaround because they could be _async_.
      - async readahead size is guaranteed to be larger than readaround
        size, and they are _async_, hence will mostly behave better
However for B:
- sync readahead size could be smaller than readaround size, hence may
  make things worse by producing more, smaller IOs,
which is what this patch fixes.
      
      Final conclusion:
      - mmap readahead reduced major faults by 1/3 and no obvious overheads;
      - mmap io can be further reduced by 1/4 with this patch.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: introduce context readahead algorithm · 10be0b37
  Committed by Wu Fengguang
      Introduce page cache context based readahead algorithm.
      This is to better support concurrent read streams in general.
      
      RATIONALE
      ---------
      The current readahead algorithm detects interleaved reads in a _passive_ way.
      Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...
      By checking for (offset == prev_offset + 1), it will discover the sequentialness
      between 3,4 and between 1004,1005, and start doing sequential readahead for the
      individual streams since page 4 and page 1005.
      
      The context readahead algorithm guarantees to discover the sequentialness no
      matter how the streams are interleaved. For the above example, it will start
      sequential readahead since page 2 and 1002.
      
The trick is to poke for page @offset-1 in the page cache when it has no other
clues on the sequentialness of request @offset: if the current request belongs
to a sequential stream, that stream must have accessed page @offset-1 recently,
and the page will still be cached now. So if page @offset-1 is there, we can
take request @offset as a sequential access.
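
A hedged sketch of that probe (the real code in mm/readahead.c scans a
small history window rather than a single page, but the idea is the same):

	static int page_cached(struct address_space *mapping, pgoff_t offset)
	{
		struct page *page;

		rcu_read_lock();
		page = radix_tree_lookup(&mapping->page_tree, offset);
		rcu_read_unlock();

		return page != NULL;
	}

	/* in ondemand_readahead(), when there is no other clue: */
	if (page_cached(mapping, offset - 1)) {
		/* a stream touched @offset-1 recently: treat @offset as
		 * sequential and open a fresh readahead window here */
		ra->start = offset;
		ra->size = get_init_ra_size(req_size, max);
		ra->async_size = ra->size;
	}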
      
      BENEFICIARIES
      -------------
      - strictly interleaved reads  i.e. 1,1001,2,1002,3,1003,...
        the current readahead will take them as silly random reads;
        the context readahead will take them as two sequential streams.
      
      - cooperative IO processes   i.e. NFS and SCST
        They create a thread pool, farming off (sequential) IO requests to different
        threads which will be performing interleaved IO.
      
  It was not easy (or even possible) to reliably tell from file->f_ra all those
  cooperative processes working on the same sequential stream, since they will
  have different file->f_ra instances. And NFSD's file->f_ra is particularly
  unusable, since their file objects are dynamically created for each request.
  The nfsd does have code trying to restore the f_ra bits, but it is not
  satisfactory.
      
        The new scheme is to detect the sequential pattern via looking up the page
        cache, which provides one single and consistent view of the pages recently
        accessed. That makes sequential detection for cooperative processes possible.
      
      USER REPORT
      -----------
      Vladislav recommends the addition of context readahead as a result of his SCST
      benchmarks. It leads to 6%~40% performance gains in various cases and achieves
      equal performance in others.                http://lkml.org/lkml/2009/3/19/239
      
      OVERHEADS
      ---------
In theory, it introduces one extra page cache lookup per random read.  However
the benchmark below shows context readahead to be slightly faster, which is a
bit puzzling.
      
Randomly reading 200MB of data on a sparse file, repeated 20 times for
each block size. The average throughputs are:
      
                   original ra    context ra    gain
 4K random reads:   65.561MB/s    65.648MB/s   +0.1%
16K random reads:  124.767MB/s   124.951MB/s   +0.1%
64K random reads:  162.123MB/s   162.278MB/s   +0.1%
      
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Vladislav Bolkhovitin <vst@vlnb.net>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: move the random read case to bottom · 045a2529
  Committed by Wu Fengguang
      Split all readahead cases, and move the random one to bottom.
      
      No behavior changes.
      
      This is to prepare for the introduction of context readahead, and make it
      easy for inserting accounting/tracing points for each case.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Vladislav Bolkhovitin <vst@vlnb.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: record mmap read-around states in file_ra_state · d30a1100
  Committed by Wu Fengguang
      Mmap read-around now shares the same code style and data structure with
      readahead code.
      
      This also removes do_page_cache_readahead().  Its last user, mmap
      read-around, has been changed to call ra_submit().
      
The no-readahead-if-congested logic is dropped along the way.  Users will be
pretty sensitive to slow loading of executables, so it's unfavorable to
disable mmap read-around on a congested queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: enforce full readahead size on async mmap readahead · 2fad6f5d
  Committed by Wu Fengguang
      We need this in one particular case and two more general ones.
      
      Now we do async readahead for sequential mmap reads, and do it with the
      help of PG_readahead.  For normal reads, PG_readahead is the sufficient
      condition to do a sequential readahead.  But unfortunately, for mmap
      reads, there is a tiny nuisance:
      
      [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
      [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
      [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
      [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                 ~~~~~~~~~~~~~
      
      An unfavorably small readahead.  The original dumb read-around size could
      be more efficient.
      
      That happened because ld-linux.so does a read(832) in L1 before mmap(),
      which triggers a 4-page readahead, with the second page tagged
      PG_readahead.
      
      L0: open("/lib/libc.so.6", O_RDONLY)        = 3
      L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
      L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
      L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
      L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
      L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
      L7: close(3)                                = 0
      
In general, the PG_readahead flag will also be hit in these cases:
      
      - sequential reads
      
      - clustered random reads
      
      A full readahead size is desirable in both cases.
      
      Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: sequential mmap readahead · 70ac23cf
  Committed by Wu Fengguang
      Auto-detect sequential mmap reads and do readahead for them.
      
      The sequential mmap readahead will be triggered when
      - sync readahead: it's a major fault and (prev_offset == offset-1);
      - async readahead: minor fault on PG_readahead page with valid readahead state.
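
A hedged sketch of how the two triggers map onto the existing readahead
entry points (simplified; the series adds dedicated helpers in
mm/filemap.c, so details differ):

	if (page && PageReadahead(page)) {
		/* minor fault on a PG_readahead page: keep the pipeline going */
		page_cache_async_readahead(mapping, ra, file, page, offset, 1);
	} else if (!page && (ra->prev_pos >> PAGE_CACHE_SHIFT) == offset - 1) {
		/* major fault right after the previous page: sequential */
		page_cache_sync_readahead(mapping, ra, file, offset, 1);
	}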
      
      The benefits of doing readahead instead of read-around:
      - less I/O wait thanks to async readahead
      - double real I/O size and no more cache hits
      
      The single stream case is improved a little.
      For 100,000 sequential mmap reads:
      
                                          user       system    cpu        total
      (1-1)  plain -mm, 128KB readaround: 3.224      2.554     48.40%     11.838
      (1-2)  plain -mm, 256KB readaround: 3.170      2.392     46.20%     11.976
      (2)  patched -mm, 128KB readahead:  3.117      2.448     47.33%     11.607
      
The patched (2) has the smallest total time, since it has no cache hit overheads
and less I/O block time (thanks to async readahead). Here the I/O size
makes little difference, since there's only a single stream.
      
Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
since half of the read-around pages will be readahead cache hits.
      
      This is going to make _real_ differences for _concurrent_ IO streams.
      
      Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: clean up and simplify the code for filemap page fault readahead · ef00e08e
  Committed by Linus Torvalds
      This shouldn't really change behavior all that much, but the single rather
      complex function with read-ahead inside a loop etc is broken up into more
      manageable pieces.
      
      The behaviour is also less subtle, with the read-ahead being done up-front
      rather than inside some subtle loop and thus avoiding the now unnecessary
      extra state variables (ie "did_readaround" is gone).
      
Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
which case the original code would jump directly to no_cached_page reading.
      
      Cc: Pavel Levshin <lpk@581.spb.su>
      Cc: <wli@movementarian.org>
      Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: remove sync/async readahead call dependency · 51daa88e
  Committed by Wu Fengguang
      The readahead call scheme is error-prone in that it expects the call sites
      to check for async readahead after doing a sync one.  I.e.
      
      			if (!page)
      				page_cache_sync_readahead();
      			page = find_get_page();
      			if (page && PageReadahead(page))
      				page_cache_async_readahead();
      
      This is because PG_readahead could be set by a sync readahead for the
      _current_ newly faulted in page, and the readahead code simply expects one
      more callback on the same page to start the async readahead.  If the
caller fails to do so, it will miss the PG_readahead bits and never be able
to start an async readahead.
      
      Eliminate this insane constraint by piggy-backing the async part into the
      current readahead window.
      
      Now if an async readahead should be started immediately after a sync one,
      the readahead logic itself will do it.  So the following code becomes
      valid: (the 'else' in particular)
      
      			if (!page)
      				page_cache_sync_readahead();
      			else if (PageReadahead(page))
      				page_cache_async_readahead();
      
      Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: increase interleaved readahead size · 160334a0
  Committed by Wu Fengguang
Make sure the interleaved readahead size is larger than the request size.  This
also makes the readahead window grow more quickly.
Reported-by: Xu Chenfeng <xcf@ustc.edu.cn>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: remove one unnecessary radix tree lookup · caca7cb7
  Committed by Wu Fengguang
(hit_readahead_marker != 0) means the page at @offset is present, so we
can search for the first non-present page starting from @offset+1.
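
In code terms (hedged sketch of the hit_readahead_marker branch in
ondemand_readahead()):

	if (hit_readahead_marker) {
		pgoff_t start;

		rcu_read_lock();
		/* the page at @offset is known present: scan from @offset+1 */
		start = radix_tree_next_hole(&mapping->page_tree,
					     offset + 1, max);
		rcu_read_unlock();

		if (!start || start - offset > max)
			return 0;	/* no window worth extending */
	}
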
Reported-by: Xu Chenfeng <xcf@ustc.edu.cn>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: apply max_sane_readahead() limit in ondemand_readahead() · fc31d16a
  Committed by Wu Fengguang
      Just in case someone aggressively sets a huge readahead size.
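
For context, max_sane_readahead() at the time clamped the request to
roughly half of the local node's readily available memory (paraphrased
from mm/readahead.c):

	unsigned long max_sane_readahead(unsigned long nr)
	{
		return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE_FILE)
			+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
	}

	/* and the patch applies it where the window is sized: */
	unsigned long max = max_sane_readahead(ra->ra_pages);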
      
      Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• readahead: move max_sane_readahead() calls into force_page_cache_readahead() · f7e839dd
  Committed by Wu Fengguang
      Impact: code simplification.
      
      Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: consolidate init_mm definition · bb1f17b0
  Committed by Alexey Dobriyan
      * create mm/init-mm.c, move init_mm there
      * remove INIT_MM, initialize init_mm with C99 initializer
      * unexport init_mm on all arches:
      
        init_mm is already unexported on x86.
      
  One strange place is some OMAP driver (drivers/video/omap/) which
  won't build modular, but it already wants the get_vm_area() export.
  Somebody should look there.
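
The new mm/init-mm.c looks roughly like this (C99 designated initializers
replacing the old INIT_MM macro; field list abridged to the common case):

	#include <linux/mm_types.h>
	#include <linux/rbtree.h>
	#include <linux/spinlock.h>
	#include <linux/list.h>
	#include <linux/cpumask.h>

	#include <asm/atomic.h>
	#include <asm/pgtable.h>

	struct mm_struct init_mm = {
		.mm_rb		= RB_ROOT,
		.pgd		= swapper_pg_dir,
		.mm_users	= ATOMIC_INIT(2),
		.mm_count	= ATOMIC_INIT(1),
		.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
		.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
		.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
		.cpu_vm_mask	= CPU_MASK_ALL,
	};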
      
      [akpm@linux-foundation.org: add missing #includes]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Frysinger <vapier.adi@gmail.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2. 13 Jun 2009 (1 commit)
3. 12 Jun 2009 (18 commits)
• slab: setup cpu caches later on when interrupts are enabled · 8429db5c
  Committed by Pekka Enberg
      Fixes the following boot-time warning:
      
        [    0.000000] ------------[ cut here ]------------
        [    0.000000] WARNING: at kernel/smp.c:369 smp_call_function_many+0x56/0x1bc()
        [    0.000000] Hardware name:
        [    0.000000] Modules linked in:
        [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30 #492
        [    0.000000] Call Trace:
        [    0.000000]  [<ffffffff8149e021>] ? _spin_unlock+0x4f/0x5c
        [    0.000000]  [<ffffffff8108f11b>] ? smp_call_function_many+0x56/0x1bc
        [    0.000000]  [<ffffffff81061764>] warn_slowpath_common+0x7c/0xa9
        [    0.000000]  [<ffffffff810617a5>] warn_slowpath_null+0x14/0x16
        [    0.000000]  [<ffffffff8108f11b>] smp_call_function_many+0x56/0x1bc
        [    0.000000]  [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54
        [    0.000000]  [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54
        [    0.000000]  [<ffffffff8108f2be>] smp_call_function+0x3d/0x68
        [    0.000000]  [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54
        [    0.000000]  [<ffffffff81066fd8>] on_each_cpu+0x31/0x7c
        [    0.000000]  [<ffffffff810f64f5>] do_tune_cpucache+0x119/0x454
        [    0.000000]  [<ffffffff81087080>] ? lockdep_init_map+0x94/0x10b
        [    0.000000]  [<ffffffff818133b0>] ? kmem_cache_init+0x421/0x593
        [    0.000000]  [<ffffffff810f69cf>] enable_cpucache+0x68/0xad
        [    0.000000]  [<ffffffff818133c3>] kmem_cache_init+0x434/0x593
        [    0.000000]  [<ffffffff8180987c>] ? mem_init+0x156/0x161
        [    0.000000]  [<ffffffff817f8aae>] start_kernel+0x1cc/0x3b9
        [    0.000000]  [<ffffffff817f829a>] x86_64_start_reservations+0xaa/0xae
        [    0.000000]  [<ffffffff817f837f>] x86_64_start_kernel+0xe1/0xe8
        [    0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---
      
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
• slab,slub: don't enable interrupts during early boot · 7e85ee0c
  Committed by Pekka Enberg
      As explained by Benjamin Herrenschmidt:
      
        Oh and btw, your patch alone doesn't fix powerpc, because it's missing
        a whole bunch of GFP_KERNEL's in the arch code... You would have to
        grep the entire kernel for things that check slab_is_available() and
        even then you'll be missing some.
      
  For example, slab_is_available() didn't always exist, and so in the
  early days on powerpc, we used a mem_init_done global that is set from
  mem_init() (not perfect but works in practice). And we still have code
  using that to do the test.
      
      Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab allocators
      in early boot code to avoid enabling interrupts.
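
A hedged sketch of the masking (identifier names here are illustrative;
each slab allocator applies a boot-time GFP mask until sleeping is safe):

	/* GFP bits that must not reach the page allocator before the
	 * scheduler and interrupts are fully up */
	#define GFP_BOOT_MASK \
		(__GFP_BITS_MASK & ~(__GFP_WAIT | __GFP_IO | __GFP_FS))

	static gfp_t slab_gfp_mask __read_mostly = GFP_BOOT_MASK;

	static inline gfp_t slab_mask_flags(gfp_t flags)
	{
		return flags & slab_gfp_mask; /* strips __GFP_WAIT early on */
	}

	/* called once interrupts are enabled and sleeping is allowed: */
	void __init kmem_cache_init_late(void)
	{
		slab_gfp_mask = __GFP_BITS_MASK;
	}
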
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
• slab: fix gfp flag in setup_cpu_cache() · eb91f1d0
  Committed by Pekka Enberg
      Fixes the following warning during bootup when compiling with CONFIG_SLAB:
      
        [    0.000000] ------------[ cut here ]------------
        [    0.000000] WARNING: at kernel/lockdep.c:2282 lockdep_trace_alloc+0x91/0xb9()
        [    0.000000] Hardware name:
        [    0.000000] Modules linked in:
        [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30 #491
        [    0.000000] Call Trace:
        [    0.000000]  [<ffffffff81087d84>] ? lockdep_trace_alloc+0x91/0xb9
        [    0.000000]  [<ffffffff81061764>] warn_slowpath_common+0x7c/0xa9
        [    0.000000]  [<ffffffff810617a5>] warn_slowpath_null+0x14/0x16
        [    0.000000]  [<ffffffff81087d84>] lockdep_trace_alloc+0x91/0xb9
        [    0.000000]  [<ffffffff810f5b03>] kmem_cache_alloc_node_notrace+0x26/0xdf
        [    0.000000]  [<ffffffff81487f4e>] ? setup_cpu_cache+0x7e/0x210
        [    0.000000]  [<ffffffff81487fe3>] setup_cpu_cache+0x113/0x210
        [    0.000000]  [<ffffffff810f73ff>] kmem_cache_create+0x409/0x486
        [    0.000000]  [<ffffffff818131c1>] kmem_cache_init+0x232/0x593
        [    0.000000]  [<ffffffff8180987c>] ? mem_init+0x156/0x161
        [    0.000000]  [<ffffffff817f8aae>] start_kernel+0x1cc/0x3b9
        [    0.000000]  [<ffffffff817f829a>] x86_64_start_reservations+0xaa/0xae
        [    0.000000]  [<ffffffff817f837f>] x86_64_start_kernel+0xe1/0xe8
        [    0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
• [S390] maccess: add weak attribute to probe_kernel_write · d93f82b6
  Committed by Heiko Carstens
      probe_kernel_write() gets used to write to the kernel address space.
      E.g. to patch the kernel (kgdb, ftrace, kprobes...). Some architectures
      however enable write protection for the kernel text section, so that
      writes to this region would fault.
This patch allows specifying an architecture-specific version of
probe_kernel_write() that can handle and bypass write protection of
the text segment.
      That way it is still possible to catch random writes to kernel text
      and explicitly allow writes via this interface.
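
The generic fallback keeps its old body but becomes overridable (sketch
of mm/maccess.c; __weak lets an architecture such as s390 provide a
strong definition that lifts write protection around the copy):

	long __weak probe_kernel_write(void *dst, void *src, size_t size)
	{
		long ret;
		mm_segment_t old_fs = get_fs();

		set_fs(KERNEL_DS);
		pagefault_disable();
		ret = __copy_to_user_inatomic((__force void __user *)dst,
					      src, size);
		pagefault_enable();
		set_fs(old_fs);

		return ret ? -EFAULT : 0;
	}
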
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
• memcg: fix page_cgroup fatal error in FLATMEM · ca371c0d
  Committed by KAMEZAWA Hiroyuki
Now, SLAB is configured at a very early stage and can be used in init
routines.

But replacing alloc_bootmem() in FLAT/DISCONTIGMEM's page_cgroup
initialization breaks the allocation now.
(It works well in the SPARSEMEM case... SPARSEMEM supports MEMORY_HOTPLUG,
 and the size of page_cgroup there is reasonable: < 1 << MAX_ORDER.)

This patch revives FLATMEM+memory cgroup by using alloc_bootmem.

In the future, we will either stop supporting FLATMEM (if it has no
users) or rewrite the code for FLATMEM completely, but that would add
more messy code and overhead.
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
• memcg: don't use bootmem allocator in setup code · 959982fe
  Committed by Yinghai Lu
      The bootmem allocator is no longer available for page_cgroup_init() because we
      set up the kernel slab allocator much earlier now.
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
• vmalloc: use kzalloc() instead of alloc_bootmem() · 43ebdac4
  Committed by Pekka Enberg
      We can call vmalloc_init() after kmem_cache_init() and use kzalloc() instead of
      the bootmem allocator when initializing vmalloc data structures.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
• slab: setup allocators earlier in the boot sequence · 83b519e8
  Committed by Pekka Enberg
This patch makes kmalloc() available earlier in the boot sequence so we can get
rid of some bootmem allocations. The bulk of the changes are due to
kmem_cache_init() being called with interrupts disabled, which requires some
changes to allocator bootstrap code.

Note: 32-bit x86 does a WP-protect test in mem_init(), so we must set up traps
before we call mem_init() during boot, as reported by Ingo Molnar:
      
        We have a hard crash in the WP-protect code:
      
        [    0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000
        [    0.000000]      EDI 00000188  ESI 00000ac7  EBP c17eaf9c  ESP c17eaf8c
        [    0.000000]      EBX 000014e0  EDX 0000000e  ECX 01856067  EAX 00000001
        [    0.000000]      err 00000003  EIP c10135b1   CS 00000060  flg 00010002
        [    0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc
        [    0.000000]        00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0
        [    0.000000]        c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003
        [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203
        [    0.000000] Call Trace:
        [    0.000000]  [<c15357c2>] ? printk+0x14/0x16
        [    0.000000]  [<c10135b1>] ? do_test_wp_bit+0x19/0x23
        [    0.000000]  [<c17fd410>] ? test_wp_bit+0x26/0x64
        [    0.000000]  [<c17fd7e5>] ? mem_init+0x1ba/0x1d8
        [    0.000000]  [<c17f1668>] ? start_kernel+0x164/0x2f7
        [    0.000000]  [<c17f1322>] ? unknown_bootoption+0x0/0x19c
        [    0.000000]  [<c17f106a>] ? __init_begin+0x6a/0x6f
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
• bootmem: fix slab fallback on numa · c91c4773
  Committed by Pekka Enberg
      If the user requested bootmem allocation on a specific node, we should use
      kzalloc_node() for the fallback allocation.
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
• bootmem: use slab if bootmem is no longer available · 441c7e0a
  Committed by Pekka Enberg
      As a preparation for initializing the slab allocator early, make sure the
      bootmem allocator does not crash and burn if someone calls it after slab is up;
      otherwise we'd need a flag day for switching to early slab.
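
A hedged sketch of the guard added to the bootmem entry points (the
NUMA follow-up above refines it to kzalloc_node() when a specific node
was requested):

	if (WARN_ON_ONCE(slab_is_available()))
		return kzalloc(size, GFP_NOWAIT);  /* slab is up: use it */
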
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
• kmemleak: Simple testing module for kmemleak · 0822ee4a
  Committed by Catalin Marinas
      This patch adds a loadable module that deliberately leaks memory. It
      is used for testing various memory leaking scenarios.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
• kmemleak: Enable the building of the memory leak detector · 3bba00d7
  Committed by Catalin Marinas
      This patch adds the Kconfig.debug and Makefile entries needed for
      building kmemleak into the kernel.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
• kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash · dbb1f81c
  Committed by Catalin Marinas
The alloc_large_system_hash function is called from various places in
the kernel, and the tables it allocates contain pointers to other
allocated structures. They therefore need to be traced by kmemleak.
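
The hook itself is a single call after a successful allocation (sketch;
a min_count of 1 tells kmemleak that one live reference to the table is
expected, so only a missing reference is reported as a leak):

	if (table)
		kmemleak_alloc(table, size, 1, GFP_ATOMIC);
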
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
• kmemleak: Add the vmalloc memory allocation/freeing hooks · 89219d37
  Committed by Catalin Marinas
      This patch adds the callbacks to kmemleak_(alloc|free) functions from
      vmalloc/vfree.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
• kmemleak: Add the slub memory allocation/freeing hooks · 06f22f13
  Committed by Catalin Marinas
      This patch adds the callbacks to kmemleak_(alloc|free) functions from the
      slub allocator.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
• kmemleak: Add the slob memory allocation/freeing hooks · 4374e616
  Committed by Catalin Marinas
      This patch adds the callbacks to kmemleak_(alloc|free) functions from the
      slob allocator.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
• kmemleak: Add the slab memory allocation/freeing hooks · d5cff635
  Committed by Catalin Marinas
      This patch adds the callbacks to kmemleak_(alloc|free) functions from
      the slab allocator. The patch also adds the SLAB_NOLEAKTRACE flag to
      avoid recursive calls to kmemleak when it allocates its own data
      structures.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
• kmemleak: Add the base support · 3c7b4e6b
  Committed by Catalin Marinas
      This patch adds the base support for the kernel memory leak
detector. It traces the memory allocation/freeing in a way similar to
Boehm's conservative garbage collector, the difference being that
      the unreferenced objects are not freed but only shown in
      /sys/kernel/debug/kmemleak. Enabling this feature introduces an
      overhead to memory allocations.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
4. 10 Jun 2009 (2 commits)
• nommu: Provide mmap_min_addr definition. · 35f2c2f6
  Committed by Paul Mundt
With the "security: use mmap_min_addr indepedently of security models"
change, mmap_min_addr is used in common areas, which subsequently blows
up the nommu build. This stubs in the definition in the nommu case as
well.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

--

 mm/nommu.c |    3 +++
 1 file changed, 3 insertions(+)
Signed-off-by: James Morris <jmorris@namei.org>
• tracing/events: convert block trace points to TRACE_EVENT() · 55782138
  Committed by Li Zefan
      TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
      these new capabilities to this tracepoint:
      
        - zero-copy and per-cpu splice() tracing
        - binary tracing without printf overhead
        - structured logging records exposed under /debug/tracing/events
        - trace events embedded in function tracer output and other plugins
        - user-defined, per tracepoint filter expressions
        ...
      
      Cons:
      
        - no dev_t info for the output of plug, unplug_timer and unplug_io events.
          no dev_t info for getrq and sleeprq events if bio == NULL.
          no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
      
    This is mainly because we can't get the device from a request queue.
    But this may change in the future.
      
- A packet command is converted to a string in TP_assign, not TP_print,
  while blktrace does the conversion just before output.

  Since pc requests should be rather rare, this is not a big issue.
      
        - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
          has a unique format, which means we have some unused data in a trace entry.
      
          The overhead is minimized by using __dynamic_array() instead of __array().
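
For reference, the simplest of the converted events (taken in spirit
from the patch; larger block events add __dynamic_array() fields for
the variable-sized command dump):

	TRACE_EVENT(block_plug,

		TP_PROTO(struct request_queue *q),

		TP_ARGS(q),

		TP_STRUCT__entry(
			__array(char, comm, TASK_COMM_LEN)
		),

		TP_fast_assign(
			memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
		),

		TP_printk("[%s]", __entry->comm)
	);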
      
      I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
      
            dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
      1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
      2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
      3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s
      
So the overhead of tracing is very small, and there is no regression when
using these trace events vs blktrace.
      
      And the binary output of TRACE_EVENT is much smaller than blktrace:
      
       # ls -l -h
       -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
       -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
       -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
      
      Following are some comparisons between TRACE_EVENT and blktrace:
      
      plug:
        kjournald-480   [000]   303.084981: block_plug: [kjournald]
        kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]
      
      unplug_io:
        kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
        kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1
      
      remap:
        kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
        kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384
      
      bio_backmerge:
        kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
        kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]
      
      getrq:
        kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]
      
        bash-2066  [001]  1072.953770:   8,0    G   N [bash]
        bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
      
      rq_complete:
        konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
        konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]
      
        ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
        ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
      
      rq_insert:
        kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]
      
      Changelog from v2 -> v3:
      
      - use the newly introduced __dynamic_array().
      
      Changelog from v1 -> v2:
      
- use __string() instead of __array() to minimize the memory required
  to store the hex dump of rq->cmd.
      
      - support large pc requests.
      
      - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
      
      - some cleanups.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
5. 09 Jun 2009 (1 commit)
6. 05 Jun 2009 (1 commit)
7. 04 Jun 2009 (1 commit)