- 17 June 2009: 9 commits
-
-
Submitted by Wu Fengguang

We need this in one particular case and two more general ones.

Now we do async readahead for sequential mmap reads, and do it with the
help of PG_readahead. For normal reads, PG_readahead is the sufficient
condition to do a sequential readahead. But unfortunately, for mmap reads,
there is a tiny nuisance:

[11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
[11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
[11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
[11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                ~~~~~~~~~~~~~

An unfavorably small readahead. The original dumb read-around size could
be more efficient.

That happened because ld-linux.so does a read(832) in L1 before mmap(),
which triggers a 4-page readahead, with the second page tagged
PG_readahead.

L0: open("/lib/libc.so.6", O_RDONLY) = 3
L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
L7: close(3) = 0

In general, the PG_readahead flag will also be hit in cases
- sequential reads
- clustered random reads
A full readahead size is desirable in both cases.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Wu Fengguang

Auto-detect sequential mmap reads and do readahead for them.

The sequential mmap readahead will be triggered when
- sync readahead: it's a major fault and (prev_offset == offset-1);
- async readahead: minor fault on a PG_readahead page with valid readahead state.

The benefits of doing readahead instead of read-around:
- less I/O wait thanks to async readahead
- double real I/O size and no more cache hits

The single stream case is improved a little. For 100,000 sequential mmap reads:

                                        user       system     cpu        total
(1-1)  plain -mm, 128KB readaround:     3.224      2.554      48.40%     11.838
(1-2)  plain -mm, 256KB readaround:     3.170      2.392      46.20%     11.976
(2)    patched -mm, 128KB readahead:    3.117      2.448      47.33%     11.607

The patched kernel (2) has the smallest total time, since it has no cache
hit overheads and less I/O block time (thanks to async readahead). Here
the I/O size makes little difference, since there's only one single
stream.

Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is
128KB, since half of the read-around pages will be readahead cache hits.

This is going to make _real_ differences for _concurrent_ I/O streams.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Linus Torvalds

This shouldn't really change behavior all that much, but the single rather
complex function with read-ahead inside a loop etc. is broken up into more
manageable pieces.

The behaviour is also less subtle, with the read-ahead being done up-front
rather than inside some subtle loop, and thus avoiding the now unnecessary
extra state variables (i.e. "did_readaround" is gone).

Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
which case the original code would jump directly to no_cached_page reading.

Cc: Pavel Levshin <lpk@581.spb.su>
Cc: <wli@movementarian.org>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Wu Fengguang

The readahead call scheme is error-prone in that it expects the call sites
to check for async readahead after doing a sync one. I.e.

	if (!page)
		page_cache_sync_readahead();
	page = find_get_page();
	if (page && PageReadahead(page))
		page_cache_async_readahead();

This is because PG_readahead could be set by a sync readahead for the
_current_ newly faulted-in page, and the readahead code simply expects one
more callback on the same page to start the async readahead. If the caller
fails to do so, it will miss the PG_readahead bits and never be able to
start an async readahead.

Eliminate this insane constraint by piggy-backing the async part into the
current readahead window.

Now if an async readahead should be started immediately after a sync one,
the readahead logic itself will do it. So the following code becomes valid
(the 'else' in particular):

	if (!page)
		page_cache_sync_readahead();
	else if (PageReadahead(page))
		page_cache_async_readahead();

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Wu Fengguang

Make sure the interleaved readahead size is larger than the request size.
This also makes the readahead window grow more quickly.

Reported-by: Xu Chenfeng <xcf@ustc.edu.cn>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Wu Fengguang

(hit_readahead_marker != 0) means the page at @offset is present, so we
can search for a non-present page starting from @offset + 1.

Reported-by: Xu Chenfeng <xcf@ustc.edu.cn>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Wu Fengguang

Just in case someone aggressively sets a huge readahead size.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Wu Fengguang

Impact: code simplification.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Alexey Dobriyan

* create mm/init-mm.c, move init_mm there
* remove INIT_MM, initialize init_mm with a C99 initializer
* unexport init_mm on all arches:

  init_mm is already unexported on x86.

  One strange place is some OMAP driver (drivers/video/omap/) which won't
  build modular, but it already wants the get_vm_area() export. Somebody
  should look there.

[akpm@linux-foundation.org: add missing #includes]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Frysinger <vapier.adi@gmail.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
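For reference, a C99 designated-initializer definition of init_mm looks roughly like the sketch below; the field list is abridged and illustrative, not the exact contents of mm/init-mm.c for any particular kernel version:

	/* Rough sketch: init_mm defined with C99 designated initializers
	 * instead of the old INIT_MM macro.  Field list abridged. */
	#include <linux/mm_types.h>

	struct mm_struct init_mm = {
		.mm_rb		= RB_ROOT,
		.pgd		= swapper_pg_dir,
		.mm_users	= ATOMIC_INIT(2),
		.mm_count	= ATOMIC_INIT(1),
		.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
		.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
		.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
	};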
-
- 13 June 2009: 1 commit
-
-
Submitted by Rafael J. Wysocki

Remove the shrinking of memory from the suspend-to-RAM code, where it is
not really necessary.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
-
- 12 June 2009: 18 commits
-
-
Submitted by Pekka Enberg

Fixes the following boot-time warning:

[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: at kernel/smp.c:369 smp_call_function_many+0x56/0x1bc()
[    0.000000] Hardware name:
[    0.000000] Modules linked in:
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30 #492
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff8149e021>] ? _spin_unlock+0x4f/0x5c
[    0.000000]  [<ffffffff8108f11b>] ? smp_call_function_many+0x56/0x1bc
[    0.000000]  [<ffffffff81061764>] warn_slowpath_common+0x7c/0xa9
[    0.000000]  [<ffffffff810617a5>] warn_slowpath_null+0x14/0x16
[    0.000000]  [<ffffffff8108f11b>] smp_call_function_many+0x56/0x1bc
[    0.000000]  [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54
[    0.000000]  [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54
[    0.000000]  [<ffffffff8108f2be>] smp_call_function+0x3d/0x68
[    0.000000]  [<ffffffff810f3e00>] ? do_ccupdate_local+0x0/0x54
[    0.000000]  [<ffffffff81066fd8>] on_each_cpu+0x31/0x7c
[    0.000000]  [<ffffffff810f64f5>] do_tune_cpucache+0x119/0x454
[    0.000000]  [<ffffffff81087080>] ? lockdep_init_map+0x94/0x10b
[    0.000000]  [<ffffffff818133b0>] ? kmem_cache_init+0x421/0x593
[    0.000000]  [<ffffffff810f69cf>] enable_cpucache+0x68/0xad
[    0.000000]  [<ffffffff818133c3>] kmem_cache_init+0x434/0x593
[    0.000000]  [<ffffffff8180987c>] ? mem_init+0x156/0x161
[    0.000000]  [<ffffffff817f8aae>] start_kernel+0x1cc/0x3b9
[    0.000000]  [<ffffffff817f829a>] x86_64_start_reservations+0xaa/0xae
[    0.000000]  [<ffffffff817f837f>] x86_64_start_kernel+0xe1/0xe8
[    0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---

Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Pekka Enberg

As explained by Benjamin Herrenschmidt:

  Oh and btw, your patch alone doesn't fix powerpc, because it's missing
  a whole bunch of GFP_KERNEL's in the arch code... You would have to
  grep the entire kernel for things that check slab_is_available() and
  even then you'll be missing some.

  For example, slab_is_available() didn't always exist, and so in the
  early days on powerpc, we used a mem_init_done global that is set from
  mem_init() (not perfect but works in practice). And we still have code
  using that to do the test.

Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab
allocators in early boot code to avoid enabling interrupts.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Pekka Enberg

Fixes the following warning during bootup when compiling with CONFIG_SLAB:

[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: at kernel/lockdep.c:2282 lockdep_trace_alloc+0x91/0xb9()
[    0.000000] Hardware name:
[    0.000000] Modules linked in:
[    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30 #491
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff81087d84>] ? lockdep_trace_alloc+0x91/0xb9
[    0.000000]  [<ffffffff81061764>] warn_slowpath_common+0x7c/0xa9
[    0.000000]  [<ffffffff810617a5>] warn_slowpath_null+0x14/0x16
[    0.000000]  [<ffffffff81087d84>] lockdep_trace_alloc+0x91/0xb9
[    0.000000]  [<ffffffff810f5b03>] kmem_cache_alloc_node_notrace+0x26/0xdf
[    0.000000]  [<ffffffff81487f4e>] ? setup_cpu_cache+0x7e/0x210
[    0.000000]  [<ffffffff81487fe3>] setup_cpu_cache+0x113/0x210
[    0.000000]  [<ffffffff810f73ff>] kmem_cache_create+0x409/0x486
[    0.000000]  [<ffffffff818131c1>] kmem_cache_init+0x232/0x593
[    0.000000]  [<ffffffff8180987c>] ? mem_init+0x156/0x161
[    0.000000]  [<ffffffff817f8aae>] start_kernel+0x1cc/0x3b9
[    0.000000]  [<ffffffff817f829a>] x86_64_start_reservations+0xaa/0xae
[    0.000000]  [<ffffffff817f837f>] x86_64_start_kernel+0xe1/0xe8
[    0.000000] ---[ end trace 4eaa2a86a8e2da22 ]---

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Heiko Carstens

probe_kernel_write() gets used to write to the kernel address space, e.g.
to patch the kernel (kgdb, ftrace, kprobes...). Some architectures,
however, enable write protection for the kernel text section, so that
writes to this region would fault.

This patch allows an architecture-specific version of
probe_kernel_write() to be provided, which can handle and bypass write
protection of the text segment. That way it is still possible to catch
random writes to kernel text while explicitly allowing writes via this
interface.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
-
Submitted by KAMEZAWA Hiroyuki

SLAB is now configured at a very early stage and can be used in init
routines. But replacing alloc_bootmem() in FLAT/DISCONTIGMEM's
page_cgroup() initialization breaks the allocation now. (It works well in
the SPARSEMEM case: that supports MEMORY_HOTPLUG, and the size of
page_cgroup is reasonable, < 1 << MAX_ORDER.)

This patch revives FLATMEM + memory cgroup by using alloc_bootmem.

In the future, we will either stop supporting FLATMEM (if there are no
users) or rewrite the flatmem code completely, but that would add more
messy code and overhead.

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Yinghai Lu

The bootmem allocator is no longer available for page_cgroup_init()
because we set up the kernel slab allocator much earlier now.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Pekka Enberg

We can call vmalloc_init() after kmem_cache_init() and use kzalloc()
instead of the bootmem allocator when initializing vmalloc data
structures.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Pekka Enberg

This patch makes kmalloc() available earlier in the boot sequence so we
can get rid of some bootmem allocations. The bulk of the changes are due
to kmem_cache_init() being called with interrupts disabled, which requires
some changes to the allocator bootstrap code.

Note: 32-bit x86 does the WP protect test in mem_init(), so we must set up
traps before we call mem_init() during boot, as reported by Ingo Molnar:

  We have a hard crash in the WP-protect code:

  [    0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000
  [    0.000000]      EDI 00000188  ESI 00000ac7  EBP c17eaf9c  ESP c17eaf8c
  [    0.000000]      EBX 000014e0  EDX 0000000e  ECX 01856067  EAX 00000001
  [    0.000000]      err 00000003  EIP c10135b1   CS 00000060  flg 00010002
  [    0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc
  [    0.000000]        00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0
  [    0.000000]        c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003
  [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203
  [    0.000000] Call Trace:
  [    0.000000]  [<c15357c2>] ? printk+0x14/0x16
  [    0.000000]  [<c10135b1>] ? do_test_wp_bit+0x19/0x23
  [    0.000000]  [<c17fd410>] ? test_wp_bit+0x26/0x64
  [    0.000000]  [<c17fd7e5>] ? mem_init+0x1ba/0x1d8
  [    0.000000]  [<c17f1668>] ? start_kernel+0x164/0x2f7
  [    0.000000]  [<c17f1322>] ? unknown_bootoption+0x0/0x19c
  [    0.000000]  [<c17f106a>] ? __init_begin+0x6a/0x6f

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Pekka Enberg

If the user requested bootmem allocation on a specific node, we should use
kzalloc_node() for the fallback allocation.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Pekka Enberg

As a preparation for initializing the slab allocator early, make sure the
bootmem allocator does not crash and burn if someone calls it after slab
is up; otherwise we'd need a flag day for switching to early slab.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Catalin Marinas

This patch adds a loadable module that deliberately leaks memory. It is
used for testing various memory leaking scenarios.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
-
Submitted by Catalin Marinas

This patch adds the Kconfig.debug and Makefile entries needed for building
kmemleak into the kernel.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
-
Submitted by Catalin Marinas

The alloc_large_system_hash function is called from various places in the
kernel and it contains pointers to other allocated structures. It
therefore needs to be traced by kmemleak.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
-
Submitted by Catalin Marinas

This patch adds the callbacks to the kmemleak_(alloc|free) functions from
vmalloc/vfree.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
-
Submitted by Catalin Marinas

This patch adds the callbacks to the kmemleak_(alloc|free) functions from
the slub allocator.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Catalin Marinas

This patch adds the callbacks to the kmemleak_(alloc|free) functions from
the slob allocator.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Catalin Marinas

This patch adds the callbacks to the kmemleak_(alloc|free) functions from
the slab allocator. The patch also adds the SLAB_NOLEAKTRACE flag to avoid
recursive calls to kmemleak when it allocates its own data structures.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
Submitted by Catalin Marinas

This patch adds the base support for the kernel memory leak detector. It
traces memory allocation/freeing in a way similar to Boehm's conservative
garbage collector, the difference being that the unreferenced objects are
not freed but only shown in /sys/kernel/debug/kmemleak. Enabling this
feature introduces an overhead to memory allocations.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
-
- 10 June 2009: 2 commits
-
-
Submitted by Paul Mundt

With the "security: use mmap_min_addr indepedently of security models"
change, mmap_min_addr is used in common areas, which subsequently blows up
the nommu build. This stubs in the definition in the nommu case as well.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

--

 mm/nommu.c | 3 +++
 1 file changed, 3 insertions(+)

Signed-off-by: James Morris <jmorris@namei.org>
-
Submitted by Li Zefan

TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

  - zero-copy and per-cpu splice() tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions
  ...

Cons:

  - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the device from a request queue.
    But this may change in the future.

  - A packet command is converted to a string in TP_assign, not TP_print,
    while blktrace does the conversion just before output.

    Since pc requests should be rather rare, this is not a big issue.

  - In blktrace, an event can have 2 different print formats, but a
    TRACE_EVENT has a unique format, which means we have some unused data
    in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

      dd                   dd + ioctl blktrace      dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s         7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s         7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s         7.41s, 42.5 MB/s

So the overhead of tracing is very small, and there is no regression when
using those trace events vs blktrace.

And the binary output of TRACE_EVENT is much smaller than blktrace:

 # ls -l -h
 -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

Following are some comparisons between TRACE_EVENT and blktrace:

plug:
  kjournald-480   [000]   303.084981: block_plug: [kjournald]
  kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]

unplug_io:
  kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
  kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1

remap:
  kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
  kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
  kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
  kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]

getrq:
  kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]

  bash-2066       [001]  1072.953770:   8,0    G   N [bash]
  bash-2066       [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
  konsole-2065    [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
  konsole-2065    [001]   300.053191:   8,0    C   W 103669040 + 16 [0]

  ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
  ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

rq_insert:
  kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]

Changelog from v2 -> v3:

- use the newly introduced __dynamic_array().

Changelog from v1 -> v2:

- use __string() instead of __array() to minimize the memory required to
  store the hex dump of rq->cmd().
- support large pc requests.
- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
- some cleanups.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-
- 09 June 2009: 1 commit
-
-
Submitted by Peter Zijlstra

Some JIT compilers allocate memory for generated code with
posix_memalign() + mprotect(), so we need to hook into mprotect() to make
sure 'perf' is aware that we're executing code in anonymous memory.

[ penberg@cs.helsinki.fi: move the hook to sys_mprotect() ]
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <Pine.LNX.4.64.0906082111030.12407@melkki.cs.Helsinki.FI>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
- 05 June 2009: 1 commit
-
-
Submitted by Peter Zijlstra

In order to track the vdso, also generate mmap events for
install_special_mapping().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
- 04 June 2009: 2 commits
-
-
Submitted by Peter Zijlstra

In the name of keeping it simple, only track mmap events. Userspace will
have to remove old overlapping maps when it encounters them.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Submitted by Christoph Lameter

This patch removes the dependency of mmap_min_addr on CONFIG_SECURITY. It
also sets a default mmap_min_addr of 4096.

mmapping of addresses below 4096 will only be possible for processes with
CAP_SYS_RAWIO.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Eric Paris <eparis@redhat.com>
Looks-ok-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: James Morris <jmorris@namei.org>
-
- 29 May 2009: 4 commits
-
-
Submitted by Nikanth Karthikesan

Fix the build warning, "mem_cgroup_is_obsolete defined but not used", when
CONFIG_DEBUG_VM is not set. Also avoid checking for !mem again and again.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Mel Gorman

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302

hugetlbfs reserves huge pages but does not fault them in at mmap() time to
ensure that future faults succeed. The reservation behaviour differs
depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE.
For MAP_SHARED mappings, hugepages are reserved when mmap() is first
called and are tracked based on information associated with the inode.
Other processes mapping MAP_SHARED use the same reservation. MAP_PRIVATE
mappings track the reservations based on the VMA created as part of the
mmap() operation. Each process mapping MAP_PRIVATE must make its own
reservation.

hugetlbfs currently checks if a VMA is MAP_SHARED with the VM_SHARED flag
and not VM_MAYSHARE. For file-backed mappings, such as hugetlbfs,
VM_SHARED is set only if the mapping is MAP_SHARED and the file was opened
read-write. If a shared memory mapping was mapped shared-read-write for
populating of data and mapped shared-read-only by other processes, then
hugetlbfs would account for the mapping as if it was MAP_PRIVATE. This
causes processes to fail to map the file MAP_SHARED even though it should
succeed, as the reservation is there.

This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE
when the intent of the code was to check whether the VMA was mapped
MAP_SHARED or MAP_PRIVATE.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <starlight@binnacle.cx>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
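The shape of the fix is simply switching which flag the sharedness checks consult; a hedged illustration of the kind of check involved (not an actual hunk from the patch, the helper is hypothetical):

	/* Illustrative only: VM_SHARED is set just for writable MAP_SHARED
	 * mappings, while VM_MAYSHARE covers any MAP_SHARED mapping, which is
	 * what the reservation accounting actually cares about. */
	#include <linux/mm.h>

	static inline bool hugetlb_vma_is_shared(struct vm_area_struct *vma)
	{
		return vma->vm_flags & VM_MAYSHARE;	/* was: VM_SHARED */
	}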
-
Submitted by Daisuke Nishimura

mapping->tree_lock can be acquired from interrupt context. Then, the
following deadlock can occur (assume "A" is a page):

  CPU0: lock_page_cgroup(A)
        interrupted
                -> take mapping->tree_lock.
  CPU1: take mapping->tree_lock
                -> lock_page_cgroup(A)

This patch tries to fix the above deadlock by moving memcg's hook out of
mapping->tree_lock. charge/uncharge of pagecache/swapcache is protected by
the page lock, not tree_lock.

After this patch, lock_page_cgroup() is not called under
mapping->tree_lock.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by David Rientjes

When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
pointer for tasks that have detached mm's, since task_lock() is not held
during the tasklist scan. Add the task_lock().

Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 23 May 2009: 1 commit
-
-
Submitted by Martin K. Petersen

Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
-
- 22 May 2009: 1 commit
-
-
Submitted by Mimi Zohar

Based on discussion on lkml (Andrew Morton and Eric Paris), move
ima_counts_get down a layer into shmem/hugetlb__file_setup(). Resolves the
drm shmem_file_setup() usage case as well.

HD comment: I still think you're doing this at the wrong level, but
recognize that you probably won't be persuaded until a few more users of
alloc_file() emerge, all wanting your ima_counts_get(). Resolving GEM's
shmem_file_setup() is an improvement, so I'll say

Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
-