Commits · aceda773606f2506a25b91aaafae87b2e4315834 · bug2833 / cloud-kernel

14 Sep, 2009 1 commit

Committed by Eric Dumazet on Sep 03, 2009

When SLAB_POISON is used and slab_pad_check() finds an overwrite of the
slab padding, we call restore_bytes() on the whole slab, not only
on the padding.
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

8a3d271d
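
The padding problem described in this commit message can be pictured with a small userspace model. This is only a sketch, not the kernel code: the flat `bytearray` layout, the `restore_padding` helper and the fill byte used for `POISON_INUSE` are assumptions made for illustration. It shows the intended behaviour, where only the padding region is rewritten after an overwrite is detected and the object data is left alone.

```python
POISON_INUSE = 0x5A  # assumed fill byte for slab padding in this model


def restore_padding(slab, used):
    """Check the padding that follows the first `used` bytes of `slab`.

    Returns True if an overwrite was found. In that case only the padding
    is restored; the object area slab[:used] is never touched.
    """
    padding = slab[used:]
    if all(byte == POISON_INUSE for byte in padding):
        return False
    # Repair the padding region only, not the whole slab.
    slab[used:] = bytes([POISON_INUSE]) * len(padding)
    return True
```

Restoring the whole slab instead would also wipe live objects, which is the behaviour the commit message calls out.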

11 Sep, 2009 7 commits

kmemleak: Improve the "Early log buffer exceeded" error message · addd72c1

Committed by Catalin Marinas on Sep 11, 2009

Based on a suggestion from Jaswinder, clarify what the user would need
to do to avoid this error message from kmemleak.
Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

addd72c1

writeback: check for registered bdi in flusher add and inode dirty · 500b067c

Committed by Jens Axboe on Sep 09, 2009

Also a debugging aid. We want to catch dirty inodes being added to
backing devices that don't do writeback.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

500b067c

writeback: add name to backing_dev_info · d993831f

Committed by Jens Axboe on Jun 12, 2009

This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

d993831f

writeback: add some debug inode list counters to bdi stats · f09b00d3

Committed by Jens Axboe on May 25, 2009

Add some debug entries to be able to inspect the internal state of
the writeback details.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

f09b00d3

writeback: get rid of pdflush completely · d0bceac7

Committed by Jens Axboe on May 18, 2009

It is now unused, so kill it off.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

d0bceac7

writeback: switch to per-bdi threads for flushing data · 03ba3782

Committed by Jens Axboe on Sep 09, 2009

This gets rid of pdflush for bdi writeout and kupdated style cleaning.
pdflush writeout suffers from lack of locality and also requires more
threads to handle the same workload, since it has to work in a
non-blocking fashion against each queue. This also introduces lumpy
behaviour and potential request starvation, since pdflush can be starved
for queue access if others are accessing it. A sample ffsb workload that
does random writes to files is about 8% faster here on a simple SATA drive
during the benchmark phase. File layout also seems a LOT more smooth in
vmstat:

r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

where vanilla tends to fluctuate a lot in the creation phase:

r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
SSD based writeback test on XFS performs over 20% better as well, with
the throughput being very stable around 1GB/sec, where pdflush only
manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as does random mmap'ed
writes.

A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

03ba3782

writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2

Committed by Jens Axboe on Sep 02, 2009

This is a first step at introducing per-bdi flusher threads. We should
have no change in behaviour, although sb_has_dirty_inodes() is now
ridiculously expensive, as there's no easy way to answer that question.
Not a huge problem, since it'll be deleted in subsequent patches.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

66f3b8e2

09 Sep, 2009 4 commits

shmfs: use 'check_acl' instead of 'permission' · 6d848a48

Committed by Linus Torvalds on Aug 28, 2009

shmfs wants purely standard POSIX ACL semantics, so we can use the new
generic VFS layer POSIX ACL checking rather than cooking our own
'permission()' function.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6d848a48

kmemleak: fix sparse warning for static declarations · 7eb0d5e5

Committed by Luis R. Rodriguez on Sep 08, 2009

This fixes these sparse warnings:

mm/kmemleak.c:1179:6: warning: symbol 'start_scan_thread' was not declared. Should it be static?
mm/kmemleak.c:1194:6: warning: symbol 'stop_scan_thread' was not declared. Should it be static?
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

7eb0d5e5

kmemleak: fix sparse warning over overshadowed flags · 0580a181

Committed by Luis R. Rodriguez on Sep 08, 2009

A second irq_save is not required because the lock taken before it
has already disabled irqs.

This fixes this sparse warning:
mm/kmemleak.c:512:31: warning: symbol 'flags' shadows an earlier one
mm/kmemleak.c:448:23: originally declared here
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

0580a181

kmemleak: move common painting code together · a1084c87

Committed by Luis R. Rodriguez on Sep 04, 2009

Painting grey or black does the same thing, so bring
this together into a helper and identify the grey or
black coloring explicitly with defines. This makes the code a little
easier to read.
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

a1084c87

08 Sep, 2009 3 commits

kmemleak: add clear command support · 30b37101

Committed by Luis R. Rodriguez on Sep 04, 2009

In an ideal world your kmemleak output will be small. When it's
not (usually during initial bootup), you can use the clear command
to ignore previously reported and unreferenced kmemleak objects. We
do this by painting all currently reported unreferenced objects grey.
We paint them grey instead of black to allow future scans on the same
objects, as such objects could still potentially reference newly
allocated objects in the future.

To test a critical section on demand with a clean
/sys/kernel/debug/kmemleak you can do:

echo clear > /sys/kernel/debug/kmemleak
        test your kernel or modules
echo scan > /sys/kernel/debug/kmemleak

Then, as usual, get your report with:

cat /sys/kernel/debug/kmemleak
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

30b37101
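
The grey painting done by the clear command can be illustrated with a toy model. This is a sketch under stated assumptions: the `TrackedObject` class, the `clear` and `report` helpers, and the numeric values chosen for `KMEMLEAK_GREY` and `KMEMLEAK_BLACK` are hypothetical stand-ins, not the kernel's data structures. The model treats an object as a suspected leak when a scan found fewer references than its `min_count`.

```python
from dataclasses import dataclass

KMEMLEAK_GREY = 0    # assumed: never reported, but still scanned
KMEMLEAK_BLACK = -1  # assumed: ignored entirely (shown for contrast)


@dataclass
class TrackedObject:
    name: str
    min_count: int = 1      # references required for the object not to be a leak
    count: int = 0          # references found by the last scan
    reported: bool = False  # already shown to the user once

    def is_unreferenced(self):
        return self.count < self.min_count


def clear(objects):
    """Paint every already-reported, unreferenced object grey."""
    for obj in objects:
        if obj.reported and obj.is_unreferenced():
            obj.min_count = KMEMLEAK_GREY


def report(objects):
    """Names of the objects that would currently be listed as leaks."""
    return [obj.name for obj in objects if obj.is_unreferenced()]
```

After `clear()`, previously reported objects drop out of the report, while objects that become unreferenced later still show up.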

kmemleak: use bool for true/false questions · 4a558dd6

Committed by Luis R. Rodriguez on Sep 08, 2009

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

4a558dd6

kmemleak: Do no create the clean-up thread during kmemleak_disable() · 179a8100

Committed by Catalin Marinas on Sep 07, 2009

The kmemleak_disable() function could be called from various contexts
including IRQ. It creates a clean-up thread but the kthread_create()
function has restrictions on which contexts it can be called from,
mainly because of the kthread_create_lock. The patch changes the
kmemleak clean-up thread to a workqueue.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Eric Paris <eparis@redhat.com>

179a8100

06 Sep, 2009 2 commits

page-allocator: always change pageblock ownership when anti-fragmentation is disabled · dd5d241e

Committed by Mel Gorman on Sep 05, 2009

On low-memory systems, anti-fragmentation gets disabled as fragmentation
cannot be avoided on a sufficiently large boundary to be worthwhile.  Once
disabled, there is a period of time when all the pageblocks are marked
MOVABLE and the expectation is that they get marked UNMOVABLE at each call
to __rmqueue_fallback().

However, when MAX_ORDER is large the pageblocks do not change ownership
because the normal criteria are not met.  This has the effect of
prematurely breaking up too many large contiguous blocks.  This is most
serious on NOMMU systems which depend on high-order allocations to boot.
This patch causes pageblocks to change ownership on every fallback when
anti-fragmentation is disabled.  This prevents the large blocks being
prematurely broken up.

This is a fix to commit 49255c61 [page
allocator: move check for disabled anti-fragmentation out of fastpath] and
the problem affects 2.6.31-rc8.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Paul Mundt <lethal@linux-sh.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dd5d241e

nommu: fix error handling in do_mmap_pgoff() · a190887b

Committed by David Howells on Sep 05, 2009

Fix the error handling in do_mmap_pgoff().  If do_mmap_shared_file() or
do_mmap_private() fail, we jump to the error_put_region label at which
point we call __put_nommu_region() on the region - but we haven't yet
added the region to the tree, and so __put_nommu_region() may BUG
because the region tree is empty or it may corrupt the region tree.

To get around this, we can afford to add the region to the region tree
before calling do_mmap_shared_file() or do_mmap_private() as we keep
nommu_region_sem write-locked, so no-one can race with us by seeing a
transient region.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a190887b

04 Sep, 2009 4 commits

kmemleak: Scan all thread stacks · 43ed5d6e

Committed by Catalin Marinas on Sep 01, 2009

This patch replaces the for_each_process() loop with the
do_each_thread()/while_each_thread() pair.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

43ed5d6e

kmemleak: Don't scan uninitialized memory when kmemcheck is enabled · 8e019366

Committed by Pekka Enberg on Aug 27, 2009

Ingo Molnar reported the following kmemcheck warning when running with both
kmemleak and kmemcheck enabled:

  PM: Adding info for No Bus:vcsa7
  WARNING: kmemcheck: Caught 32-bit read from uninitialized memory
  (f6f6e1a4)
  d873f9f600000000c42ae4c1005c87f70000000070665f666978656400000000
   i i i i u u u u i i i i i i i i i i i i i i i i i i i i i u u u
           ^

  Pid: 3091, comm: kmemleak Not tainted (2.6.31-rc7-tip #1303) P4DC6
  EIP: 0060:[<c110301f>] EFLAGS: 00010006 CPU: 0
  EIP is at scan_block+0x3f/0xe0
  EAX: f40bd700 EBX: f40bd780 ECX: f16b46c0 EDX: 00000001
  ESI: f6f6e1a4 EDI: 00000000 EBP: f10f3f4c ESP: c2605fcc
   DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
  CR0: 8005003b CR2: e89a4844 CR3: 30ff1000 CR4: 000006f0
  DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
  DR6: ffff4ff0 DR7: 00000400
   [<c110313c>] scan_object+0x7c/0xf0
   [<c1103389>] kmemleak_scan+0x1d9/0x400
   [<c1103a3c>] kmemleak_scan_thread+0x4c/0xb0
   [<c10819d4>] kthread+0x74/0x80
   [<c10257db>] kernel_thread_helper+0x7/0x3c
   [<ffffffff>] 0xffffffff
  kmemleak: 515 new suspected memory leaks (see
  /sys/kernel/debug/kmemleak)
  kmemleak: 42 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

The problem here is that kmemleak scans partially initialized
objects, which makes kmemcheck complain. Fix that up by skipping
uninitialized memory regions when kmemcheck is enabled.
Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

8e019366

slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCU · d76b1590

Committed by Eric Dumazet on Sep 03, 2009

kmem_cache_destroy() should call rcu_barrier() *after* kmem_cache_close() and
*before* sysfs_slab_remove() or risk rcu_free_slab() being called after
kmem_cache is deleted (kfreed).

rmmod nf_conntrack can crash the machine because it has to kmem_cache_destroy()
a SLAB_DESTROY_BY_RCU enabled cache.

Cc: <stable@kernel.org>
Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

d76b1590

slub: release kobject if sysfs_create_group failed in sysfs_slab_add · 5788d8ad

Committed by Xiaotian Feng on Jul 22, 2009

When CONFIG_SLUB_DEBUG is enabled, sysfs_slab_add should unlink and put the
kobject if sysfs_create_group failed. Otherwise, sysfs_slab_add returns an error
and kmem_cache s is then freed, so the memory of s->kobj is leaked.
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

5788d8ad

01 Sep, 2009 1 commit

percpu: don't assume existence of cpu0 · 04a13c7c

Committed by Tejun Heo on Sep 01, 2009

percpu incorrectly assumed that cpu0 was always there which led to the
following warning and eventual oops on sparc machines w/o cpu0.

  WARNING: at mm/percpu.c:651 pcpu_map+0xdc/0x100()
  Modules linked in:
  Call Trace:
    [000000000045eb70] warn_slowpath_common+0x50/0xa0
    [000000000045ebdc] warn_slowpath_null+0x1c/0x40
    [00000000004d493c] pcpu_map+0xdc/0x100
    [00000000004d59a4] pcpu_alloc+0x3e4/0x4e0
    [00000000004d5af8] __alloc_percpu+0x18/0x40
    [00000000005b112c] __percpu_counter_init+0x4c/0xc0
  ...
  Unable to handle kernel NULL pointer dereference
  ...
   I7: <sysfs_new_dirent+0x30/0x120>
   Disabling lock debugging due to kernel taint
   Caller[000000000053c1b0]: sysfs_new_dirent+0x30/0x120
   Caller[000000000053c7a4]: create_dir+0x24/0xc0
   Caller[000000000053c870]: sysfs_create_dir+0x30/0x80
   Caller[00000000005990e8]: kobject_add_internal+0xc8/0x200
  ...
   Kernel panic - not syncing: Attempted to kill the idle task!

This patch fixes the problem by backporting parts from devel branch to
make percpu core not depend on the existence of cpu0.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Meelis Roos <mroos@linux.ee>
Cc: David Miller <davem@davemloft.net>

04a13c7c

30 Aug, 2009 1 commit

SLUB: fix ARCH_KMALLOC_MINALIGN cases 64 and 256 · acdfcd04

Committed by Aaro Koskinen on Aug 28, 2009

If the minalign is 64 bytes, then the 96 byte cache should not be created
because it would conflict with the 128 byte cache.

If the minalign is 256 bytes, patching the size_index table should not
result in a buffer overrun.

The calculation "(i - 1) / 8" used to access size_index[] is moved to
a separate function as suggested by Christoph Lameter.
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

acdfcd04
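
The arithmetic behind both cases in this commit can be checked with a short sketch. The `(i - 1) / 8` expression comes from the commit message; the Python helper names and the assumed table length `SIZE_INDEX_ENTRIES = 24` (one slot per 8 bytes up to 192) are illustrative assumptions, not the kernel source.

```python
SIZE_INDEX_ENTRIES = 24  # assumed table length: one slot per 8 bytes, up to 192


def size_index_elem(nbytes):
    """Slot of the size_index-style table that covers a request of `nbytes`."""
    return (nbytes - 1) // 8


def round_up(size, align):
    """Round `size` up to the next multiple of `align`."""
    return (size + align - 1) // align * align


def fits_in_table(nbytes):
    """True if patching the slot for `nbytes` stays inside the table."""
    return size_index_elem(nbytes) < SIZE_INDEX_ENTRIES
```

With a 64-byte minimum alignment, `round_up(96, 64)` gives 128, which is why a separate 96-byte cache would collide with the 128-byte one. With a 256-byte alignment, `size_index_elem(256)` lands past the end of a 24-entry table, so the patching loop has to be bounded.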

27 Aug, 2009 7 commits

kmemleak: Printing of the objects hex dump · 0494e082

Committed by Sergey Senozhatsky on Aug 27, 2009

Introduce printing of the object's hex dump to the seq file.
The number of lines to be printed is limited to HEX_MAX_LINES
to prevent seq file spamming. The actual number of printed
bytes is less than or equal to (HEX_MAX_LINES * HEX_ROW_SIZE).

(slight adjustments by Catalin Marinas)
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@mail.by>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

0494e082
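
The size bound stated in this commit message can be sketched as follows. The constant names `HEX_MAX_LINES` and `HEX_ROW_SIZE` come from the message, but the values used here and the output format are assumptions chosen for illustration.

```python
HEX_ROW_SIZE = 16   # assumed number of bytes per printed row
HEX_MAX_LINES = 2   # assumed cap on the number of rows


def hex_dump(data):
    """Return at most HEX_MAX_LINES rows of hex text for `data`."""
    limit = min(len(data), HEX_MAX_LINES * HEX_ROW_SIZE)
    rows = []
    for offset in range(0, limit, HEX_ROW_SIZE):
        chunk = data[offset:min(offset + HEX_ROW_SIZE, limit)]
        rows.append(" ".join(f"{byte:02x}" for byte in chunk))
    return rows
```

However large the object is, the dump never exceeds `HEX_MAX_LINES * HEX_ROW_SIZE` bytes, which keeps a single leak report short.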

kmemleak: Do not report alloc_bootmem blocks as leaks · 008139d9

Committed by Catalin Marinas on Aug 27, 2009

This patch sets the min_count for alloc_bootmem objects to 0 so that
they are never reported as leaks. This is because many of these blocks
are only referenced via their physical address, which is not looked up by
kmemleak.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>

008139d9

kmemleak: Save the stack trace for early allocations · fd678967

Committed by Catalin Marinas on Aug 27, 2009

Before slab is initialised, kmemleak saves the allocations in an early
log buffer. They are later recorded as normal memory allocations. This
patch adds the stack trace saving to the early log buffer, otherwise the
information shown for such objects only refers to the kmemleak_init()
function.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

fd678967

kmemleak: Mark the early log buffer as __initdata · a6186d89

Committed by Catalin Marinas on Aug 27, 2009

This buffer isn't needed after kmemleak has been initialised, so it can be
freed together with the .init.data section. This patch also marks
functions conditionally accessing the early log variables with __ref.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

a6186d89

kmemleak: Dump object information on request · 189d84ed

Committed by Catalin Marinas on Aug 27, 2009

By writing dump=<addr> to the kmemleak file, kmemleak will look up an
object with that address and dump the information it has about it to
syslog. This is useful in debugging memory leaks.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

189d84ed

kmemleak: Allow rescheduling during an object scanning · af98603d

Committed by Catalin Marinas on Aug 27, 2009

If the object size is bigger than a predefined value (4K in this case),
release the object lock during scanning and call cond_resched().
Re-acquire the lock after rescheduling and test whether the object is
still valid.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

af98603d
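
The lock-drop pattern described in this commit can be modelled in userspace. This is a rough sketch, not the kernel implementation: `ScannedObject`, `scan_object`, the `allocated` flag, and the use of `time.sleep(0)` as a stand-in for `cond_resched()` are all assumptions. Only the 4K chunk size is taken from the commit message.

```python
import threading
import time

MAX_SCAN_SIZE = 4096  # chunk size; the 4K figure is from the commit message


class ScannedObject:
    def __init__(self, data):
        self.lock = threading.Lock()
        self.data = data
        self.allocated = True  # cleared when the object is freed


def scan_object(obj, visit):
    """Scan obj.data in chunks and return the number of bytes scanned.

    Between chunks the lock is dropped so other threads can run; after
    re-acquiring it, the object is re-checked before scanning continues.
    """
    scanned = 0
    with obj.lock:
        while obj.allocated and scanned < len(obj.data):
            chunk = obj.data[scanned:scanned + MAX_SCAN_SIZE]
            visit(chunk)
            scanned += len(chunk)
            if scanned < len(obj.data):
                obj.lock.release()
                time.sleep(0)       # stand-in for cond_resched()
                obj.lock.acquire()
    return scanned
```

Re-checking `allocated` at the top of the loop is the important part: the object may have been freed while the lock was not held.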

mm: fix for infinite churning of mlocked pages · 03ef83af

Committed by Minchan Kim on Aug 26, 2009

An mlocked page might lose the isolation race.  This causes the page to
clear PG_mlocked while it remains in a VM_LOCKED vma.  This means it can
be put onto the [in]active list.  We can rescue it by using try_to_unmap()
in shrink_page_list().

But now, as Wu Fengguang pointed out, vmscan has a bug.  If the page has
PG_referenced, it can't reach try_to_unmap() in shrink_page_list() but is
put onto the active list.  If the page is referenced repeatedly, it can
remain on the [in]active list without being moved to the unevictable
list.

This patch fixes it.
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

03ef83af

20 Aug, 2009 1 commit

SLUB: Fix some coding style issues · 5086c389

Committed by Amerigo Wang on Aug 19, 2009

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

5086c389

19 Aug, 2009 4 commits

mm: build_zonelists(): move clear node_load[] to __build_all_zonelists() · 7f9cfb31

Committed by Bo Liu on Aug 18, 2009

If node_load[] is cleared every time build_zonelists() is
called, node_load[] cannot help find the next node that should
appear in the given node's fallback list.

Because of the bug, zonelist's node_order is not calculated as expected.
This bug affects big machines that have asymmetric node distances.

[symmetric NUMA's node distance]
     0    1    2
0   10   12   12
1   12   10   12
2   12   12   10

[asymmetric NUMA's node distance]
     0    1    2
0   10   12   20
1   12   10   14
2   20   14   10

This (my bug) is very old, but no one has reported it for a long time,
maybe because the number of asymmetric NUMA machines is very small and they use
cpuset for customizing node memory allocation fallback.

[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
Signed-off-by: Bo Liu <bo-liu@hotmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7f9cfb31

nommu: check fd read permission in validate_mmap_request() · 28d7a6ae

Committed by Graff Yang on Aug 18, 2009

According to POSIX (1003.1-2008), the file descriptor shall have been
opened with read permission, regardless of the protection options specified to
mmap().  The ltp test cases mmap06/07 need this.
Signed-off-by: Graff Yang <graff.yang@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

28d7a6ae
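
The POSIX rule cited in this commit can be probed from userspace with a small script. This is a sketch only: the helper name `mmap_write_only_errno` is hypothetical, and the script exercises whatever kernel it runs on rather than the nommu code path touched by the patch. It opens a temporary file write-only and requests a shared, write-only mapping, which a conforming system is expected to refuse with `errno.EACCES`.

```python
import errno
import mmap
import os
import tempfile


def mmap_write_only_errno(length=mmap.PAGESIZE):
    """Map a file opened O_WRONLY as shared and writable.

    Returns 0 if the mapping succeeds, otherwise the errno reported by the OS.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * length)  # the file must be at least `length` bytes
    finally:
        os.close(fd)
    try:
        fd = os.open(path, os.O_WRONLY)  # descriptor without read permission
        try:
            mapping = mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                                prot=mmap.PROT_WRITE)
        except OSError as exc:
            return exc.errno
        finally:
            os.close(fd)
        mapping.close()
        return 0
    finally:
        os.unlink(path)


# Compare the result against errno.EACCES to see whether the rule is enforced.
```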

mm: revert "oom: move oom_adj value" · 0753ba01

Committed by KOSAKI Motohiro on Aug 18, 2009

The commit 2ff05b2b (oom: move oom_adj value) moved the oom_adj value to
the mm_struct.  It was a very good first step toward sanitizing the OOM killer.

However, Paul Menage reported that the commit causes a regression in his job
scheduler.  The current OOM logic can kill an OOM_DISABLE'd process.

Why? His program has code similar to the following.

	...
	set_oom_adj(OOM_DISABLE); /* The job scheduler is never killed by oom */
	...
	if (vfork() == 0) {
		set_oom_adj(0); /* Invoked child can be killed */
		execve("foo-bar-cmd");
	}
	....

The vfork() parent and child share the same mm_struct, so the
set_oom_adj(0) above changes oom_adj not only for the vfork() child but
also for the vfork() parent.  The vfork() parent (the job scheduler)
then loses its OOM immunity and gets killed.

The fork-setting-exec idiom is used very frequently in userland programs.
We must not break this assumption.

Therefore, this patch reverts commit 2ff05b2b and the related commits.

Reverted commit list
---------------------
- commit 2ff05b2b (oom: move oom_adj value from task_struct to mm_struct)
- commit 4d8b9135 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
- commit 81236810 (oom: only oom kill exiting tasks with attached memory)
- commit 933b787b (mm: copy over oom_adj value at fork time)
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0753ba01
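
The regression behind this revert comes down to where the oom_adj value lives. The toy model below is a sketch under assumptions: the `MM` and `Task` classes and the `vfork` helper are hypothetical stand-ins for mm_struct, task_struct and the system call, and the numeric value of `OOM_DISABLE` is assumed. It contrasts a value stored in the shared address space with a value stored per task.

```python
OOM_DISABLE = -17  # assumed value of the "never kill" setting


class MM:
    """Stand-in for a shared address space that carries oom_adj."""
    def __init__(self):
        self.oom_adj = 0


class Task:
    """Stand-in for a task with both a per-task and a per-mm oom_adj."""
    def __init__(self, mm, oom_adj=0):
        self.mm = mm
        self.oom_adj = oom_adj  # per-task value, the behaviour the revert restores


def vfork(parent):
    """The child borrows the parent's address space and copies per-task state."""
    return Task(parent.mm, parent.oom_adj)
```

In this model, writing `child.mm.oom_adj = 0` also changes what the parent sees, because both tasks point at the same `MM`. Writing `child.oom_adj = 0` leaves the parent's setting untouched.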

SLUB: Drop write permission to /proc/slabinfo · cf5d1131

Committed by WANG Cong on Aug 18, 2009

SLUB does not support writes to /proc/slabinfo so there should not be write
permission to do that either.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

cf5d1131

17 Aug, 2009 1 commit

Security/SELinux: seperate lsm specific mmap_min_addr · 788084ab

Committed by Eric Paris on Jul 31, 2009

Currently SELinux enforcement of controls on the ability to map low memory
is determined by the mmap_min_addr tunable.  This patch causes SELinux to
ignore the tunable and instead use a separate Kconfig option specific to how
much space the LSM should protect.

The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
permissions will always protect the amount of low memory designated by
CONFIG_LSM_MMAP_MIN_ADDR.

This allows users who need to disable the mmap_min_addr controls (usual reason
being they run WINE as a non-root user) to do so and still have SELinux
controls preventing confined domains (like a web server) from being able to
map some area of low memory.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>

788084ab

14 Aug, 2009 2 commits

percpu: use the right flag for get_vm_area() · 142d44b0

Committed by Amerigo Wang on Aug 13, 2009

get_vm_area() only accepts VM_* flags, not GFP_*.

According to the documentation of get_vm_area(), the flag here should be
VM_ALLOC.
Signed-off-by: WANG Cong <amwang@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>

142d44b0

percpu, sparc64: fix sparse possible cpu map handling · 74d46d6b

Committed by Tejun Heo on Jul 21, 2009

percpu code has been assuming num_possible_cpus() == nr_cpu_ids which
is incorrect if cpu_possible_map contains holes.  This causes percpu
code to access beyond allocated memory and vmalloc areas.  On a
sparc64 machine with cpus 0 and 2 (u60), this triggers the following
warning or fails boot.

 WARNING: at /devel/tj/os/work/mm/vmalloc.c:106 vmap_page_range_noflush+0x1f0/0x240()
 Modules linked in:
 Call Trace:
  [00000000004b17d0] vmap_page_range_noflush+0x1f0/0x240
  [00000000004b1840] map_vm_area+0x20/0x60
  [00000000004b1950] __vmalloc_area_node+0xd0/0x160
  [0000000000593434] deflate_init+0x14/0xe0
  [0000000000583b94] __crypto_alloc_tfm+0xd4/0x1e0
  [00000000005844f0] crypto_alloc_base+0x50/0xa0
  [000000000058b898] alg_test_comp+0x18/0x80
  [000000000058dad4] alg_test+0x54/0x180
  [000000000058af00] cryptomgr_test+0x40/0x60
  [0000000000473098] kthread+0x58/0x80
  [000000000042b590] kernel_thread+0x30/0x60
  [0000000000472fd0] kthreadd+0xf0/0x160
 ---[ end trace 429b268a213317ba ]---

This patch fixes generic percpu functions and sparc64
setup_per_cpu_areas() so that they handle sparse cpu_possible_map
properly.

Please note that on x86, cpu_possible_map doesn't contain holes and
thus num_possible_cpus() == nr_cpu_ids and this patch doesn't cause
any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>

74d46d6b
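
The mismatch described in this commit is easy to reproduce in a toy model. This is a sketch only: the Python functions below borrow the names `num_possible_cpus` and `nr_cpu_ids` from the commit message, but they operate on a plain set of CPU ids, and `alloc_per_cpu` is a hypothetical helper.

```python
def num_possible_cpus(possible):
    """Number of CPUs present in the possible map."""
    return len(possible)


def nr_cpu_ids(possible):
    """One more than the highest possible CPU id."""
    return max(possible) + 1


def alloc_per_cpu(possible, slots):
    """Build a per-CPU array with `slots` entries, indexed by CPU id.

    Raises IndexError when a CPU id does not fit, which models the
    out-of-bounds access described in the commit message.
    """
    area = [None] * slots
    for cpu in sorted(possible):
        area[cpu] = f"cpu{cpu}"
    return area
```

For the machine in the report, with CPUs 0 and 2, `num_possible_cpus` is 2 but `nr_cpu_ids` is 3, so an array sized by the former cannot be indexed by CPU id 2.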

10 Aug, 2009 1 commit

mempool.c: clean up type-casting · 5e2f89b5

Committed by Figo.zhang on Aug 08, 2009

Clean up the double type-casting.  "size_t" is a typedef for "unsigned long" on
64-bit systems and for "unsigned int" on 32-bit systems, so the intermediate
cast to 'long' is pointless.
Signed-off-by: Figo.zhang <figo1802@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5e2f89b5

08 Aug, 2009 1 commit

mm: make set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware · 4bfc4495

Committed by KAMEZAWA Hiroyuki on Aug 06, 2009

At first, init_task's mems_allowed is initialized like this:
 init_task->mems_allowed == node_state[N_POSSIBLE]

And cpuset's top_cpuset mask is initialized like this:
 top_cpuset->mems_allowed = node_state[N_HIGH_MEMORY]

Before 2.6.29:
The policy's mems_allowed is initialized like this:

  1. update tasks->mems_allowed by its cpuset->mems_allowed.
  2. policy->mems_allowed = nodes_and(tasks->mems_allowed, user's mask)

The task's mems_allowed is updated in reference to top_cpuset's.
cpuset's mems_allowed is always aware of N_HIGH_MEMORY.

In 2.6.30: After commit 58568d2a
("cpuset,mm: update tasks' mems_allowed in time"), the policy's mems_allowed
is initialized like this:

  1. policy->mems_allowed = nodes_and(task->mems_allowed, user's mask)

Here, if the task is in top_cpuset, task->mems_allowed is not updated from
init's.  Assume the user executes a command such as
# numactl --interleave=all ...

  policy->mems_allowed = nodes_and(N_POSSIBLE, ALL_SET_MASK)

Then the policy's mems_allowed can include a possible node that has no pgdat.

MPOL's INTERLEAVE just scans the nodemask of task->mems_allowed and accesses
this directly:

  NODE_DATA(nid)->zonelist even if NODE_DATA(nid)==NULL

So what we need is to make policy->mems_allowed aware of
N_HIGH_MEMORY.  This patch does that.  But to do so, an extra nodemask would
be on the stack.  Because I know cpumask has a new interface,
CPUMASK_ALLOC(), I added the equivalent for nodemask.

This patch keeps to the old behavior.  But I feel this fix itself is just a
Band-Aid.  A fundamental fix would have to take care of memory
hotplug, and that takes time.  (task->mems_allowed should be N_HIGH_MEMORY, I
think.)

mpol_set_nodemask() should be aware of N_HIGH_MEMORY, and the policy's nodemask
should include only online nodes.

In the old behavior, this was guaranteed by frequent references to cpuset's
code.  Now most of them have been removed and mempolicy has to check it by
itself.

To do the check, a few nodemask_t will be used for calculating the nodemask.  But
the size of nodemask_t can be big, and it's not good to allocate them on the stack.

Now, cpumask_t has CPUMASK_ALLOC/FREE as an easy way to get a scratch area.
NODEMASK_ALLOC/FREE should be there too.

[akpm@linux-foundation.org: cleanups & tweaks]
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4bfc4495
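
The nodemask calculation discussed in this commit can be modelled with plain sets. This is a sketch under assumptions: `nodes_and` mirrors the name used in the commit message, while `policy_nodes_2_6_30`, `policy_nodes_fixed` and the example node sets are hypothetical and only illustrate the idea of intersecting with the N_HIGH_MEMORY nodes.

```python
def nodes_and(a, b):
    """Intersection of two node sets, standing in for the kernel's nodes_and()."""
    return a & b


def policy_nodes_2_6_30(task_mems_allowed, user_mask):
    """Model of the 2.6.30 calculation: no N_HIGH_MEMORY filtering."""
    return nodes_and(task_mems_allowed, user_mask)


def policy_nodes_fixed(task_mems_allowed, user_mask, n_high_memory):
    """Model of the fix: additionally restrict the result to nodes with memory."""
    return nodes_and(policy_nodes_2_6_30(task_mems_allowed, user_mask),
                     n_high_memory)
```

If a task in the top cpuset still carries every possible node and the user asks to interleave over all nodes, the first function returns nodes without a pgdat, while the second keeps only nodes that have memory.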
